本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新595篇论文,其中:
- 自然语言处理84篇
- 信息检索19篇
- 计算机视觉113篇
自然语言处理
1. 【2602.10092】Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
链接:https://arxiv.org/abs/2602.10092
作者:Mohamed Afane,Kayla Laufer,Wenqi Wei,Ying Mao,Junaid Farooq,Ying Wang,Juntao Chen
类目:Computation and Language (cs.CL)
关键词:explaining theoretical concepts, quantum computing education, summarizing technical papers, quantum computing, Language models
备注: 18 pages
点击查看摘要
Abstract:Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, their understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
2. 【2602.10090】Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
链接:https://arxiv.org/abs/2602.10090
作者:Zhaoyang Wang,Canwen Xu,Boyi Liu,Yite Wang,Siwei Han,Zhewei Yao,Huaxiu Yao,Yuxiong He
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language model, Recent advances, empowered autonomous agents, perform complex tasks, Agent World Model
备注: 41 pages
点击查看摘要
Abstract:Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at this https URL.
3. 【2602.10081】Anagent For Enhancing Scientific Table Figure Analysis
链接:https://arxiv.org/abs/2602.10081
作者:Xuehang Guo,Zhiyong Lu,Tom Hope,Qingyun Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:complex multimodal knowledge, requires accurately interpreting, accurately interpreting complex, interpreting complex multimodal, drawing inferences grounded
备注:
点击查看摘要
Abstract:In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table \ figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring $63,178$ instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table \ figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to $\uparrow 13.43\%$ in training-free settings and $\uparrow 42.12\%$ with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table \ figure analysis. Our project page: this https URL.
4. 【2602.10074】CAPID: Context-Aware PII Detection for Question-Answering Systems
链接:https://arxiv.org/abs/2602.10074
作者:Mariia Ponomarenko,Sepideh Abedini,Masoumeh Shafieinejad,D.B.Emerson,Shubhankar Mohapatra,Xi He
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:Detecting personally identifiable, Detecting personally, personally identifiable information, PII, question-answering systems
备注: Accepted to the Student Research Workshop at EACL 2026
点击查看摘要
Abstract:Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of them may be contextually relevant to the user's question, resulting in a degradation of response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy while preserving significantly higher downstream utility under anonymization.
5. 【2602.10024】Overview of the TREC 2025 RAGTIME Track
链接:https://arxiv.org/abs/2602.10024
作者:Dawn Lawrie,Sean MacAvaney,James Mayfield,Luca Soldaini,Eugene Yang,Andrew Yates
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:RAG TREC Instrument, RAG TREC, TREC Instrument, English Report Generation, study report generation
备注: 10 pages, 3 figures, notebook version of the RAGTIME 2025 overview paper
点击查看摘要
Abstract:The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.
6. 【2602.10023】MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval
链接:https://arxiv.org/abs/2602.10023
作者:Delvin Ce Zhang,Suhan Cui,Zhelin Chu,Xianren Zhang,Dongwon Lee
类目:Computation and Language (cs.CL)
关键词:Verifying the truthfulness, requires joint multi-modal, claim verification, verification, requires joint
备注: Accepted to EACL-26
点击查看摘要
Abstract:Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both textual caption and chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works mainly focus on the reasoning over textual evidence only or ignore the explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all the datasets are in general domain, we create a scientific dataset, AIChartClaim, in AI domain to complement claim verification community. Experiments show the strength of our model.
7. 【2602.10021】Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
链接:https://arxiv.org/abs/2602.10021
作者:Wenxuan Xie,Yujia Wang,Xin Tan,Chaochao Lu,Xia Hu,Xuhong Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, significant challenge due, remains a significant, Language Models
备注:
点击查看摘要
Abstract:The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at this https URL.
8. 【2602.10017】SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
链接:https://arxiv.org/abs/2602.10017
作者:Homaira Huda Shomee,Rochana Chaturvedi,Yangxinyu Xie,Tanwi Mallick
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, support question answering, decision-critical details, question answering primarily
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
9. 【2602.10003】ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition
链接:https://arxiv.org/abs/2602.10003
作者:Khoa Anh Nguyen,Long Minh Hoang,Nghia Hieu Nguyen,Luan Thanh Nguyen,Ngan Luu-Thuy Nguyen
类目:Computation and Language (cs.CL)
关键词:Automatic Speech Recognition, Vietnamese Automatic Speech, vice versa, Vietnamese ASR, textbf
备注:
点击查看摘要
Abstract:Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (\textbf{Vi}etnamese \textbf{Speech} Trans\textbf{Former}), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
10. 【2602.09992】A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models
链接:https://arxiv.org/abs/2602.09992
作者:Xiulin Yang,Arianna Bisazza,Nathan Schneider,Ethan Gotlieb Wilcox
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:children acquire native-level, acquire native-level syntax, acquire native-level, Stimulus Hypothesis, limited input
备注:
点击查看摘要
Abstract:How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce \poshbench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10--50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence -- yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not \poshbench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.
11. 【2602.09961】ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese
链接:https://arxiv.org/abs/2602.09961
作者:Trung Tien Cao,Lam Minh Thai,Nghia Hieu Nguyen,Duc-Vu Nguyen,Ngan Luu-Thuy Nguyen
类目:Computation and Language (cs.CL)
关键词:Vietnamese reading comprehension, aim to select, set of candidate, Multiple-choice Reading Comprehension, Reading Comprehension
备注:
点击查看摘要
Abstract:Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.
12. 【2602.09953】ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning
链接:https://arxiv.org/abs/2602.09953
作者:Shuaiyi Nie,Siyu Ding,Wenyuan Zhang,Linhao Yu,Tianmeng Yang,Yao Chen,Tingwen Liu,Weichong Yin,Yu Sun,Hua Wu
类目:Computation and Language (cs.CL)
关键词:Large reasoning models, complex reasoning tasks, achieve strong performance, Large reasoning, verifiable rewards
备注: Work in process
点击查看摘要
Abstract:Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, We then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
13. 【2602.09924】LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
链接:https://arxiv.org/abs/2602.09924
作者:William Lugoloobi,Thomas Foster,William Bankes,Chris Russell
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:compute remains challenging, require additional compute, additional compute remains, Running LLMs, remains challenging
备注:
点击查看摘要
Abstract:Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70\% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: this https URL
14. 【2602.09914】AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
链接:https://arxiv.org/abs/2602.09914
作者:Tilahun Yeshambel,Moncef Garouani,Josiane Mothe
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:high-quality supervised data, GPT-style generative models, generative models rely, Amharic data resource, rely on large
备注: 7 pages, Submitted to resource track
点击查看摘要
Abstract:Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV,JSON,JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
15. 【2602.09901】QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search
链接:https://arxiv.org/abs/2602.09901
作者:Jianzhao Huang,Xiaorui Huang,Fei Zhao,Yunpeng Liu,Hui Zhang,Fangcheng Shi,Congfeng Li,Zechen Sun,Yi Wu,Yao Hu,Yunhan Bai,Shaosheng Cao
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Social Network Service, large-scale Social Network, Network Service, Social Network, large-scale Social
备注:
点击查看摘要
Abstract:Query Processing (QP) bridges user intent and content supply in large-scale Social Network Service (SNS) search engines. Traditional QP systems rely on pipelines of isolated discriminative models (e.g., BERT), suffering from limited semantic understanding and high maintenance overhead. While Large Language Models (LLMs) offer a potential solution, existing approaches often optimize sub-tasks in isolation, neglecting intrinsic semantic synergy and necessitating independent iterations. Moreover, standard generative methods often lack grounding in SNS scenarios, failing to bridge the gap between open-domain corpora and informal SNS linguistic patterns, while struggling to adhere to rigorous business definitions. We present QP-OneModel, a Unified Generative LLM for Multi-Task Query Understanding in the SNS domain. We reformulate heterogeneous sub-tasks into a unified sequence generation paradigm, adopting a progressive three-stage alignment strategy culminating in multi-reward Reinforcement Learning. Furthermore, QP-OneModel generates intent descriptions as a novel high-fidelity semantic signal, effectively augmenting downstream tasks such as query rewriting and ranking. Offline evaluations show QP-OneModel achieves a 7.35% overall gain over discriminative baselines, with significant F1 boosts in NER (+9.01%) and Term Weighting (+9.31%). It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks. Fully deployed at Xiaohongshu, online A/B tests confirm its industrial value, optimizing retrieval relevance (DCG) by 0.21% and lifting user retention by 0.044%.
16. 【2602.09877】he Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
链接:https://arxiv.org/abs/2602.09877
作者:Chenxu Wang,Chaozhuo Li,Songyang Liu,Zejian Chen,Jinyu Hou,Ji Qi,Rui Li,Litian Zhang,Qiwei Ye,Zheng Liu,Xu Chen,Xi Zhang,Philip S. Yu
类目:Computation and Language (cs.CL)
关键词:large language models, scalable collective intelligence, multi-agent systems built, language models, offers a promising
备注:
点击查看摘要
Abstract:The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment--a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the divergence degree from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on the self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
17. 【2602.09870】Steer2Edit: From Activation Steering to Component-Level Editing
链接:https://arxiv.org/abs/2602.09870
作者:Chung-En Sun,Ge Yan,Zimo Wang,Tsui-Wei Weng
类目:Computation and Language (cs.CL)
关键词:Large Language Model, influence Large Language, Large Language, Language Model behavior, methods influence Large
备注:
点击查看摘要
Abstract:Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
18. 【2602.09866】SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
链接:https://arxiv.org/abs/2602.09866
作者:Johan Sofalas,Dilushri Pavithra,Nevidu Jayatilleke,Ruvan Weerasinghe
类目:Computation and Language (cs.CL)
关键词:Figures of Speech, Neural Machine Translation, consist of multi-word, intertwined with culture, multi-word phrases
备注: 19 pages, 6 figures, 8 tables, Accepted paper at the 22nd Workshop on Multiword Expressions (MWE 2026) @ EACL 2026
点击查看摘要
Abstract:Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FOS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
19. 【2602.09856】Code2World: A GUI World Model via Renderable Code Generation
链接:https://arxiv.org/abs/2602.09856
作者:Yuhao Zheng,Li'an Zhong,Yi Wang,Rui Dai,Kaikui Liu,Xiangxiang Chu,Linyuan Lv,Philip Torr,Kevin Qinghong Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Autonomous GUI agents, GUI agents interact, Autonomous GUI, GUI World model, interact with environments
备注: github: [this https URL](https://github.com/AMAP-ML/Code2World) project page: [this https URL](https://amap-ml.github.io/Code2World/)
点击查看摘要
Abstract:Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
20. 【2602.09838】How Do People Quantify Naturally: Evidence from Mandarin Picture Description
链接:https://arxiv.org/abs/2602.09838
作者:Yayun Zhang,Guanyi Chen,Fahime Same,Saad Mahamood,Tingting He
类目:Computation and Language (cs.CL)
关键词:fundamental component, component of everyday, Mandarin Chinese, Quantification, speakers decide
备注:
点击查看摘要
Abstract:Quantification is a fundamental component of everyday language use, yet little is known about how speakers decide whether and how to quantify in naturalistic production. We investigate quantification in Mandarin Chinese using a picture-based elicited description task in which speakers freely described scenes containing multiple objects, without explicit instructions to count or quantify. Across both spoken and written modalities, we examine three aspects of quantification: whether speakers choose to quantify at all, how precise their quantification is, and which quantificational strategies they adopt. Results show that object numerosity, animacy, and production modality systematically shape quantificational behaviour. In particular, increasing numerosity reduces both the likelihood and the precision of quantification, while animate referents and modality selectively modulate strategy choice. This study demonstrates how quantification can be examined under unconstrained production conditions and provides a naturalistic dataset for further analyses of quantity expression in language production.
21. 【2602.09832】LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
链接:https://arxiv.org/abs/2602.09832
作者:Bakhtawar Ahtisham,Kirk Vanacore,Zhuqian Zhou,Jinsook Lee,Rene F. Kizilcec
类目:Computation and Language (cs.CL)
关键词:Large Language Models, current pipelines lack, Large Language, pipelines lack reliable, increasingly deployed
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
22. 【2602.09826】From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
链接:https://arxiv.org/abs/2602.09826
作者:Abdulmuizz Khalak,Abderrahmane Issam,Gerasimos Spanakis
类目:Computation and Language (cs.CL)
关键词:Modern Standard Arabic, Modern Standard, predominately on Modern, Natural Language Processing, Arabic
备注: Accepted to VarDial 2026
点击查看摘要
Abstract:Arabic Language Models (LMs) are pretrained predominately on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on 3 Natural Language Processing (NLP) Tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
23. 【2602.09823】Covo-Audio Technical Report
链接:https://arxiv.org/abs/2602.09823
作者:Wenfu Wang,Chenxing Li,Liqiang Zhang,Yiyang Zhao,Yuxiang Zou,Hanzhao Li,Mingyu Cui,Hao Zhang,Kun Wei,Le Xu,Zikang Huang,Jiajun Xu,Jiliang Hu,Xiang He,Zeyu Xie,Jiawen Kang,Youjun Chen,Meng Yu,Dong Yu,Rilin Chen,Linlin Di,Shulin Feng,Na Hu,Yang Liu,Bang Wang,Shan Yang
类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:single unified architecture, directly processes continuous, processes continuous audio, continuous audio inputs, generates audio outputs
备注: Technical Report
点击查看摘要
Abstract:In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
24. 【2602.09821】xt summarization via global structure awareness
链接:https://arxiv.org/abs/2602.09821
作者:Jiaquan Zhang,Chaoning Zhang,Shuxu Chen,Yibei Liu,Chenghao Li,Qigan Sun,Shuai Yuan,Fachrina Dewi Puspitasari,Dongshen Han,Guoqing Wang,Sung-Ho Bae,Yang Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:processing increasingly demanding, made long-document processing, long-document processing increasingly, increasingly demanding, information explosion
备注: 24pages
点击查看摘要
Abstract:Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a ``protection pool'' as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
25. 【2602.09817】AnalyticsGPT: An LLM Workflow for Scientometric Question Answering
链接:https://arxiv.org/abs/2602.09817
作者:Khang Ly,Georgios Cheirmpos,Adrian Raudaschl,Christopher James,Seyed Amin Tabatabaei
类目:Computation and Language (cs.CL); Digital Libraries (cs.DL)
关键词:paper introduces AnalyticsGPT, large language model, efficient large language, introduces AnalyticsGPT, intuitive and efficient
备注:
点击查看摘要
Abstract:This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase. Namely, the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g. impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: this https URL.
26. 【2602.09805】Decomposing Reasoning Efficiency in Large Language Models
链接:https://arxiv.org/abs/2602.09805
作者:Daniel Kaiser,Arnoldo Frigessi,Ali Ramezani-Kebrya,Benjamin Ricaud
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, standard evaluations report, Large language, language models trained, spent or wasted
备注: Preprint (under review). 29 pages, 4 figures
点击查看摘要
Abstract:Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $\rho=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
27. 【2602.09802】Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices
链接:https://arxiv.org/abs/2602.09802
作者:Manon Reusens,Sofie Goethals,Toon Calders,David Martens
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, correct answer exists, objectively correct answer, answer exists
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
28. 【2602.09785】Where Are We At with Automatic Speech Recognition for the Bambara Language?
链接:https://arxiv.org/abs/2602.09785
作者:Seydou Diallo,Yacouba Diarra,Mamadou K. Keita,Panga Azazia Kamaté,Adam Bouno Kampo,Aboubacar Ouattara
类目:Computation and Language (cs.CL)
关键词:Automatic Speech Recognition, Malian constitutional text, professionally recorded Malian, recorded Malian constitutional, evaluating Automatic Speech
备注: v1- 8 pages, 5 tables, 1 figure- AfricaNLP Workshop @ EACL 2026
点击查看摘要
Abstract:This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards in a narrow formal domain; the top-performing system in terms of Word Error Rate (WER) achieved 46.76\% and the best Character Error Rate (CER) of 13.00\% was set by another model, while several prominent multilingual models exceeded 100\% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures are yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
29. 【2602.09784】Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path
链接:https://arxiv.org/abs/2602.09784
作者:Andres Saurez,Neha Sengar,Dongsoo Har
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:separate research threads, research threads, representational space, developed as separate, separate research
备注: Submitted to ICML 2026. 15 pages, 11 figures
点击查看摘要
Abstract:Circuit discovery and activation steering in transformers have developed as separate research threads, yet both operate on the same representational space. Are they two views of the same underlying structure? We show they follow a single geometric principle: answer tokens, processed in isolation, encode the directions that would produce them. This Circuit Fingerprint hypothesis enables circuit discovery without gradients or causal intervention -- recovering comparable structure to gradient-based methods through geometric alignment alone. We validate this on standard benchmarks (IOI, SVA, MCQA) across four model families, achieving circuit discovery performance comparable to gradient-based methods. The same directions that identify circuit components also enable controlled steering -- achieving 69.8\% emotion classification accuracy versus 53.1\% for instruction prompting while preserving factual accuracy. Beyond method development, this read-write duality reveals that transformer circuits are fundamentally geometric structures: interpretability and controllability are two facets of the same object.
30. 【2602.09783】Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
链接:https://arxiv.org/abs/2602.09783
作者:Andres Saurez,Yousung Lee,Dongsoo Har
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:consistently recover meaningful, recover meaningful structure, autoencoders consistently recover, simple methods succeed, nonlinear systems
备注: Submitted to ICML 2026. 19 pages, 13 figures
点击查看摘要
Abstract:Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations -- yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the \emph{Invariant Subspace Necessity} theorem and derive the \emph{Self-Reference Property}: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation in eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides \textbf{a principled architectural explanation} for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
31. 【2602.09782】Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
链接:https://arxiv.org/abs/2602.09782
作者:Kun Chen,Peng Shi,Fanfan Liu,Haibo Qiu,Zhixiong Zeng,Siqi Yang,Wenji Mao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Verifiable Rewards, Language Models, Large Language, Reinforcement Learning
备注: [this https URL](https://github.com/Kwen-Chen/Flexible-Entropy-Control)
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
32. 【2602.09760】Improving Interpretability of Lexical Semantic Change with Neurobiological Features
链接:https://arxiv.org/abs/2602.09760
作者:Kohei Oda,Hiroya Takamura,Kiyoaki Shirai,Natthawut Kertkeidkachorn
类目:Computation and Language (cs.CL)
关键词:Lexical Semantic Change, Lexical Semantic, LSC, word change, Semantic Change
备注: PACLIC 2025
点击查看摘要
Abstract:Lexical Semantic Change (LSC) is the phenomenon in which the meaning of a word change over time. Most studies on LSC focus on improving the performance of estimating the degree of LSC, however, it is often difficult to interpret how the meaning of a word change. Enhancing the interpretability of LSC is a significant challenge as it could lead to novel insights in this field. To tackle this challenge, we propose a method to map the semantic space of contextualized embeddings of words obtained by a pre-trained language model to a neurobiological feature space. In the neurobiological feature space, each dimension corresponds to a primitive feature of words, and its value represents the intensity of that feature. This enables humans to interpret LSC systematically. When employed for the estimation of the degree of LSC, our method demonstrates superior performance in comparison to the majority of the previous methods. In addition, given the high interpretability of the proposed method, several analyses on LSC are carried out. The results demonstrate that our method not only discovers interesting types of LSC that have been overlooked in previous studies but also effectively searches for words with specific types of LSC.
33. 【2602.09724】argum -- A Multilingual New Testament Translation Corpus
链接:https://arxiv.org/abs/2602.09724
作者:Maciej Rapacz,Aleksander Smywiński-Pohl
类目:Computation and Language (cs.CL)
关键词:European languages possess, prioritizing linguistic breadth, languages possess rich, possess rich biblical, European languages
备注:
点击查看摘要
Abstract:Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
34. 【2602.09723】AI-Assisted Scientific Assessment: A Case Study on Climate Change
链接:https://arxiv.org/abs/2602.09723
作者:Christian Buck,Levke Caesar,Michelle Chen Huebscher,Massimiliano Ciaramita,Erich M. Fischer,Zeke Hausfather,Özge Kart Tokmak,Reto Knutti,Markus Leippold,Joseph Ludescher,Katharine J. Mach,Sofia Palazzo Corner,Kasra Rafiezadeh Shahi,Johan Rockström,Joeri Rogelj,Boris Sakschewski
类目:Computation and Language (cs.CL)
关键词:agents explore search, explore search spaces, guess and check, repeatable verification, co-scientists focuses
备注:
点击查看摘要
Abstract:The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.
35. 【2602.09719】Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs
链接:https://arxiv.org/abs/2602.09719
作者:Longhuan Xu,Cunjian Chen,Feng Yin
类目:Computation and Language (cs.CL)
关键词:large language models, inference time, time using signals, TTA, language models
备注:
点击查看摘要
Abstract:Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
36. 【2602.09712】raceMem: Weaving Narrative Memory Schemata from User Conversational Traces
链接:https://arxiv.org/abs/2602.09712
作者:Yiming Shu,Pei Liu,Tiange Zhang,Ruiyang Gao,Jun Ma,Chen Sun
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Sustaining long-term interactions, Language Models, Large Language, limited context windows
备注:
点击查看摘要
Abstract:Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: this https URL
37. 【2602.09703】Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding
链接:https://arxiv.org/abs/2602.09703
作者:Abdulhai Alali,Abderrahmane Issam
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, increasingly multilingual, supporting hundreds
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high resource ones. Unfortunately, Dialect variations are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low Rank Adaptation (LoRA) fine-tuning on monolingual and English Dialect parallel data, adapter merging and dialect-aware MBR decoding to improve dialectal fidelity generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
38. 【2602.09691】Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs
链接:https://arxiv.org/abs/2602.09691
作者:Joseph Attieh,Timothee Mickus,Anne-Laure Ligozat,Aurélie Névéol,Jörg Tiedemann
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:larger system, Knowledge distillation, compress a larger, Knowledge, translation quality
备注:
点击查看摘要
Abstract:Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
39. 【2602.09642】MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
链接:https://arxiv.org/abs/2602.09642
作者:Sieun Hyeon,Jusang Oh,Sunghwan Steve Cho,Jaeyoung Do
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Table Question Answering, Recent advances, Large Language, advances in Large
备注:
点击查看摘要
Abstract:Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at this https URL.
40. 【2602.09624】MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation
链接:https://arxiv.org/abs/2602.09624
作者:Nalin Srun(UL, CNRS, LORIA),Parisa Rastin(UL, CNRS, LORIA),Guénaël Cabanes(UL, CNRS, LORIA),Lydia Boudjeloud Assala(UL, CNRS, LORIA)
类目:Computation and Language (cs.CL)
关键词:Large Language Models, evaluating Large Language, Language Models, Large Language, evaluating Large
备注:
点击查看摘要
Abstract:We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgement. With task-specific prompts from best candidate selection, summarization and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
41. 【2602.09621】AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
链接:https://arxiv.org/abs/2602.09621
作者:R E Zera Marveen Lyngkhoi,Chirag Chawla,Pratinav Seth,Utsav Avaiya,Soham Bhattacharjee,Mykola Khandoga,Rui Yuan,Vinay Kumar Sankarapu
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, ad-hoc glue code, deploying large language, practical workflows remain, workflows remain split
备注: [this https URL](https://github.com/Lexsi-Labs/aligntune)
点击查看摘要
Abstract:Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
42. 【2602.09598】Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning
链接:https://arxiv.org/abs/2602.09598
作者:Qiao Liang,Yuke Zhu,Chao Ge,Lei Yang,Ying Shen,Bo Zheng,Sheng Guo
类目:Computation and Language (cs.CL)
关键词:enables LLM agents, step-level credit assignment, weak step-level credit, Tool-integrated reasoning, outcome-only reinforcement learning
备注: 20 pages, 11 figures
点击查看摘要
Abstract:Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
43. 【2602.09591】On the Optimal Reasoning Length for RL-Trained Language Models
链接:https://arxiv.org/abs/2602.09591
作者:Daisuke Nohara,Taishi Nakamura,Rio Yokota
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Reinforcement learning substantially, Reinforcement learning, learning substantially improves, large language models, increase computational cost
备注: 15 pages, 10 figures. Submitted to the Workshop on Scaling Post-training for LLMs (SPOT) at ICLR 2026
点击查看摘要
Abstract:Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain of thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL trained policies, we identify two failure modes, 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
44. 【2602.09590】Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models
链接:https://arxiv.org/abs/2602.09590
作者:Shweta Parihar,Liu Guangliang,Natalie Parde,Lu Cheng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:fine-tuned language models, harm downstream performance, language modeling capability, challenge in mitigating, potential reduction
备注:
点击查看摘要
Abstract:A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue by generating synthetic data that may align poorly with real-world distributions or creating overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.
45. 【2602.09574】Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs
链接:https://arxiv.org/abs/2602.09574
作者:Sora Miyamoto,Daisuke Oba,Naoaki Okazaki
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, real-world deployment imposes, fixed per-query token, language models, varies across settings
备注:
点击查看摘要
Abstract:Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose {Budget-Guided MCTS} (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.
46. 【2602.09570】LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
链接:https://arxiv.org/abs/2602.09570
作者:Narges Baba Ahmadi,Jan Strich,Martin Semmann,Chris Biemann
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:access legal information, Large language models, Large language, Large, legal information
备注: Accepted at EACL SRW 26
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code\footnote{\href{this https URL}{GitHub Repository}} and data\footnote{\href{this https URL}{Hugging Face Dataset}}.
47. 【2602.09555】Advancing Block Diffusion Language Models for Test-Time Scaling
链接:https://arxiv.org/abs/2602.09555
作者:Yi Lu,Deyang Kong,Jianing Wang,Linsen Guo,Xue Wang,Qi Guo,Tao Gui,Xuanjing Huang,Wei Ye,Shikun Zhang,Wei Wang
类目:Computation and Language (cs.CL)
关键词:Recent advances, block diffusion language, diffusion language models, demonstrated competitive performance, diffusion language
备注:
点击查看摘要
Abstract:Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
48. 【2602.09552】Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA
链接:https://arxiv.org/abs/2602.09552
作者:Klejda Alushi,Jan Strich,Chris Biemann,Martin Semmann
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:large language models, question answering increasingly, answering increasingly relies, ground large language, Conversational question answering
备注: Accepted to EACL SRW 26
点击查看摘要
Abstract:Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used.\footnote{\href{this https URL}{GitHub Repository}}
49. 【2602.09538】UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment
链接:https://arxiv.org/abs/2602.09538
作者:Hongyan Xie,Yikun Ban,Ruiyu Fang,Zixuan Huang,Deqing Wang,Jianxin Li,Yitong Yao,Chao Wang,Shuangyong Song
类目:Computation and Language (cs.CL)
关键词:align LLM responses, Multi-objective alignment aims, multi-objective test-time alignment, multiple human preference, human preference objectives
备注: Under Review
点击查看摘要
Abstract:Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated \ Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective. es on larger-scale LLMs, enhancing its practical usability.
50. 【2602.09517】Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models
链接:https://arxiv.org/abs/2602.09517
作者:Sangwon Yu,Ik-hwan Kim,Donghun Kang,Bongkyu Hwang,Junhwa Choi,Suk-hoon Jung,Seungki Hong,Taehee Lee,Sungroh Yoon
类目:Computation and Language (cs.CL)
关键词:Modern Large Language, Large Language Models, Modern Large, Large Language, demonstrated remarkable capabilities
备注:
点击查看摘要
Abstract:Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
51. 【2602.09516】he CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking
链接:https://arxiv.org/abs/2602.09516
作者:Julia Maria Struß,Sebastian Schellhammer,Stefan Dietze,Venktesh V,Vinay Setty,Tanmoy Chakraborty,Preslav Nakov,Avishek Anand,Primakov Chungkham,Salim Hafid,Dhruv Sahnan,Konstantin Todorov
类目:Computation and Language (cs.CL)
关键词:verification pipeline, verification, tasks, Task, innovative technologies combating
备注: misinformation, disinformation, fact-checking, claim source retrieval, generating fact-checking articles
点击查看摘要
Abstract:The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional tasks linked to the verification process. In this year's edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with generation of full-fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.
52. 【2602.09514】EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
链接:https://arxiv.org/abs/2602.09514
作者:Xavier Hu,Jinxiang Xia,Shengze Xu,Kangqi Song,Yishuo Yuan,Guibin Zhang,Jincheng Ren,Boyu Feng,Li Lu,Tieyong Zeng,Jiaheng Liu,Minghao Liu,Yuchen Elenor Jiang,Wei Wang,He Zhu,Wangchunshu Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:current evaluation frameworks, persistent economic dynamics, evaluation frameworks suffer, autonomous LLM-based agents, largely episodic
备注: work in progress
点击查看摘要
Abstract:Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient actions executions. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
53. 【2602.09501】Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models
链接:https://arxiv.org/abs/2602.09501
作者:Hikaru Asano,Tadashi Kozuno,Kuniaki Saito,Yukino Baba
类目:Computation and Language (cs.CL)
关键词:Diffusion Language Models, Masked Diffusion Language, Diffusion Language, Language Models, iteratively filling masked
备注: 15 pages, 6 figures
点击查看摘要
Abstract:Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
54. 【2602.09486】Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement
链接:https://arxiv.org/abs/2602.09486
作者:Koduvayur Subbalakshmi,Sabbir Hossain Ujjal,Venkata Krishna Teja Mangichetty,Nastaran Jamalipour Soofi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Pretrained Large Language, Large Language Models, Pretrained Large, Large Language, incorrect text-a phenomenon
备注: Preprint, 23 pages, 13 tables, 12 figures
点击查看摘要
Abstract:Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text-a phenomenon known as hallucinations, undermining their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use it to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization and code generation demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
55. 【2602.09469】NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts
链接:https://arxiv.org/abs/2602.09469
作者:Huu-Huy-Hoang Tran,Gia-Bao Duong,Quoc-Viet-Anh Tran,Thi-Hai-Yen Vuong,Hoang-Quynh Le
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Electronic Health Records, Natural Language Processing, unstructured Electronic Health, Health Records remains, clinical Natural Language
备注:
点击查看摘要
Abstract:Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models demonstrate advancements, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 - ToxNER and Subtask 2 - ToxUse. Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
56. 【2602.09464】AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
链接:https://arxiv.org/abs/2602.09464
作者:Haoyu Zhao,Ziran Yang,Jiawei Li,Deyuan He,Zenan Li,Chi Jin,Venugopal V. Veeravalli,Aarti Gupta,Sanjeev Arora
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:formally verified code, rigorous specifications, generation of formally, formally verified, Vericoding refers
备注: 32 pages
点击查看摘要
Abstract:Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at this https URL.
57. 【2602.09447】SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
链接:https://arxiv.org/abs/2602.09447
作者:Zhirui Zhang,Hongbo Zhang,Haoxiang Fei,Zhiyuan Bao,Yubin Chen,Zhengyu Lei,Ziyue Liu,Yixuan Sun,Mingkun Xiao,Zihang Ye,Yu Zhang,Hongcheng Zhu,Yuxiang Wen,Heung-Yeung Shum
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:impressive coding capabilities, demonstrated impressive coding, large language models, coding capabilities, open question
备注: 20 pages, 3 figures
点击查看摘要
Abstract:Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
58. 【2602.09444】Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
链接:https://arxiv.org/abs/2602.09444
作者:Takumi Ohashi,Hitoshi Iyatomi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Conceptual Cultural Index, level remains underexplored, sentence level remains, Large language
备注: 9 pages, 2 figures, 8 tables. Accepted at the First Workshop on Multilingual Multicultural Evaluation (MME) @ EACL 2026
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at this https URL .
59. 【2602.09442】Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts
链接:https://arxiv.org/abs/2602.09442
作者:Shweta Parihar,Lu Cheng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Social biases inherent, raise significant fairness, significant fairness concerns, large language models, raise significant
备注: Accepted as a full paper with an oral presentation at the 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2026)
点击查看摘要
Abstract:Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model's outputs. To better understand this phenomenon, we then explore the model's reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model's CoT. Our experiments reveal that the model's bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.
60. 【2602.09438】Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency
链接:https://arxiv.org/abs/2602.09438
作者:Taewoong Yoon,Geunyeong Jeong,Geon Park,Sihyeong Yeom,Harksoo Kim
类目:Computation and Language (cs.CL)
关键词:Large Language Models, effective decoding strategy, Large Language, Language Models, reasoning performance
备注:
点击查看摘要
Abstract:Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated when applying to each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.
61. 【2602.09416】Are Language Models Sensitive to Morally Irrelevant Distractors?
链接:https://arxiv.org/abs/2602.09416
作者:Andrew Shaw,Christina Hahn,Catherine Rasgaitis,Yash Mishra,Alisa Liu,Natasha Jaques,Yulia Tsvetkov,Amy X. Zhang
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:large language models, moral, language models, high-stakes settings, rapid development
备注:
点击查看摘要
Abstract:With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
62. 【2602.09388】Effective vocabulary expanding of multilingual language models for extremely low-resource languages
链接:https://arxiv.org/abs/2602.09388
作者:Jianyu Zheng
类目:Computation and Language (cs.CL)
关键词:Multilingual pre-trained language, offer significant benefits, Multilingual pre-trained, pre-trained language models, offer significant
备注: 12 pages, 5 figures, 7 tables, under review
点击查看摘要
Abstract:Multilingual pre-trained language models(mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source language(e.g. English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of these expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements by 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models' performance on the source language does not degrade after continued pre-training.
63. 【2602.09384】Contractual Deepfakes: Can Large Language Models Generate Contracts?
链接:https://arxiv.org/abs/2602.09384
作者:Eliza Mik
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Notwithstanding their unprecedented, unprecedented ability, understand the meaning, sense of context, generate text
备注: Accepted for publication
点击查看摘要
Abstract:Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.
64. 【2602.09383】BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
链接:https://arxiv.org/abs/2602.09383
作者:Peng Lai,Zhihao Ou,Yong Wang,Longyue Wang,Jian Yang,Yun Chen,Guanhua Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
关键词:practical applications, critical issue, widely adopted, research and practical, remain a critical
备注: Accepted to ICLR 2026
点击查看摘要
Abstract:LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, a LLM-driven framework for automatically and at scale discovering potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active and comprehensive automated exploration. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50\% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.
65. 【2602.09379】LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
链接:https://arxiv.org/abs/2602.09379
作者:Shihao Xu,Tiancheng Zhou,Jiatong Ma,Yanli Ding,Yiming Yan,Ming Xiao,Guoyi Li,Haiyang Geng,Yunyun Han,Jianhua Chen,Yafeng Deng
类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL)
关键词:highly prevalent worldwide, consistent mental-health assessment, create substantial barriers, Mental disorders, interview-based diagnosis create
备注:
点击查看摘要
Abstract:Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at this https URL.
66. 【2602.09373】AfriNLLB: Efficient Translation Models for African Languages
链接:https://arxiv.org/abs/2602.09373
作者:Yasmin Moslem,Aman Kassahun Wassie,Amanuel Gizachew Abebe
类目:Computation and Language (cs.CL)
关键词:African Union official, African languages, Egyptian Arabic, African, series of lightweight
备注: Accepted at AfricaNLP 2026 (oral)
点击查看摘要
Abstract:In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.
Comments:
Accepted at AfricaNLP 2026 (oral)
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.09373 [cs.CL]
(or
arXiv:2602.09373v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.09373
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Journalreference:
Proceedings of the Seventh Workshop on African Natural Language Processing (AfricaNLP 2026)
67. 【2602.09372】AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis
链接:https://arxiv.org/abs/2602.09372
作者:Zexu Sun,Bokai Ji,Hengyi Cai,Shuaiqiang Wang,Lei Wang,Guangxia Li,Xu Chen
类目:Computation and Language (cs.CL)
关键词:Large Language Model, Large Language, Language Model agents, solving real-world problems, agents demonstrate potential
备注: 33 pages, 9 figures
点击查看摘要
Abstract:Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized $\approx$ 11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.
68. 【2602.09366】Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
链接:https://arxiv.org/abs/2602.09366
作者:Jianyu Zheng
类目:Computation and Language (cs.CL)
关键词:POS, typically adopt unsupervised, adopt unsupervised approaches, languages typically adopt, annotated data
备注: 16 pages, 6 figures, 7 tables, under review
点击查看摘要
Abstract:Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, POS tag projection with word alignment method transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource language settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech(POS) tagging framework that relies solely on monolingual corpora by leveraging unsupervised neural machine translation(UNMT) system. This UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, enhancing to train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnis, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method can achieve performance comparable to the baseline cross-lingual POS tagger with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
69. 【2602.09346】Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs
链接:https://arxiv.org/abs/2602.09346
作者:Yoshifumi Kawasaki
类目:Computation and Language (cs.CL)
关键词:exhibits substantial regional, Large Language Models, substantial regional variation, Large Language, study examines
备注:
点击查看摘要
Abstract:This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.
70. 【2602.09343】Not-in-Perspective: Towards Shielding Google's Perspective API Against Adversarial Negation Attacks
链接:https://arxiv.org/abs/2602.09343
作者:Michail S. Alexiou,J. Sukarno Mertoguno
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:moderate online interactions, social media platforms, media platforms involving, platforms involving toxic, involving toxic comments
备注:
点击查看摘要
Abstract:The rise of cyberbullying in social media platforms involving toxic comments has escalated the need for effective ways to monitor and moderate online interactions. Existing solutions of automated toxicity detection systems, are based on a machine or deep learning algorithms. However, statistics-based solutions are generally prone to adversarial attacks that contain logic based modifications such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviating the negation attack problems and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine-learning) methods against various purely statistical solutions.
71. 【2602.09339】Understanding Risk and Dependency in AI Chatbot Use from User Discourse
链接:https://arxiv.org/abs/2602.09339
作者:Jianfeng Zhu,Karin G. Coifman,Ruoming Jin
类目:Computation and Language (cs.CL)
关键词:Generative AI systems, users remains limited, everyday life, remains limited, systems are increasingly
备注: 21 pages, 5 figures
点击查看摘要
Abstract:Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large-scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI-related harm and distress. Using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke's reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher-order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT-based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI-related psychological risk grounded in real-world user discourse, with self-regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.
72. 【2602.09336】FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding
链接:https://arxiv.org/abs/2602.09336
作者:Siyuan Huang,Ziyu Wang,Chao Pan,Han Zhao
类目:Computation and Language (cs.CL)
关键词:Standard Operating Procedures, Standard Operating, Operating Procedures, existing language models, language models struggle
备注:
点击查看摘要
Abstract:Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3\% pass rate with our 32B model and 34.3\% with our opensource 7B model, matching Qwen-2.5-72B-Instruct baseline (34.4\%) with 10x fewer parameters.
73. 【2602.09331】Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization
链接:https://arxiv.org/abs/2602.09331
作者:Mykola Khandoga,Rui Yuan,Vinay Kumar Sankarapu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:GRPO and DAPO, assign uniform credit, policy gradient updates, Policy gradient, Policy gradient methods
备注: 12 pages, 1 figure
点击查看摘要
Abstract:Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation, instead importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
74. 【2602.09312】Don't Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention
链接:https://arxiv.org/abs/2602.09312
作者:Shu-Ting Pi,Pradeep Bagavan,Yejia Li,Disha,Qun Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Utilizing Large Language, Utilizing Large, diverse business scenarios, Large Language Models, Large Language
备注: EMNLP 2024: Industry Track; 8 pages, 2 figures, 1 table
点击查看摘要
Abstract:Utilizing Large Language Models (LLM) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model's ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.
75. 【2602.09289】riggered: A Statistical Analysis of Environmental Influences on Extremist Groups
链接:https://arxiv.org/abs/2602.09289
作者:Christine de Kock,Eduard Hovy
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
关键词:Online extremist communities, Online extremist, extremist communities operate, shaped by real-world, wider information ecosystem
备注:
点击查看摘要
Abstract:Online extremist communities operate within a wider information ecosystem shaped by real-world events, news coverage, and cross-community interaction. We adopt a systems perspective to examine these influences using seven years of data from two ideologically distinct extremist forums (Stormfront and Incels) and a mainstream reference community (r/News). We ask three questions: how extremist violence impacts community behaviour; whether news coverage of political entities predicts shifts in conversation dynamics; and whether linguistic diffusion occurs between mainstream and extremist spaces and across extremist ideologies. Methodologically, we combine counterfactual synthesis to estimate event-level impacts with vector autoregression and Granger causality analyses to model ongoing relationships among news signals, behavioural outcomes, and cross-community language change. Across analyses, our results indicate that Stormfront and r/News appear to be more reactive to external stimuli, while Incels demonstrates less cross-community linguistic influence and less responsiveness to news and violent events. These findings underscore that extremist communities are not homogeneous, but differ in how tightly they are coupled to the surrounding information ecosystem.
76. 【2602.09276】Effective Reasoning Chains Reduce Intrinsic Dimensionality
链接:https://arxiv.org/abs/2602.09276
作者:Archiki Prasad,Mandar Joshi,Kenton Lee,Mohit Bansal,Peter Shaw
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:remain poorly understood, generalization remain poorly, poorly understood, intrinsic dimensionality, variants have substantially
备注: 20 pages, 3 figures
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
77. 【2602.09269】Measuring Inclusion in Interaction: Inclusion Analytics for Human-AI Collaborative Learning
链接:https://arxiv.org/abs/2602.09269
作者:Jaeyoon Choi,Nia Nixon
类目:Computation and Language (cs.CL)
关键词:coarse sample descriptors, collaborative problem solving, problem solving, shaped moment, access are widely
备注:
点击查看摘要
Abstract:Inclusion, equity, and access are widely valued in AI and education, yet are often assessed through coarse sample descriptors or post-hoc self-reports that miss how inclusion is shaped moment by moment in collaborative problem solving (CPS). In this proof-of-concept paper, we introduce inclusion analytics, a discourse-based framework for examining inclusion as a dynamic, interactional process in CPS. We conceptualize inclusion along three complementary dimensions -- participation equity, affective climate, and epistemic equity -- and demonstrate how these constructs can be made analytically visible using scalable, interaction-level measures. Using both simulated conversations and empirical data from human-AI teaming experiments, we illustrate how inclusion analytics can surface patterns of participation, relational dynamics, and idea uptake that remain invisible to aggregate or post-hoc evaluations. This work represents an initial step toward process-oriented approaches to measuring inclusion in human-AI collaborative learning environments.
78. 【2602.09163】FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
链接:https://arxiv.org/abs/2602.09163
作者:Xingjian Zhang,Sophia Moylan,Ziyang Xiong,Qiaozhu Mei,Yichen Luo,Jiaqi W. Ma
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:bases accelerate discovery, queryable formats, emerging AI systems, accelerate discovery, discovery by curating
备注:
点击查看摘要
Abstract:Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
79. 【2602.09147】Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection
链接:https://arxiv.org/abs/2602.09147
作者:Janek Bevendorff,Maik Fröbe,André Greiner-Petter,Andreas Jakoby,Maximilian Mayerl,Preslav Nakov,Henry Plutz,Martin Potthast,Benno Stein,Minh Ngoc Ta,Yuxia Wang,Eva Zangerle
类目:Computation and Language (cs.CL)
关键词:advance computational stylometry, Multi-author Writing Style, Writing Style Analysis, Generative Plagiarism Detection, Reasoning Trajectory Detection
备注:
点击查看摘要
Abstract:The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new and benchmark the robustness of existing text watermarking schemes, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.
80. 【2602.09138】PABU: Progress-Aware Belief Update for Efficient LLM Agents
链接:https://arxiv.org/abs/2602.09138
作者:Haitao Jiang,Lin Ge,Hengrui Cai,Rui Song
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Model, Large Language, full action-observation histories, higher inference cost, introduce task-irrelevant information
备注:
点击查看摘要
Abstract:Large Language Model (LLM) agents commonly condition actions on full action-observation histories, which introduce task-irrelevant information that easily leads to redundant actions and higher inference cost. We propose Progress-Aware Belief Update (PABU), a belief-state framework that compactly represents an agent's state by explicitly modeling task progress and selectively retaining past actions and observations. At each step, the agent predicts its relative progress since the previous round and decides whether the newly encountered interaction should be stored, conditioning future decisions only on the retained subset. Across eight environments in the AgentGym benchmark, and using identical training trajectories, PABU achieves an 81.0% task completion rate, outperforming previous State of the art (SoTA) models with full-history belief by 23.9%. Additionally, PABU's progress-oriented action selection improves efficiency, reducing the average number of interaction steps to 9.5, corresponding to a 26.9% reduction. Ablation studies show that both explicit progress prediction and selective retention are necessary for robust belief learning and performance gains.
81. 【2602.09113】Benchmarking the Energy Savings with Speculative Decoding Strategies
链接:https://arxiv.org/abs/2602.09113
作者:Rohit Dutta,Paramita Koley,Soham Poddar,Janardan Misra,Sanjay Podder,Naveen Balani,Saptarshi Ghosh,Niloy Ganguly
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:cost of LLM, LLM inferences, speculative decoding strategies, Speculative decoding, effective method
备注: Accepted at EACL Findings 2026
点击查看摘要
Abstract:Speculative decoding has emerged as an effective method to reduce latency and inference cost of LLM inferences. However, there has been inadequate attention towards the energy requirements of these models. To address this gap, this paper presents a comprehensive survey of energy requirements of speculative decoding strategies, with detailed analysis on how various factors -- model size and family, speculative decoding strategies, and dataset characteristics -- influence the energy optimizations.
82. 【2602.09082】UI-Venus-1.5 Technical Report
链接:https://arxiv.org/abs/2602.09082
作者:Veuns-Team:Changlong Gao,Zhangxuan Gu,Yulin Liu,Xinyu Qiu,Shuheng Shen,Yue Wen,Tianyu Xia,Zhenyu Xu,Zhengwen Zeng,Beitong Zhou,Xingran Zhou,Weizhi Chen,Sunhao Dai,Jingya Dou,Yichen Gong,Yuan Guo,Zhenlin Guo,Feng Li,Qian Li,Jinzhen Lin,Yuqi Zhou,Linchao Zhu,Liang Chen,Zhenyu Guo,Changhua Meng,Weiqiang Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Online Reinforcement Learning, unified GUI Agent, GUI Agent designed, GUI Agent constructed, foundational GUI semantics
备注:
点击查看摘要
Abstract:GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains this http URL this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world this http URL proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application this http URL to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: this https URL Model: this https URL
83. 【2602.09389】VTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization
链接:https://arxiv.org/abs/2602.09389
作者:Waris Quamer,Mu-Ruei Tseng,Ghady Nasrallah,Ricardo Gutierrez-Osuna
类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
关键词:Real-time voice conversion, Real-time voice, anonymization require causal, require causal, voice conversion
备注:
点击查看摘要
Abstract:Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with 80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
84. 【2602.09270】Collective Behavior of AI Agents: the Case of Moltbook
链接:https://arxiv.org/abs/2602.09270
作者:Giordano De Marzo,David Garcia
类目:Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:large scale data, scale data analysis, media platform exclusively, platform exclusively populated, Reddit-style social media
备注:
点击查看摘要
Abstract:We present a large scale data analysis of Moltbook, a Reddit-style social media platform exclusively populated by AI agents. Analyzing over 369,000 posts and 3.0 million comments from approximately 46,000 active agents, we find that AI collective behavior exhibits many of the same statistical regularities observed in human online communities: heavy-tailed distributions of activity, power-law scaling of popularity metrics, and temporal decay patterns consistent with limited attention dynamics. However, we also identify key differences, including a sublinear relationship between upvotes and discussion size that contrasts with human behavior. These findings suggest that, while individual AI agents may differ fundamentally from humans, their emergent collective dynamics share structural similarities with human social systems.
信息检索
1. 【2602.10024】Overview of the TREC 2025 RAGTIME Track
链接:https://arxiv.org/abs/2602.10024
作者:Dawn Lawrie,Sean MacAvaney,James Mayfield,Luca Soldaini,Eugene Yang,Andrew Yates
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:RAG TREC Instrument, RAG TREC, TREC Instrument, English Report Generation, study report generation
备注: 10 pages, 3 figures, notebook version of the RAGTIME 2025 overview paper
点击查看摘要
Abstract:The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.
2. 【2602.10016】Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design
链接:https://arxiv.org/abs/2602.10016
作者:Bojian Hou,Xiaolong Liu,Xiaoyi Liu,Jiaqi Xu,Yasmine Badr,Mengyue Hang,Sudhanshu Chanpuriya,Junqing Zhou,Yuhang Yang,Han Xu,Qiuling Suo,Laming Chen,Yuxi Hu,Jiasheng Zhang,Huaqing Xiong,Yuzhen Huang,Chao Chen,Yue Dong,Yi Yang,Shuo Chang,Xiaorui Gan,Wenlin Chen,Santanu Kolay,Darren Liu,Jade Nie,Chunzhi Yang,Jiyan Yang,Huayu Li
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:massive-scale recommendation systems, Deriving predictable scaling, recommendation systems, Deriving predictable, govern the relationship
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
3. 【2602.09935】Efficient Learning of Sparse Representations from Interactions
链接:https://arxiv.org/abs/2602.09935
作者:Vojtěch Vančura,Martin Spišák,Rodrigo Alves,Ladislav Peška
类目:Information Retrieval (cs.IR)
关键词:Behavioral patterns captured, production recommender systems, Behavioral patterns, recommender systems, patterns captured
备注: In the proceedings of the Web Conference (WWW) 2026 (4 pages)
点击查看摘要
Abstract:Behavioral patterns captured in embeddings learned from interaction data are pivotal across various stages of production recommender systems. However, in the initial retrieval stage, practitioners face an inherent tradeoff between embedding expressiveness and the scalability and latency of serving components, resulting in the need for representations that are both compact and expressive. To address this challenge, we propose a training strategy for learning high-dimensional sparse embedding layers in place of conventional dense ones, balancing efficiency, representational expressiveness, and interpretability. To demonstrate our approach, we modified the production-grade collaborative filtering autoencoder ELSA, achieving up to 10x reduction in embedding size with no loss of recommendation accuracy, and up to 100x reduction with only a 2.5% loss. Moreover, the active embedding dimensions reveal an interpretable inverted-index structure that segments items in a way directly aligned with the model's latent space, thereby enabling integration of segment-level recommendation functionality (e.g., 2D homepage layouts) within the candidate retrieval model itself. Source codes, additional results, as well as a live demo are available at this https URL
4. 【2602.09914】AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
链接:https://arxiv.org/abs/2602.09914
作者:Tilahun Yeshambel,Moncef Garouani,Josiane Mothe
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:high-quality supervised data, GPT-style generative models, generative models rely, Amharic data resource, rely on large
备注: 7 pages, Submitted to resource track
点击查看摘要
Abstract:Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV,JSON,JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
5. 【2602.09901】QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search
链接:https://arxiv.org/abs/2602.09901
作者:Jianzhao Huang,Xiaorui Huang,Fei Zhao,Yunpeng Liu,Hui Zhang,Fangcheng Shi,Congfeng Li,Zechen Sun,Yi Wu,Yao Hu,Yunhan Bai,Shaosheng Cao
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Social Network Service, large-scale Social Network, Network Service, Social Network, large-scale Social
备注:
点击查看摘要
Abstract:Query Processing (QP) bridges user intent and content supply in large-scale Social Network Service (SNS) search engines. Traditional QP systems rely on pipelines of isolated discriminative models (e.g., BERT), suffering from limited semantic understanding and high maintenance overhead. While Large Language Models (LLMs) offer a potential solution, existing approaches often optimize sub-tasks in isolation, neglecting intrinsic semantic synergy and necessitating independent iterations. Moreover, standard generative methods often lack grounding in SNS scenarios, failing to bridge the gap between open-domain corpora and informal SNS linguistic patterns, while struggling to adhere to rigorous business definitions. We present QP-OneModel, a Unified Generative LLM for Multi-Task Query Understanding in the SNS domain. We reformulate heterogeneous sub-tasks into a unified sequence generation paradigm, adopting a progressive three-stage alignment strategy culminating in multi-reward Reinforcement Learning. Furthermore, QP-OneModel generates intent descriptions as a novel high-fidelity semantic signal, effectively augmenting downstream tasks such as query rewriting and ranking. Offline evaluations show QP-OneModel achieves a 7.35% overall gain over discriminative baselines, with significant F1 boosts in NER (+9.01%) and Term Weighting (+9.31%). It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks. Fully deployed at Xiaohongshu, online A/B tests confirm its industrial value, optimizing retrieval relevance (DCG) by 0.21% and lifting user retention by 0.044%.
6. 【2602.09829】Internalizing Multi-Agent Reasoning for Accurate and Efficient LLM-based Recommendation
链接:https://arxiv.org/abs/2602.09829
作者:Yang Wu,Haoze Wang,Qian Li,Jun Zhang,Huan Yu,Jie Jiang
类目:Information Retrieval (cs.IR)
关键词:interpret user intent, Large Language Models, leveraging extensive world, extensive world knowledge, Large Language
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are reshaping recommender systems by leveraging extensive world knowledge and semantic reasoning to interpret user intent. However, effectively integrating these capabilities with collaborative signals while avoiding prohibitive inference latency remains a critical bottleneck. To address this, we propose a trajectory-driven internalization framework to develop a Single-agent Trajectory-Aligned Recommender (STAR). Specifically, to internalize complex reasoning capabilities into a single efficient model, we first design a multi-agent teacher system capable of multi-turn tool usage and reflection. This teacher utilizes a Collaborative Signal Translation mechanism to explicitly convert latent behavioral patterns into descriptive natural language evidence to enhance reasoning accuracy. Subsequently, a trajectory-driven distillation pipeline transfers this agentic logic, including planning, tool usage, and self-reflection, into the compact STAR model. Extensive experiments demonstrate that STAR surpasses its teacher by 8.7% to 39.5% while eliminating iterative latency, paving the way for real-time, reasoning-enhanced recommendation.
7. 【2602.09764】Self-Supervised Learning as Discrete Communication
链接:https://arxiv.org/abs/2602.09764
作者:Kawtar Zaher,Ilyass Moummad,Olivier Buisson,Alexis Joly
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:offering limited control, methods learn continuous, methods learn, offering limited, limited control
备注:
点击查看摘要
Abstract:Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
8. 【2602.09744】DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation
链接:https://arxiv.org/abs/2602.09744
作者:Jie Jiang,Yang Wu,Qian Li,Yuling Xiong,Yihang Su,Junbang Huo,Longfei Lu,Jun Zhang,Huan Yu
类目:Information Retrieval (cs.IR)
关键词:capture complex user, promising paradigm, capture complex, sequential recommendation, Group Relative Policy
备注:
点击查看摘要
Abstract:Latent reasoning has emerged as a promising paradigm for sequential recommendation, enabling models to capture complex user intent through multi-step deliberation. Yet existing approaches often rely on deterministic latent chains that accumulate noise and overlook the uncertainty inherent in user intent, and they are typically trained in staged pipelines that hinder joint optimization and exploration. To address these challenges, we propose DiffuReason, a unified "Think-then-Diffuse" framework for sequential recommendation. It integrates multi-step Thinking Tokens for latent reasoning, diffusion-based refinement for denoising intermediate representations, and end-to-end Group Relative Policy Optimization (GRPO) alignment to optimize for ranking performance. In the Think stage, the model generates Thinking Tokens that reason over user history to form an initial intent hypothesis. In the Diffuse stage, rather than treating this hypothesis as the final output, we refine it through a diffusion process that models user intent as a probabilistic distribution, providing iterative denoising against reasoning noise. Finally, GRPO-based reinforcement learning enables the reasoning and refinement modules to co-evolve throughout training, without the constraints of staged optimization. Extensive experiments on four benchmarks demonstrate that DiffuReason consistently improves diverse backbone architectures. Online A/B tests on a large-scale industrial platform further validate its practical effectiveness.
9. 【2602.09616】With Argus Eyes: Assessing Retrieval Gaps via Uncertainty Scoring to Detect and Remedy Retrieval Blind Spots
链接:https://arxiv.org/abs/2602.09616
作者:Zeinab Sadat Taghavi,Ali Modarressi,Hinrich Schutze,Andreas Marfurt
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Reliable retrieval-augmented generation, find relevant information, Reliable retrieval-augmented, systems depend fundamentally, RAG systems
备注: 8 pages
点击查看摘要
Abstract:Reliable retrieval-augmented generation (RAG) systems depend fundamentally on the retriever's ability to find relevant information. We show that neural retrievers used in RAG systems have blind spots, which we define as the failure to retrieve entities that are relevant to the query, but have low similarity to the query embedding. We investigate the training-induced biases that cause such blind spot entities to be mapped to inaccessible parts of the embedding space, resulting in low retrievability. Using a large-scale dataset constructed from Wikidata relations and first paragraphs of Wikipedia, and our proposed Retrieval Probability Score (RPS), we show that blind spot risk in standard retrievers (e.g., CONTRIEVER, REASONIR) can be predicted pre-index from entity embedding geometry, avoiding expensive retrieval evaluations. To address these blind spots, we introduce ARGUS, a pipeline that enables the retrievability of high-risk (low-RPS) entities through targeted document augmentation from a knowledge base (KB), first paragraphs of Wikipedia, in our case. Extensive experiments on BRIGHT, IMPLIRET, and RAR-B show that ARGUS achieves consistent improvements across all evaluated retrievers (averaging +3.4 nDCG@5 and +4.5 nDCG@10 absolute points), with substantially larger gains in challenging subsets. These results establish that preemptively remedying blind spots is critical for building robust and trustworthy RAG systems (Code and Data).
10. 【2602.09570】LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
链接:https://arxiv.org/abs/2602.09570
作者:Narges Baba Ahmadi,Jan Strich,Martin Semmann,Chris Biemann
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:access legal information, Large language models, Large language, Large, legal information
备注: Accepted at EACL SRW 26
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code\footnote{\href{this https URL}{GitHub Repository}} and data\footnote{\href{this https URL}{Hugging Face Dataset}}.
11. 【2602.09552】Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA
链接:https://arxiv.org/abs/2602.09552
作者:Klejda Alushi,Jan Strich,Chris Biemann,Martin Semmann
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:large language models, question answering increasingly, answering increasingly relies, ground large language, Conversational question answering
备注: Accepted to EACL SRW 26
点击查看摘要
Abstract:Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used.\footnote{\href{this https URL}{GitHub Repository}}
12. 【2602.09448】he Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training
链接:https://arxiv.org/abs/2602.09448
作者:Xincan Feng,Noriki Nishida,Yusuke Sakai,Yuji Matsumoto
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Prior work reports, work reports conflicting, reports conflicting results, Prior work, synthetic data generation
备注: Under review
点击查看摘要
Abstract:Prior work reports conflicting results on query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics to quantify diversity's impact, making the problem measurable. Through experiments on 4 benchmark types (31 datasets), we find query diversity especially benefits multi-hop retrieval. Deep analysis on multi-hop data reveals that diversity benefit correlates strongly with query complexity ($r$$\geq$0.95, $p$$$0.05 in 12/14 conditions), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides actionable thresholds (CW$$10: use diversity; CW$$7: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, achieving state-of-the-art performance.
13. 【2602.09445】Personalized Parameter-Efficient Fine-Tuning of Foundation Models for Multimodal Recommendation
链接:https://arxiv.org/abs/2602.09445
作者:Sunwoo Kim,Hyunjin Hwang,Kijung Shin
类目:Information Retrieval (cs.IR)
关键词:substantial research, recommender systems, encode such data, multimodal foundation models, PEFT
备注: To be published at The Web Conference 2026 (WWW 2026)
点击查看摘要
Abstract:In recent years, substantial research has integrated multimodal item metadata into recommender systems, often by using pre-trained multimodal foundation models to encode such data. Since these models are not originally trained for recommendation tasks, recent works efficiently adapt them via parameter-efficient fine-tuning (PEFT). However, even with PEFT, item embeddings from multimodal foundation models remain user-blind: item embeddings are not conditioned on user interests, despite the fact that users with diverse interests attend to different item aspects. To address this limitation, we propose PerPEFT, a personalized PEFT strategy for multimodal recommendation. Specifically, PerPEFT groups users by interest and assigns a distinct PEFT module to each group, enabling each module to capture the fine-grained item aspects most predictive of that group`s purchase decisions. We further introduce a specialized training technique that strengthens this user-group conditioning. Notably, PerPEFT is PEFT-agnostic and can be paired with any PEFT method applicable to multimodal foundation models. Through extensive experiments, we show that (1) PerPEFT outperforms the strongest baseline by up to 15.3% (NDCG@20) and (2) delivers consistent gains across diverse PEFT variants. It is noteworthy that, even with personalization, PEFT remains lightweight, adding only 1.3% of the parameter count of the foundation model. We provide our code and datasets at this https URL.
14. 【2602.09401】SARM: LLM-Augmented Semantic Anchor for End-to-End Live-Streaming Ranking
链接:https://arxiv.org/abs/2602.09401
作者:Ruochen Yang,Yueyang Liu,Zijie Zhuang,Changxin Lao,Yuhui Zhang,Jiangxia Cao,Jia Xu,Xiang Chen,Haoke Xiao,Xiangyu Wu,Xiaoyou Zhou,Xiao Lv,Shuang Yang,Tingwen Liu,Zhaojie Liu,Han Li,Kun Gai
类目:Information Retrieval (cs.IR)
关键词:recommendation requires precise, requires precise modeling, real-time serving constraints, live-streaming recommendation requires, strict real-time serving
备注:
点击查看摘要
Abstract:Large-scale live-streaming recommendation requires precise modeling of non-stationary content semantics under strict real-time serving constraints. In industrial deployment, two common approaches exhibit fundamental limitations: discrete semantic abstractions sacrifice descriptive precision through clustering, while dense multimodal embeddings are extracted independently and remain weakly aligned with ranking optimization, limiting fine-grained content-aware ranking. To address these limitations, we propose \textbf{SARM}, an end-to-end ranking architecture that integrates natural-language semantic anchors directly into ranking optimization, enabling fine-grained author representations conditioned on multimodal content. Each semantic anchor is represented as learnable text tokens jointly optimized with ranking features, allowing the model to adapt content descriptions to ranking objectives. A lightweight dual-token gated design captures domain-specific live-streaming semantics, while an asymmetric deployment strategy preserves low-latency online training and serving. Extensive offline evaluation and large-scale A/B tests show consistent improvements over production baselines. SARM is fully deployed and serves over 400 million users daily.
15. 【2602.09387】Query-Mixed Interest Extraction and Heterogeneous Interaction: A Scalable CTR Model for Industrial Recommender Systems
链接:https://arxiv.org/abs/2602.09387
作者:Fangye Wang,Guowei Yang,Xiaojiang Zhou,Song Yang,Pengjie Wang
类目:Information Retrieval (cs.IR)
关键词:Learning effective feature, modern recommender systems, industrial settings due, sparse multi-field inputs, Learning effective
备注:
点击查看摘要
Abstract:Learning effective feature interactions is central to modern recommender systems, yet remains challenging in industrial settings due to sparse multi-field inputs and ultra-long user behavior sequences. While recent scaling efforts have improved model capacity, they often fail to construct both context-aware and context-independent user intent from the long-term and real-time behavior sequence. Meanwhile, recent work also suffers from inefficient and homogeneous interaction mechanisms, leading to suboptimal prediction performance. To address these limitations, we propose HeMix, a scalable ranking model that unifies adaptive sequence tokenization and heterogeneous interaction structure. Specifically, HeMix introduces a Query-Mixed Interest Extraction module that jointly models context-aware and context-independent user interests via dynamic and fixed queries over global and real-time behavior sequences. For interaction, we replace self-attention with the HeteroMixer block, enabling efficient, multi-granularity cross-feature interactions that adopt the multi-head token fusion, heterogeneous interaction and group-aligned reconstruction pipelines. HeMix demonstrates favorable scaling behavior, driven by the HeteroMixer block, where increasing model scale via parameter expansion leads to steady improvements in recommendation accuracy. Experiments on industrial-scale datasets show that HeMix scales effectively and consistently outperforms strong baselines. Most importantly, HeMix has been deployed on the AMAP platform, delivering significant online gains: +0.61% GMV, +2.32% PV_CTR, and +0.81% UV_CVR.
16. 【2602.09386】SMES: Towards Scalable Multi-Task Recommendation via Expert Sparsity
链接:https://arxiv.org/abs/2602.09386
作者:Yukun Zhang,Si Dong,Xu Wang,Bo Chen,Qinglin Jia,Shengzhe Wang,Jinlong Jiao,Runhan Li,Jiaqing Liu,Chaoyi Ma,Ruiming Tang,Guorui Zhou,Han Li,Kun Gai
类目:Information Retrieval (cs.IR)
关键词:Industrial recommender systems, recommender systems typically, systems typically rely, Industrial recommender, estimate diverse user
备注:
点击查看摘要
Abstract:Industrial recommender systems typically rely on multi-task learning to estimate diverse user feedback signals and aggregate them for ranking. Recent advances in model scaling have shown promising gains in recommendation. However, naively increasing model capacity imposes prohibitive online inference costs and often yields diminishing returns for sparse tasks with skewed label distributions. This mismatch between uniform parameter scaling and heterogeneous task capacity demands poses a fundamental challenge for scalable multi-task recommendation. In this work, we investigate parameter sparsification as a principled scaling paradigm and identify two critical obstacles when applying sparse Mixture-of-Experts (MoE) to multi-task recommendation: exploded expert activation that undermines instance-level sparsity and expert load skew caused by independent task-wise routing. To address these challenges, we propose SMES, a scalable sparse MoE framework with progressive expert routing. SMES decomposes expert activation into a task-shared expert subset jointly selected across tasks and task-adaptive private experts, explicitly bounding per-instance expert execution while preserving task-specific capacity. In addition, SMES introduces a global multi-gate load-balancing regularizer that stabilizes training by regulating aggregated expert utilization across all tasks. SMES has been deployed in Kuaishou large-scale short-video services, supporting over 400 million daily active users. Extensive online experiments demonstrate stable improvements, with GAUC gain of 0.29% and a 0.31% uplift in user watch time.
17. 【2602.09229】Beyond the Unit Hypersphere: Embedding Magnitude in Contrastive Learning
链接:https://arxiv.org/abs/2602.09229
作者:Xincan Feng,Taro Watanabe
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:implicit assumption, prevalent in contrastive, makes an implicit, Cosine similarity, cosine similarity comparable
备注: Preliminary work. Under review
点击查看摘要
Abstract:Cosine similarity is prevalent in contrastive learning, yet it makes an implicit assumption: embedding magnitude is noise. Prior work occasionally found dot product and cosine similarity comparable, but left unanswered WHAT information magnitude carries, WHEN it helps, and HOW to leverage it. We conduct a systematic study through a $2 \times 2$ ablation that independently controls input-side and output-side normalization across text and vision models. Our findings reveal three key insights. First, in text retrieval, output (document) magnitude strongly correlates with relevance (Cohen's $d$ up to 1.80), yielding the largest gains on reasoning-intensive tasks. Second, input and output magnitudes serve asymmetric roles: output magnitude directly scales similarity scores while input magnitude modulates training dynamics. Third, magnitude learning benefits asymmetric tasks (text retrieval, RAG) but harms symmetric tasks (STS, text-image alignment). These findings establish a task symmetry principle: the choice between cosine and dot product depends on whether the task has distinct input roles, enabling cost-free improvements by simply removing an unnecessary constraint.
18. 【2602.09163】FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
链接:https://arxiv.org/abs/2602.09163
作者:Xingjian Zhang,Sophia Moylan,Ziyang Xiong,Qiaozhu Mei,Yichen Luo,Jiaqi W. Ma
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:bases accelerate discovery, queryable formats, emerging AI systems, accelerate discovery, discovery by curating
备注:
点击查看摘要
Abstract:Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
19. 【2602.09126】An Interactive Metrics Dashboard for the Keck Observatory Archive
链接:https://arxiv.org/abs/2602.09126
作者:G. Bruce Berriman,Min Phone Myat Zaw
类目:Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR)
关键词:Exoplanet Science Institute, NASA Exoplanet Science, Science Institute, NASA Exoplanet, Exoplanet Science
备注: 4 pages, 2 figures, Submitted to Proc. ADASS 2025
点击查看摘要
Abstract:Since 2004, the Keck Observatory Archive (KOA) has operated as a NASA-funded collaboration between the NASA Exoplanet Science Institute ( NExScI) and the W.M. Keck Observatory. It ingests and serves all data acquired by the twin 10-meter Keck telescopes on Mauna Kea, Hawaii. In the past three years, KOA has begun a modernization program to replace the architecture and systems used since the archive's creation with a new modern Python-based infrastructure. This infrastructure will position KOA to respond to the rapid growth of new and complex data sets that will be acquired by new instruments now in development, and enable follow-up to identify the deluge of alerts of transient sources expected by new survey telescopes such as the Vera C. Rubin Observatory. Since 2022, KOA has ingested new data in near-real time, generally within one minute of creation, and has made them immediately accessible to observers through a dedicated web interface. The archive is now deploying a new, scalable, Python-based, VO-compliant query infrastructure built with the Plotly-Dash framework and R-tree indices to speed-up queries by a factor of 20. The project described here exploits the new query infrastructure to develop a dashboard that will return live metrics on the performance and growth of the archive. These metrics assess the current health of the archive and guide planning future hardware and software upgrades. This single dashboard will enable, for example, monitoring of real-time ingestion, as well as studying the long-term growth of the archive. Current methods of gathering metrics that have been in place since the archive opened will not support the archive as it continues to scale. These methods suffer from high latency, are not optimized for on-demand metrics, are scattered among various tools, and are cumbersome to use.
Comments:
4 pages, 2 figures, Submitted to Proc. ADASS 2025
Subjects:
Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.09126 [astro-ph.IM]
(or
arXiv:2602.09126v1 [astro-ph.IM] for this version)
https://doi.org/10.48550/arXiv.2602.09126
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
计算机视觉
1. 【2602.10116】SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
链接:https://arxiv.org/abs/2602.10116
作者:Hongchi Xia,Xuan Li,Zhaoshuo Li,Qianli Ma,Jiashu Xu,Ming-Yu Liu,Yin Cui,Tsung-Yi Lin,Wei-Chiu Ma,Shenlong Wang,Shuran Song,Fangyin Wei
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Real-world data collection, agents remains costly, embodied agents remains, calling for scalable, costly and unsafe
备注: Project Page: [this https URL](https://nvlabs.github.io/sage)
点击查看摘要
Abstract:Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: this https URL.
2. 【2602.10115】Quantum Multiple Rotation Averaging
链接:https://arxiv.org/abs/2602.10115
作者:Shuteng Wang,Natacha Kuete Meli,Michael Möller,Vladislav Golyanik
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multiple rotation averaging, noisy relative measurements, recover globally consistent, globally consistent absolute, consistent absolute rotations
备注: 16 pages, 13 figures, 4 tables; project page: [this https URL](https://4dqv.mpi-inf.mpg.de/QMRA/)
点击查看摘要
Abstract:Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS's performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
3. 【2602.10113】ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
链接:https://arxiv.org/abs/2602.10113
作者:Mingyang Wu,Ashirbad Mishra,Soumik Dey,Shuo Xing,Naveen Ravipati,Hansi Wu,Binbin Li,Zhengzhong Tu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving fine-grained object, changing viewpoints remains, coherent video sequence, fine-grained object identity, animates a static
备注: Project page: [this https URL](https://myangwu.github.io/ConsID-Gen)
点击查看摘要
Abstract:Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at this https URL.
4. 【2602.10104】Olaf-World: Orienting Latent Actions for Video World Modeling
链接:https://arxiv.org/abs/2602.10104
作者:Yuxin Jiang,Yuchao Gu,Ivor W. Tsang,Mike Zheng Shou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Scaling action-controllable world, Scaling action-controllable, action-controllable world models, Scaling, action
备注: Project page: [this https URL](https://showlab.github.io/Olaf-World/) Code: [this https URL](https://github.com/showlab/Olaf-World)
点击查看摘要
Abstract:Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
5. 【2602.10102】VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
链接:https://arxiv.org/abs/2602.10102
作者:Zhongwei Ren,Yunchao Wei,Xiao Yu,Guixun Luo,Yao Zhao,Bingyi Kang,Jiashi Feng,Xiaojie Jin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Learning transferable knowledge, Learning transferable, intelligent agents, fundamental capability, capability of intelligent
备注: Code and models are released at: [this https URL](https://maverickren.github.io/VideoWorld2.github.io/)
点击查看摘要
Abstract:Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
6. 【2602.10099】Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
链接:https://arxiv.org/abs/2602.10099
作者:Amandeep Kumar,Vishal M. Patel
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Leveraging representation encoders, Leveraging representation, generative modeling offers, high-fidelity synthesis, modeling offers
备注: Technical Report
点击查看摘要
Abstract:Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck proposing computationally expensive width scaling of diffusion transformers we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. Our method RJF enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
7. 【2602.10098】VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
链接:https://arxiv.org/abs/2602.10098
作者:Jingwen Sun,Wenyao Zhang,Zekun Qi,Shaojie Ren,Zezhi Liu,Hanxin Zhu,Guangzhong Sun,Xin Jin,Zhibo Chen
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:action-relevant state transitions, policies on internet-scale, video is appealing, wrong thing, making them vulnerable
备注:
点击查看摘要
Abstract:Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is \emph{leakage-free state prediction}: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
8. 【2602.10095】Causality in Video Diffusers is Separable from Denoising
链接:https://arxiv.org/abs/2602.10095
作者:Xingjian Bai,Guande He,Zhengqi Li,Eli Shechtman,Xun Huang,Zongze Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:complex generative processes, uni-directional cause-effect relationships, uni-directional cause-effect, relationships between components, underlies many complex
备注:
点击查看摘要
Abstract:Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
9. 【2602.10094】4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
链接:https://arxiv.org/abs/2602.10094
作者:Yihang Luo,Shangchen Zhou,Yushi Lan,Xingang Pan,Chen Change Loy
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unified feed-forward framework, unified feed-forward, feed-forward framework, monocular videos, dense scene geometry
备注: Project page: [this https URL](https://yihangluo.com/projects/4RC/)
点击查看摘要
Abstract:We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
10. 【2602.10079】Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
链接:https://arxiv.org/abs/2602.10079
作者:Soumyaroop Nandi,Prem Natarajan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:attention-based state-space framework, framework for image, jointly localizes, image forgery detection, introduce Forensim
备注:
点击查看摘要
Abstract:We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.
11. 【2602.10062】Vendi Novelty Scores for Out-of-Distribution Detection
链接:https://arxiv.org/abs/2602.10062
作者:Amey P. Pasarkar,Adji Bousso Dieng
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:machine learning systems, learning systems, machine learning, VNS, OOD detection
备注:
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
12. 【2602.10052】Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving
链接:https://arxiv.org/abs/2602.10052
作者:Serin Varghese,Kevin Ross,Fabian Hueger,Kira Maag
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep neural networks, achieved remarkable success, Deep neural, neural networks, environmental perception
备注:
点击查看摘要
Abstract:Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
13. 【2602.10045】Conformal Prediction Sets for Instance Segmentation
链接:https://arxiv.org/abs/2602.10045
作者:Kerri Lu,Dan M. Kluger,Stephen Bates,Sherrie Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
关键词:principled uncertainty quantification, lack principled uncertainty, Current instance segmentation, segmentation models achieve, models achieve high
备注:
点击查看摘要
Abstract:Current instance segmentation models achieve high performance on average predictions, but lack principled uncertainty quantification: their outputs are not calibrated, and there is no guarantee that a predicted mask is close to the ground truth. To address this limitation, we introduce a conformal prediction algorithm to generate adaptive confidence sets for instance segmentation. Given an image and a pixel coordinate query, our algorithm generates a confidence set of instance predictions for that pixel, with a provable guarantee for the probability that at least one of the predictions has high Intersection-Over-Union (IoU) with the true object instance mask. We apply our algorithm to instance segmentation examples in agricultural field delineation, cell segmentation, and vehicle detection. Empirically, we find that our prediction sets vary in size based on query difficulty and attain the target coverage, outperforming existing baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. We provide versions of the algorithm with asymptotic and finite sample guarantees.
14. 【2602.10043】Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI
链接:https://arxiv.org/abs/2602.10043
作者:Gaurang Sharma,Harri Polonen,Juha Pajula,Jutta Suksi,Jussi Tohka
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Head Magnetic Resonance, Magnetic Resonance Imaging, Head Magnetic, Resonance Imaging, Magnetic Resonance
备注:
点击查看摘要
Abstract:Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. But, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.10043 [cs.CV]
(or
arXiv:2602.10043v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.10043
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
15. 【2602.10042】Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection
链接:https://arxiv.org/abs/2602.10042
作者:Changjiang Jiang,Xinkuan Sha,Fengchang Yu,Jingjing Liu,Jian Liu,Mingqi Fang,Chenfeng Zhang,Wei Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:detect synthetic images, Recent studies, demonstrated that incorporating, synthetic images, studies have demonstrated
备注: Accepted by ICASSP 2026
点击查看摘要
Abstract:Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
16. 【2602.10032】Perception with Guarantees: Certified Pose Estimation via Reachability Analysis
链接:https://arxiv.org/abs/2602.10032
作者:Tobias Ladner,Yasser Shoukry,Matthias Althoff
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:cyber-physical systems, systems are increasingly, increasingly entrusted, safety-critical tasks, Agents
备注:
点击查看摘要
Abstract:Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.
17. 【2602.09999】Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
链接:https://arxiv.org/abs/2602.09999
作者:Florian Hahlbohm,Linus Franke,Martin Eisemann,Marcus Magnor
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Recent advances, Gaussian Splatting, focused on accelerating, Recent, Splatting
备注: Project page: [this https URL](https://fhahlbohm.github.io/faster-gaussian-splatting)
点击查看摘要
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\times$ faster training while maintaining visual quality, establishing a new cost-effective and resource efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.
18. 【2602.09989】Efficient Special Stain Classification
链接:https://arxiv.org/abs/2602.09989
作者:Oskar Thaeter,Christian Grashei,Anette Haas,Elisa Schmoeckel,Han Li,Peter J. Schüffler
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Haematoxylin and Eosin, Toggle, specific tissue characteristics, visualize specific tissue, Toggle Hugging Face
备注: 14 pages, 7 figures, 2 tables
点击查看摘要
Abstract:Stains are essential in histopathology to visualize specific tissue characteristics, with Haematoxylin and Eosin (HE) serving as the clinical standard. However, pathologists frequently utilize a variety of special stains for the diagnosis of specific morphologies. Maintaining accurate metadata for these slides is critical for quality control in clinical archives and for the integrity of computational pathology datasets. In this work, we compare two approaches for automated classification of stains using whole slide images, covering the 14 most commonly used special stains in our institute alongside standard and frozen-section HE. We evaluate a Multi-Instance Learning (MIL) pipeline and a proposed lightweight thumbnail-based approach. On internal test data, MIL achieved the highest performance (macro F1: 0.941 for 16 classes; 0.969 for 14 merged classes), while the thumbnail approach remained competitive (0.897 and 0.953, respectively). On external TCGA data, the thumbnail model generalized best (weighted F1: 0.843 vs. 0.807 for MIL). The thumbnail approach also increased throughput by two orders of magnitude (5.635 vs. 0.018 slides/s for MIL with all patches). We conclude that thumbnail-based classification provides a scalable and robust solution for routine visual quality control in digital pathology workflows.
Comments:
14 pages, 7 figures, 2 tables
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.09989 [cs.CV]
(or
arXiv:2602.09989v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.09989
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Oskar Thaeter [view email] [v1]
Tue, 10 Feb 2026 17:15:13 UTC (13,991 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Efficient Special Stain Classification, by Oskar Thaeter and 5 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context: cs.CV
prev
|
next
new
|
recent
| 2026-02
Change to browse by:
cs
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
19. 【2602.09985】Online Monitoring Framework for Automotive Time Series Data using JEPA Embeddings
链接:https://arxiv.org/abs/2602.09985
作者:Alexander Fertig,Karthikeyan Chandra Sekaran,Lakshman Balasubramanian,Michael Botsch
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous vehicles, vehicles are rolled, ensure their safe, anomalies, safe operation
备注: Accepted at the 2026 IEEE Intelligent Vehicles Symposium. Copyright 2026 IEEE. Permission from IEEE must be obtained for use in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
点击查看摘要
Abstract:As autonomous vehicles are rolled out, measures must be taken to ensure their safe operation. In order to supervise a system that is already in operation, monitoring frameworks are frequently employed. These run continuously online in the background, supervising the system status and recording anomalies. This work proposes an online monitoring framework to detect anomalies in object state representations. Thereby, a key challenge is creating a framework for anomaly detection without anomaly labels, which are usually unavailable for unknown anomalies. To address this issue, this work applies a self-supervised embedding method to translate object data into a latent representation space. For this, a JEPA-based self-supervised prediction task is constructed, allowing training without anomaly labels and the creation of rich object embeddings. The resulting expressive JEPA embeddings serve as input for established anomaly detection methods, in order to identify anomalies within object state representations. This framework is particularly useful for applications in real-world environments, where new or unknown anomalies may occur during operation for which there are no labels available. Experiments performed on the publicly available, real-world nuScenes dataset illustrate the framework's capabilities.
20. 【2602.09983】Coupled Inference in Diffusion Models for Semantic Decomposition
链接:https://arxiv.org/abs/2602.09983
作者:Calvin Yeung,Ali Zakeri,Zhuowen Zou,Mohsen Imani
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:visual scenes, networks, decomposition, Hopfield networks, Abstract
备注: 15 pages
点击查看摘要
Abstract:Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
21. 【2602.09979】Learning to Detect Baked Goods with Limited Supervision
链接:https://arxiv.org/abs/2602.09979
作者:Thomas H. Schmitt,Maximilian Bundscherer,Tobias Bocklet
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Monitoring leftover products, optimize future production, Monitoring leftover, future production, leftover products
备注:
点击查看摘要
Abstract:Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer lexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Finetuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully-supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
22. 【2602.09949】Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework
链接:https://arxiv.org/abs/2602.09949
作者:Franziska Krauß,Matthias Ege,Zoltan Lovasz,Albrecht Bartz-Schmidt,Igor Tsaur,Oliver Sawodny,Carina Veil
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:cancer surveillance requires, surveillance requires tracking, requires tracking tumor, tracking tumor sites, lacks stable landmarks
备注:
点击查看摘要
Abstract:Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific "vascular fingerprint" for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines Transformers to capture global vessel topology prior with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ a physics-aware pretraining, that is a self-supervised strategy using clinically grounded augmentations on unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the reliable structural stability required for clinical navigation.
23. 【2602.09934】VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization
链接:https://arxiv.org/abs/2602.09934
作者:Yikun Liu,Yuan Liu,Shangzhe Di,Haicheng Wang,Zhongyin Zhao,Le Tian,Xiao Zhou,Jie Zhou,Jiangchao Yao,Yanfeng Wang,Weidi Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, demonstrating superior high-level
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
24. 【2602.09933】Unbalanced optimal transport for robust longitudinal lesion evolution with registration-aware and appearance-guided priors
链接:https://arxiv.org/abs/2602.09933
作者:Melika Qahqaie,Dominik Neumann,Tobias Heimann,Andreas Maier,Veronika A. Zimmer
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:assessing treatment response, time remains challenging, Evaluating lesion evolution, reliable lesion correspondence, establishing reliable lesion
备注: This work has been submitted to the IEEE for possible publication. Accepted at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026
点击查看摘要
Abstract:Evaluating lesion evolution in longitudinal CT scans of can cer patients is essential for assessing treatment response, yet establishing reliable lesion correspondence across time remains challenging. Standard bipartite matchers, which rely on geometric proximity, struggle when lesions appear, disappear, merge, or split. We propose a registration-aware matcher based on unbalanced optimal transport (UOT) that accommodates unequal lesion mass and adapts priors to patient-level tumor-load changes. Our transport cost blends (i) size-normalized geometry, (ii) local registration trust from the deformation-field Jacobian, and (iii) optional patch-level appearance consistency. The resulting transport plan is sparsified by relative pruning, yielding one-to-one matches as well as new, disappearing, merging, and splitting lesions without retraining or heuristic rules. On longitudinal CT data, our approach achieves consistently higher edge-detection precision and recall, improved lesion-state recall, and superior lesion-graph component F1 scores versus distance-only baselines.
25. 【2602.09932】GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery
链接:https://arxiv.org/abs/2602.09932
作者:Han Jinzhen,JinByeong Lee,JiSung Kim,MinKyung Cho,DaHee Kim,HongSik Yun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate three-dimensional urban, disaster risk assessment, poor cross-city generalisation, remain scarce due, three-dimensional urban data
备注:
点击查看摘要
Abstract:Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving 7.5% and 15.3% over the strongest CNN baseline, while maintaining under 3.5 m BH RMSE in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.
26. 【2602.09929】Monocular Normal Estimation via Shading Sequence Estimation
链接:https://arxiv.org/abs/2602.09929
作者:Zongrui Li,Xinhua Ma,Minghui Hu,Yunqing Zhao,Yingchen Yu,Qian Zheng,Chang Liu,Xudong Jiang,Song Bai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:single RGB image, single RGB, RGB image, normal maps, normal
备注: Accepted by ICLR 2026 (Oral Presentation)
点击查看摘要
Abstract:Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
27. 【2602.09927】A benchmark for video-based laparoscopic skill analysis and assessment
链接:https://arxiv.org/abs/2602.09927
作者:Isabel Funke,Sebastian Bodenstedt,Felix von Bechtolsheim,Florian Oehme,Michael Maruschke,Stefanie Herrlich,Jürgen Weitz,Marius Distler,Sören Torge Mees,Stefanie Speidel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:complex surgical technique, requires extensive training, technique that requires, requires extensive, deep learning
备注: under review
点击查看摘要
Abstract:Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.
28. 【2602.09918】SARS: A Novel Face and Body Shape and Appearance Aware 3D Reconstruction System extends Morphable Models
链接:https://arxiv.org/abs/2602.09918
作者:Gulraiz Khan,Kenneth Y. Wertheim,Kevin Pimbblet,Waqas Ahmed
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Morphable Models, inputs and recreates, physical appearance, morphable model, Morphable
备注:
点击查看摘要
Abstract:Morphable Models (3DMMs) are a type of morphable model that takes 2D images as inputs and recreates the structure and physical appearance of 3D objects, especially human faces and bodies. 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability in the 3D Morphable models can be controlled by tuning diverse parameters. They are high-level image descriptors, such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring face semantic features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. In order to accommodate changes in these high-level facial characteristics, this work introduces a shape and appearance-aware 3D reconstruction system (named SARS by us), a c modular pipeline that extracts body and face information from a single image to properly rebuild the 3D model of the human full body.
29. 【2602.09883】AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization
链接:https://arxiv.org/abs/2602.09883
作者:Shaoqiu Zhang,Zizhong Ding,Kaicheng Yang,Junyi Wu,Xianglong Yan,Xi Li,Bingnan Duan,Jianping Fang,Yulun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformers, backbone for high-fidelity, video generation, high-fidelity image, image and video
备注: Code will be released at [this https URL](https://github.com/Qiushao-E/AdaTSQ/)
点击查看摘要
Abstract:Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at this https URL.
30. 【2602.09878】MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
链接:https://arxiv.org/abs/2602.09878
作者:Jiaxu Wang,Yicheng Jiang,Tianlun He,Jingkai Sun,Qiang Zhang,Junhao He,Jiahang Cao,Zesen Gan,Mingyuan Sun,Qiming Shao,Xiangyu Yue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:existing approaches typically, approaches typically support, purely image-based forecasting, reasoning over partial, limiting their ability
备注:
点击查看摘要
Abstract:World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
31. 【2602.09872】BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices
链接:https://arxiv.org/abs/2602.09872
作者:Mridankan Mandal
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Human activity recognition, Human activity, heterogeneous sensor configurations, activity recognition, wearable and mobile
备注:
点击查看摘要
Abstract:Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear time sequence processing with input dependent gating, presenting a compelling alternative to quadratic complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba inspired architectures optimized for resource constrained HAR: (1) CI-BabyMamba-HAR, using a channel independent stem that processes each sensor channel through shared weight, but instance independent transformations to prevent cross channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, using an early fusion stem that achieves channel count independent computational complexity. Both variants incorporate weight tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. Systematic ablation studies reveal that bidirectional scanning contributes up to 8.42% F1-score improvement, and gated temporal attention provides up to 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.
32. 【2602.09868】Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
链接:https://arxiv.org/abs/2602.09868
作者:Xiaoyue Ling,Chuqin Zhou,Chunyi Li,Yunuo Chen,Yuan Tian,Guo Lu,Wenjun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieving visually pleasing, Building on recent, visually pleasing reconstructions, recent advances, paradigm for achieving
备注:
点击查看摘要
Abstract:Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
33. 【2602.09856】Code2World: A GUI World Model via Renderable Code Generation
链接:https://arxiv.org/abs/2602.09856
作者:Yuhao Zheng,Li'an Zhong,Yi Wang,Rui Dai,Kaikui Liu,Xiangxiang Chu,Linyuan Lv,Philip Torr,Kevin Qinghong Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Autonomous GUI agents, GUI agents interact, Autonomous GUI, GUI World model, interact with environments
备注: github: [this https URL](https://github.com/AMAP-ML/Code2World) project page: [this https URL](https://amap-ml.github.io/Code2World/)
点击查看摘要
Abstract:Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
34. 【2602.09850】Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection
链接:https://arxiv.org/abs/2602.09850
作者:Peng Chen,Chao Huang,Yunkang Cao,Chengliang Liu,Wenqiang Wang,Mingbo Yang,Li Shen,Wenqi Ren,Xiaochun Cao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fine-grained defect patterns, detection demands precise, Industrial anomaly detection, demands precise reasoning, demands precise
备注:
点击查看摘要
Abstract:Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods. The code will be publicly available at this https URL.
35. 【2602.09843】Kelix Technique Report
链接:https://arxiv.org/abs/2602.09843
作者:Boyang Ding,Chenglong Chu,Dunju Zang,Han Li,Jiangxia Cao,Kun Gai,Muhao Wei,Ruiming Tang,Shiyao Wang,Siyang Mao,Xinchen Luo,Yahui Liu,Zhixin Ling,Zhuoran Yang,Ziming Li,Chengru Song,Guorui Zhou,Guowang Zhang,Hao Peng,Hao Wang,Jiaxin Deng,Jin Ouyang,Jinghao Zhang,Lejian Ren,Qianqian Wang,Qigen Hu,Tao Wang,Xingmei Wang,Yiping Yang,Zixing Zhang,Ziqi Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:expressing diverse tasks, Autoregressive large language, large language models, next-token prediction, large language
备注: Work in progress
点击查看摘要
Abstract:Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
36. 【2602.09839】ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge
链接:https://arxiv.org/abs/2602.09839
作者:Yijie Lin,Guofeng Ding,Haochen Zhou,Haobin Li,Mouxing Yang,Xi Peng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:largely emphasize semantic, offer limited diagnostics, Existing multimodal retrieval, benchmarks largely emphasize, emphasize semantic matching
备注:
点击查看摘要
Abstract:Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
37. 【2602.09825】SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
链接:https://arxiv.org/abs/2602.09825
作者:Zhaoxu Li,Chenqi Kong,Peijun Bao,Song Xia,Yi Tu,Yi Yu,Xinghao Jiang,Xudong Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, pose significant security, Large Vision-Language, pose significant, real-world applications
备注:
点击查看摘要
Abstract:Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model 's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
38. 【2602.09816】CompSplat: Compression-aware 3D Gaussian Splatting for Real-world Video
链接:https://arxiv.org/abs/2602.09816
作者:Hojun Song,Heejung Choi,Aro Kim,Chae-yeong Song,Gahyeon Kim,Soo Ye Kim,Jaehyup Lee,Sang-hyo Park
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:cultural heritage preservation, High-quality novel view, digital twins, view synthesis, heritage preservation
备注: Preprint. Under review
点击查看摘要
Abstract:High-quality novel view synthesis (NVS) from real-world videos is crucial for applications such as cultural heritage preservation, digital twins, and immersive media. However, real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, leading to pose drift, feature misalignment, and geometric distortion during reconstruction. Moreover, lossy compression amplifies these issues by introducing inconsistencies that gradually degrade geometry and rendering quality. While recent studies have addressed either long-sequence NVS or unposed reconstruction, compression-aware approaches still focus on specific artifacts or limited scenarios, leaving diverse compression patterns in long videos insufficiently explored. In this paper, we propose CompSplat, a compression-aware training framework that explicitly models frame-wise compression characteristics to mitigate inter-frame inconsistency and accumulated geometric errors. CompSplat incorporates compression-aware frame weighting and an adaptive pruning strategy to enhance robustness and geometric consistency, particularly under heavy compression. Extensive experiments on challenging benchmarks, including Tanks and Temples, Free, and Hike, demonstrate that CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing most recent state-of-the-art NVS approaches under severe compression conditions.
39. 【2602.09809】SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
链接:https://arxiv.org/abs/2602.09809
作者:Tong Zhang,Honglin Lin,Zhou Liu,Chong Chen,Wentao Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:structurally incorrect results, produce visually plausible, diagrams convey explicit, explicit structural information, incorrect results
备注:
点击查看摘要
Abstract:Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
40. 【2602.09775】Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
链接:https://arxiv.org/abs/2602.09775
作者:Abhipsa Basu,Yugam Bahl,Kirti Bhagat,Preethi Seshadri,R. Venkatesh Babu,Danish Pruthi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent studies show, generate geographically representative, Recent studies, models often fail, raising concerns
备注: 41 pages, 20 figures
点击查看摘要
Abstract:Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
41. 【2602.09764】Self-Supervised Learning as Discrete Communication
链接:https://arxiv.org/abs/2602.09764
作者:Kawtar Zaher,Ilyass Moummad,Olivier Buisson,Alexis Joly
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:offering limited control, methods learn continuous, methods learn, offering limited, limited control
备注:
点击查看摘要
Abstract:Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
42. 【2602.09740】Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors
链接:https://arxiv.org/abs/2602.09740
作者:Sandeep Gupta,Roberto Passerone
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving capabilities, Autonomous Vehicles, Connected and Autonomous, autonomous driving, CAV vision system
备注: Submitted to IEEE Transactions on Intelligent Vehicles
点击查看摘要
Abstract:This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.
43. 【2602.09736】oward Fine-Grained Facial Control in 3D Talking Head Generation
链接:https://arxiv.org/abs/2602.09736
作者:Shaoyang Xie,Xiaofeng Cong,Baosheng Yu,Zhipeng Gui,Jie Gui,Yuan Yan Tang,James Tin-Yau Kwok
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Audio-driven talking head, shown strong performance, Audio-driven talking, talking head generation, Gaussian Splatting
备注:
点击查看摘要
Abstract:Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
44. 【2602.09730】Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings
链接:https://arxiv.org/abs/2602.09730
作者:Laura Paul,Holger Rauhut,Martin Burger,Samira Kabri,Tim Roith
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
关键词:enabled non-invasive detailed, non-invasive detailed analysis, Recent advances, imaging technologies, supporting their documentation
备注:
点击查看摘要
Abstract:Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.
45. 【2602.09717】From Lightweight CNNs to SpikeNets: Benchmarking Accuracy-Energy Tradeoffs with Pruned Spiking SqueezeNet
链接:https://arxiv.org/abs/2602.09717
作者:Radib Bin Kabir,Tawsif Tashwar Dipto,Mehedi Ahamed,Sabbir Ahmed,Md Hasanul Kabir
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
关键词:Convolutional Neural Networks, Spiking Neural Networks, Convolutional Neural, Neural Networks, Spiking Neural
备注:
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) are increasingly studied as energy-efficient alternatives to Convolutional Neural Networks (CNNs), particularly for edge intelligence. However, prior work has largely emphasized large-scale models, leaving the design and evaluation of lightweight CNN-to-SNN pipelines underexplored. In this paper, we present the first systematic benchmark of lightweight SNNs obtained by converting compact CNN architectures into spiking networks, where activations are modeled with Leaky-Integrate-and-Fire (LIF) neurons and trained using surrogate gradient descent under a unified setup. We construct spiking variants of ShuffleNet, SqueezeNet, MnasNet, and MixNet, and evaluate them on CIFAR-10, CIFAR-100, and TinyImageNet, measuring accuracy, F1-score, parameter count, computational complexity, and energy consumption. Our results show that SNNs can achieve up to 15.7x higher energy efficiency than their CNN counterparts while retaining competitive accuracy. Among these, the SNN variant of SqueezeNet consistently outperforms other lightweight SNNs. To further optimize this model, we apply a structured pruning strategy that removes entire redundant modules, yielding a pruned architecture, SNN-SqueezeNet-P. This pruned model improves CIFAR-10 accuracy by 6% and reduces parameters by 19% compared to the original SNN-SqueezeNet. Crucially, it narrows the gap with CNN-SqueezeNet, achieving nearly the same accuracy (only 1% lower) but with an 88.1% reduction in energy consumption due to sparse spike-driven computations. Together, these findings establish lightweight SNNs as practical, low-power alternatives for edge deployment, highlighting a viable path toward deploying high-performance, low-power intelligence on the edge.
46. 【2602.09713】Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
链接:https://arxiv.org/abs/2602.09713
作者:Ruisi Zhao,Haoren Zheng,Zongxin Yang,Hehe Fan,Yi Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:deformation and animation, Skeletal Graph VAE, assets are fundamental, Skeletal Graph, Controllable Skeleton Generation
备注: Accepted by ICLR 2026
点击查看摘要
Abstract:Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready to animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
47. 【2602.09708】Physics-informed diffusion models in spectral space
链接:https://arxiv.org/abs/2602.09708
作者:Davide Gallon,Philippe von Wurstemberger,Patrick Cheridito,Arnulf Jentzen
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
关键词:parametric partial differential, inverse PDE problems, combines generative latent, partial differential equations, physics-informed machine learning
备注: 24 pages, 9 figures
点击查看摘要
Abstract:We propose a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier--Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at this https URL.
48. 【2602.09701】GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
链接:https://arxiv.org/abs/2602.09701
作者:Sandesh Hegde,Jaison Saji Chacko,Debarshi Banerjee,Uma Mahesh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:study fine-grained referring, referring image segmentation, fine-grained referring image, study fine-grained, fine-grained referring
备注:
点击查看摘要
Abstract:We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.09701 [cs.CV]
(or
arXiv:2602.09701v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.09701
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
49. 【2602.09686】Semi-supervised Liver Segmentation and Patch-based Fibrosis Staging with Registration-aided Multi-parametric MRI
链接:https://arxiv.org/abs/2602.09686
作者:Boya Wang,Ruizhe Li,Chao Chen,Xin Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:accurate disease staging, precise liver segmentation, Liver fibrosis poses, liver fibrosis staging, Liver fibrosis
备注:
点击查看摘要
Abstract:Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework developed for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multiparametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employed a patchbased method which allows the visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multimodality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. Github link: this https URL
50. 【2602.09662】reeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
链接:https://arxiv.org/abs/2602.09662
作者:Deyang Jiang,Jing Huang,Xuanle Zhao,Lei Chen,Liming Zheng,Fanfan Liu,Haibo Qiu,Peng Shi,Zhixiong Zeng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Effectively scaling GUI, sophisticated data collection, scaling GUI grounding, existing work primarily, crucial GUI planning
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (\emph{i.e.}, trajectory difficulty) and breadth (\emph{i.e.}, trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at this https URL.
51. 【2602.09648】me2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation
链接:https://arxiv.org/abs/2602.09648
作者:Siyu Chen,Ting Han,Haoling Huang,Chaolei Wang,Chengzheng Fu,Duxin Zhu,Guorong Cai,Jinhe Su
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video Semantic Segmentation, Generalized Video Semantic, Domain Generalized Video, Semantic Segmentation, Generalized Video
备注:
点击查看摘要
Abstract:Domain Generalized Video Semantic Segmentation (DGVSS) is trained on a single labeled driving domain and is directly deployed on unseen domains without target labels and test-time adaptation while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, the Masked Temporal Consistency Loss is proposed to regularize temporal prediction discrepancies across different strides, and randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
52. 【2602.09638】VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
链接:https://arxiv.org/abs/2602.09638
作者:Hanqing Wang,Mingyu Liu,Xiaoyu Chen,Chengwei MA,Yiming Zhong,Wenti Yin,Yuhao Liu,Zhiqing Cui,Jiahao Yuan,Lu Dai,Zhiyuan Ma,Hui Xiong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:robotic manipulation, aims to highlight, highlight the actionable, actionable regions, crucial for robotic
备注:
点击查看摘要
Abstract:3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, \textit{VIDA}, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on \textit{VIDA}, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a \textit{spatial-aware} loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
53. 【2602.09637】owards Training-free Multimodal Hate Localisation with Large Language Models
链接:https://arxiv.org/abs/2602.09637
作者:Yueming Sun,Long Yang,Jianbo Jiao,Zeyu Fu
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:poses severe threats, online videos poses, videos poses severe, societal harmony, poses severe
备注:
点击查看摘要
Abstract:The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
54. 【2602.09617】AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception
链接:https://arxiv.org/abs/2602.09617
作者:Ruoxuan Feng,Yuxuan Zhou,Siyu Mei,Dongzhan Zhou,Pengwei Wang,Shaowei Cui,Bin Fang,Guocai Yao,Di Hu
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:contact-rich manipulation demands, manipulation demands robots, tactile, subtle surface deformations, capture subtle surface
备注: Accepted by ICLR 2026
点击查看摘要
Abstract:Real-world contact-rich manipulation demands robots to perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We consider that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames, while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that covers static object properties and dynamic physical attributes, as well as real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities-from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.
55. 【2602.09611】AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
链接:https://arxiv.org/abs/2602.09611
作者:Yue Li,Xin Yi,Dongsheng Shi,Yongyi Cui,Gerard de Melo,Linlin Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:Large Vision-Language Models, intellectual property protection, Vision-Language Models, protection in Large, Large Vision-Language
备注: preprint
点击查看摘要
Abstract:Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36\% AUC) and robust attack resilience (at least 88.61\% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
56. 【2602.09609】-Omni: a Unified Multimodal Framework for Video Generation and Editing
链接:https://arxiv.org/abs/2602.09609
作者:Jialun Liu,Yukuo Ma,Xiao Cao,Tian Li,Gonghu Shang,Haibin Huang,Chi Zhang,Xuelong Li,Cong Liu,Junqi Liu,Jiakui Hu,Robby T. Tan,Shiwen Zhang,Liying Yang,Xiaoyan Yang,Qizhen Weng,Xiangzhen Chang,Yuanzhi Liang,Yifan Xu,Zhiyong Huang,Zuoxin Li,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, video generation, substantially improved visual, improved visual fidelity, video
备注:
点击查看摘要
Abstract:Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
57. 【2602.09600】Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
链接:https://arxiv.org/abs/2602.09600
作者:Yuxi Wang,Wenqi Ouyang,Tianyi Wei,Yi Dong,Zhiqi Shen,Xingang Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low latency, long-term stability, essential for augmented, augmented reality, reality and embodied
备注:
点击查看摘要
Abstract:Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
58. 【2602.09587】MieDB-100k: A Comprehensive Dataset for Medical Image Editing
链接:https://arxiv.org/abs/2602.09587
作者:Yongfan Lai,Wen Qian,Bo Liu,Hongyan Li,Hao Luo,Fan Wang,Bohan Zhuang,Shenda Hong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:adapting multimodal generative, medical image editing, medical image, image editing, multimodal generative models
备注:
点击查看摘要
Abstract:The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthetic methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that model trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
59. 【2602.09586】Delving into Spectral Clustering with Vision-Language Representations
链接:https://arxiv.org/abs/2602.09586
作者:Bo Peng,Yuanwei Hu,Bo Liu,Ling Chen,Jie Lu,Zhen Fang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unsupervised data analysis, Spectral clustering, Kernel Spectral Clustering, data analysis, Neural Tangent Kernel
备注: ICLR26
点击查看摘要
Abstract:Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on \textbf{16} benchmarks -- including classical, large-scale, fine-grained and domain-shifted datasets -- manifest that our method consistently outperforms the state-of-the-art by a large margin.
60. 【2602.09566】ECG-IMN: Interpretable Mesomorphic Neural Networks for 12-Lead Electrocardiogram Interpretation
链接:https://arxiv.org/abs/2602.09566
作者:Vajira Thambawita,Jonas L. Isaksen,Jørgen K. Kanters,Hugo L. Hammer,Pål Halvorsen
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
关键词:achieved expert-level performance, automated electrocardiogram, achieved expert-level, Interpretable Mesomorphic Neural, Mesomorphic Neural Network
备注:
点击查看摘要
Abstract:Deep learning has achieved expert-level performance in automated electrocardiogram (ECG) diagnosis, yet the "black-box" nature of these models hinders their clinical deployment. Trust in medical AI requires not just high accuracy but also transparency regarding the specific physiological features driving predictions. Existing explainability methods for ECGs typically rely on post-hoc approximations (e.g., Grad-CAM and SHAP), which can be unstable, computationally expensive, and unfaithful to the model's actual decision-making process. In this work, we propose the ECG-IMN, an Interpretable Mesomorphic Neural Network tailored for high-resolution 12-lead ECG classification. Unlike standard classifiers, the ECG-IMN functions as a hypernetwork: a deep convolutional backbone generates the parameters of a strictly linear model specific to each input sample. This architecture enforces intrinsic interpretability, as the decision logic is mathematically transparent and the generated weights (W) serve as exact, high-resolution feature attribution maps. We introduce a transition decoder that effectively maps latent features to sample-wise weights, enabling precise localization of pathological evidence (e.g., ST-elevation, T-wave inversion) in both time and lead dimensions. We evaluate our approach on the PTB-XL dataset for classification tasks, demonstrating that the ECG-IMN achieves competitive predictive performance (AUROC comparable to black-box baselines) while providing faithful, instance-specific explanations. By explicitly decoupling parameter generation from prediction execution, our framework bridges the gap between deep learning capability and clinical trustworthiness, offering a principled path toward "white-box" cardiac diagnostics.
61. 【2602.09541】Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
链接:https://arxiv.org/abs/2602.09541
作者:Ziqiang Shi,Rujie Liu,Shanshan Yu,Satoshi Munakata,Koichi Shirahata
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Rapid progress, large vision-language models, achieved unprecedented performance, vision-language tasks, achieved unprecedented
备注: WACV 2026 (It was accepted in the first round, with an acceptance rate of 6%.)
点击查看摘要
Abstract:Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content - termed hallucination. To address this, we propose \textbf{Scalpel}, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.
62. 【2602.09534】AUHead: Realistic Emotional Talking Head Generation via Action Units Control
链接:https://arxiv.org/abs/2602.09534
作者:Jiayi Lyu,Leigang Qu,Wenjing Zhang,Hanyu Jiang,Kai Liu,Zhenglin Zhou,Xiaobo Xia,Jian Xue,Tat-Seng Chua
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fine-grained emotion control, film production, virtual avatars, interactive systems, critical for virtual
备注: [this https URL](https://openreview.net/forum?id=dmzlAUkulz&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DICLR.cc%2F2026%2FConference%2FAuthors%23your-submissions))
点击查看摘要
Abstract:Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e. , Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at this https URL
63. 【2602.09532】RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
链接:https://arxiv.org/abs/2602.09532
作者:Michael Baltaxe,Dan Levi,Sagie Benaim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Monocular Metric Depth, Metric Depth Estimation, accurate depth estimation, physically intelligent systems, Monocular Metric
备注:
点击查看摘要
Abstract:Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
64. 【2602.09531】DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
链接:https://arxiv.org/abs/2602.09531
作者:Bohan Fu,Guanyi Qin,Fazhan Zhang,Zihao Huang,Mingxuan Li,Runze Hu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Blind Image Quality, effectively capture subtle, Blind Image, Image Quality Assessment, subtle distortion cues
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:Blind Image Quality Assessment, aiming to replicate human perception of visual quality without reference, plays a key role in vision tasks, yet existing models often fail to effectively capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors, as methods typically learn shallow relationships between unified image features and quality scores, resulting in their insensitive nature to distortions and thus limiting their performance. To address this, we introduce this http URL, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling a reliable quality assessment. this http URL begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring the genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature as per its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of this http URL over current methods and showcase its excellence in terms of generalization and data efficiency.
65. 【2602.09529】SCA-Net: Spatial-Contextual Aggregation Network for Enhanced Small Building and Road Change Detection
链接:https://arxiv.org/abs/2602.09529
作者:Emad Gholibeigi,Abbas Koochari,Azadeh ZamaniFar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remote sensing imagery, Automated change detection, Automated change, environmental monitoring, urban management
备注: 6 pages, 2 figures, 3 tables. Submitted for review
点击查看摘要
Abstract:Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net's superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a significant 2.64% improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9% increase in IoU for small buildings, while reducing the training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.
66. 【2602.09528】SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem
链接:https://arxiv.org/abs/2602.09528
作者:Ziqiang Shi,Rujie Liu,Shanshan Yu,Satoshi Munakata,Koichi Shirahata
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, achieved significant success
备注: ICASSP 2026
点击查看摘要
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind-a novel framework reducing hallucinations via solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of Schrödinger, which achieves state-of-the-art performance while introducing only minimal computational overhead.
67. 【2602.09524】HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection
链接:https://arxiv.org/abs/2602.09524
作者:Han Zhou,Yuxuan Gao,Yinchao Du,Xuezhe Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern manufacturing inspection, Unsupervised industrial anomaly, industrial anomaly detection, manufacturing inspection, anomaly detection
备注: 14 pages, 6 figures, references added
点击查看摘要
Abstract:Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.
68. 【2602.09523】Singpath-VL Technical Report
链接:https://arxiv.org/abs/2602.09523
作者:Zhen Qiu,Kaiwen Xiao,Zhengwei Lu,Xiangyu Liu,Lei Zhao,Hao Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision-language large model, fill the vacancy, cervical cytology, vision-language large, multi-modal large language
备注:
点击查看摘要
Abstract:We present Singpath-VL, a vision-language large model, to fill the vacancy of AI assistant in cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
69. 【2602.09521】Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
链接:https://arxiv.org/abs/2602.09521
作者:Jingyi Wang,Fei Li,Rujie Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing Large Vision-Language, Large Vision-Language Models, Existing Large, exhibit insufficient visual, Vision-Language Models
备注:
点击查看摘要
Abstract:Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods present a limitation that boosting attention for all visual tokens inevitably increases attention to task irrelevant tokens. To tackle this challenge, we propose a training free attentional intervention algorithm to enhance the attention of task-relevant tokens based on the argument that task-relevant tokens generally demonstrate high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct the reweighting matrices to reallocate attention. Besides, to enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs, while preserving the accuracy and coherence of generated content.
70. 【2602.09518】A Universal Action Space for General Behavior Analysis
链接:https://arxiv.org/abs/2602.09518
作者:Hung-Shuo Chang,Yue-Cheng Yang,Yu-Hsi Chen,Wei-Hsin Chen,Chien-Yao Wang,James C. Liao,Chien-Chang Chen,Hen-Hsen Huang,Hong-Yuan Mark Liao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision, animal and human, challenging task, task in computer, Analyzing animal
备注:
点击查看摘要
Abstract:Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities-an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at this https URL.
71. 【2602.09515】Energy-Efficient Fast Object Detection on Edge Devices for IoT Systems
链接:https://arxiv.org/abs/2602.09515
作者:Mas Nurul Achmadiah,Afaroj Ahamad,Chi-Chia Sun,Wen-Kai Kuo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Internet of Things, presents an Internet, Jetson Orin Nano, paper presents, classifier for fast-object
备注: 14 pages, 12 figures
点击查看摘要
Abstract:This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. This method, with its shorter duration, is the most efficient and suitable for fast-object detection in IoT systems, which require energy-efficient applications compared to end-to-end methods. We have implemented this technique on three edge devices: AMD AlveoT M U50, Jetson Orin Nano, and Hailo-8T M AI Accelerator, and four models with artificial neural networks and transformer models. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently has high accuracy, low latency, and is highly energy-efficient. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that the proposed algorithm has improved the average accuracy gain by 28.314%, the average efficiency gain by 3.6 times, and the average latency reduction by 39.305% compared to the end-to-end method. Of all these classes, the faster objects are trains and airplanes. Experiments show that the accuracy percentage for trains and airplanes is lower than other categories. So, in tasks that require fast detection and accurate results, end-to-end methods can be a disaster because they cannot handle fast object detection. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm. It is well suited for applications in IoT systems, especially those that require fast-moving object detection and higher accuracy.
72. 【2602.09510】Robust Depth Super-Resolution via Adaptive Diffusion Sampling
链接:https://arxiv.org/abs/2602.09510
作者:Kun Wang,Yun Zhu,Pan Zhou,Na Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:robustly recovers high-resolution, arbitrarily degraded low-resolution, recovers high-resolution depth, high-resolution depth maps, degraded low-resolution inputs
备注:
点击查看摘要
Abstract:We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
73. 【2602.09506】Equilibrium contrastive learning for imbalanced image classification
链接:https://arxiv.org/abs/2602.09506
作者:Sumin Roh,Harim Kim,Ho Yun Lee,Il Yong Chun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:showed limited performance, Contrastive learning, class, predominant technique, technique in image
备注: 18 pages, 8 figures
点击查看摘要
Abstract:Contrastive learning (CL) is a predominant technique in image classification, but they showed limited performance with an imbalanced dataset. Recently, several supervised CL methods have been proposed to promote an ideal regular simplex geometric configuration in the representation space-characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes, as additional samples to consider all classes. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and classifiers, which could lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, where class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes the representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means), while balancing the contributions of class-average features and class prototypes. Second, ECL establishes a classifier-class center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments with three long-tailed datasets, the CIFAR-10(0)-LT, ImageNet-LT, and the two imbalanced medical datasets, the ISIC 2019 and our constructed LCCT dataset. Results show that ECL outperforms existing SOTA supervised CL methods designed for imbalanced classification.
74. 【2602.09494】OSI: One-step Inversion Excels in Extracting Diffusion Watermarks
链接:https://arxiv.org/abs/2602.09494
作者:Yuwei Chen,Zhenliang He,Jia Tang,Meina Kan,Shiguang Shan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Shading, extracting Gaussian Shading, OSI, important mechanism, mechanism for provenance
备注:
点击查看摘要
Abstract:Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and finetune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish the watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
75. 【2602.09483】Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
链接:https://arxiv.org/abs/2602.09483
作者:Lin Chen,Xiaoke Zhao,Kun Ding,Weiwei Feng,Changtao Miao,Zili Wang,Wenxuan Guo,Ying Wang,Kaiyuan Zheng,Bo Zhang,Zhe Li,Shiming Xiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Large Language, significant deployment challenges, substantial size poses
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extract capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at this https URL.
76. 【2602.09477】Weakly Supervised Contrastive Learning for Histopathology Patch Embeddings
链接:https://arxiv.org/abs/2602.09477
作者:Bodong Zhang,Xiwen Li,Hamid Manoochehri,Xiaoya Tang,Deepika Sirohi,Beatrice S. Knudsen,Tolga Tasdizen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:provide gigapixel-scale high-resolution, Digital histopathology, digital histopathology image, gigapixel-scale high-resolution images, provide gigapixel-scale
备注:
点击查看摘要
Abstract:Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to the limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches in three datasets. Our related code is available at this http URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.09477 [cs.CV]
(or
arXiv:2602.09477v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.09477
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
77. 【2602.09476】FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation
链接:https://arxiv.org/abs/2602.09476
作者:Chuanhai Zang,Jiabao Hu,XW Song
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Synthetic data provide, severe domain shift, data provide low-cost, accurately annotated samples, geometry-sensitive vision tasks
备注: 26 pages, 13 figures, 2 tables. Code available at [this https URL](https://github.com/tryzang/FD-DB)
点击查看摘要
Abstract:Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
78. 【2602.09475】ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs
链接:https://arxiv.org/abs/2602.09475
作者:James Burgess,Rameen Abdal,Dan Stoddart,Sergey Tulyakov,Serena Yeung-Levy,Kuan-Chieh Jackson Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:produce strikingly realistic, Modern image generators, generators produce strikingly, strikingly realistic images, Modern image
备注: [this https URL](https://jmhb0.github.io/ArtifactLens/)
点击查看摘要
Abstract:Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.
79. 【2602.09472】LLM-Grounded Dynamic Task Planning with Hierarchical Temporal Logic for Human-Aware Multi-Robot Collaboration
链接:https://arxiv.org/abs/2602.09472
作者:Shuyuan Hu,Tao Lin,Kai Ye,Yang Yang,Tianwei Zhang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Language Models, lack kinematic feasibility, Large Language, open-world multi-robot tasks
备注:
点击查看摘要
Abstract:While Large Language Models (LLM) enable non-experts to specify open-world multi-robot tasks, the generated plans often lack kinematic feasibility and are not efficient, especially in long-horizon scenarios. Formal methods like Linear Temporal Logic (LTL) offer correctness and optimal guarantees, but are typically confined to static, offline settings and struggle with computational scalability. To bridge this gap, we propose a neuro-symbolic framework that grounds LLM reasoning into hierarchical LTL specifications and solves the corresponding Simultaneous Task Allocation and Planning (STAP) problem. Unlike static approaches, our system resolves stochastic environmental changes, such as moving users or updated instructions via a receding horizon planning (RHP) loop with real-time perception, which dynamically refines plans through a hierarchical state space. Extensive real-world experiments demonstrate that our approach significantly outperforms baseline methods in success rate and interaction fluency while minimizing planning latency.
80. 【2602.09449】Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing
链接:https://arxiv.org/abs/2602.09449
作者:Yan Luo,Henry Huang,Todd Y. Zhou,Mengyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ordinary differential equations, deterministic ordinary differential, Recent advances, reformulated diffusion models, training-free flow matching
备注:
点击查看摘要
Abstract:Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, Modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches based on future and past velocity $v$ and latent trajectory $z$ information that refine the generative path directly in latent space. We propose two training-free trajectory smoothing schemes: \emph{Look-Ahead}, which averages the current and next-step latents using a curvature-gated weight, and \emph{Look-Back}, which smoothes latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
81. 【2602.09446】A Scoping Review of Deep Learning for Urban Visual Pollution and Proposal of a Real-Time Monitoring Framework with a Visual Pollution Index
链接:https://arxiv.org/abs/2602.09446
作者:Mohammad Masudur Rahman,Md. Rashedur Rahman,Ashraful Islam,Saadia B Alam,M Ashraful Amin
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Visual Pollution, application remains fragmented, Urban Visual Pollution, visual pollution management, visual pollution index
备注:
点击查看摘要
Abstract:Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps the existing deep learning-based approaches for detecting, classifying, and designing a comprehensive application framework for visual pollution management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, and 26 articles were found. Most research focuses on specific pollutant categories and employs variations of YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, yet they tend to be geographically skewed. We proposed a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution for a certain area. This review highlights the need for a unified UVP management system that incorporates pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.
82. 【2602.09439】Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
链接:https://arxiv.org/abs/2602.09439
作者:Xu Ma,Yitian Zhang,Qihua Dong,Yun Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:open datasets remain, remain a major, major bottleneck, open, datasets remain
备注: Dataset: [this https URL](https://huggingface.co/datasets/ma-xu/fine-t2i)
点击查看摘要
Abstract:High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.
83. 【2602.09432】SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL
链接:https://arxiv.org/abs/2602.09432
作者:Yang Zhao,Shizhao Sun,Meisheng Zhang,Yingdong Shi,Xubo Yang,Jiang Bian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scene synthesis methods, Current one-pass, scene synthesis, deliberative reasoning, synthesis methods
备注:
点击查看摘要
Abstract:Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative ``diagnose-and-act'' loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
84. 【2602.09431】Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models
链接:https://arxiv.org/abs/2602.09431
作者:Xinwei Zhang,Li Bai,Tianwei Zhang,Youqian Zhang,Qingqing Ye,Yingnan Zhao,Ruochen Du,Haibo Hu
类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large vision-language models, achieved impressive success, Large vision-language, significant adversarial threats, achieved impressive
备注: Under review; 21 pages
点击查看摘要
Abstract:Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study towards encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform in-depth analysis, disclosing two root causes that hinder the transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework to enhance the transferability. Inspired by the discovered causes in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
85. 【2602.09425】Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification
链接:https://arxiv.org/abs/2602.09425
作者:Yiqiao Li,Bo Shang,Jie Wei
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:intelligent transportation systems, face scalability challenges, labor-intensive manual annotation, scalability challenges due, Fine-grained truck classification
备注: 12 pages, 10 figures, 4 tables
点击查看摘要
Abstract:Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes $k 4$, but degrades accuracy in more-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves over correct classification rate of 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without the costly training or fine-tuning, significantly reducing the intensive demands of initial manual labeling, thus achieving a method of practical use in ITS applications.
86. 【2602.09415】Stability and Concentration in Nonlinear Inverse Problems with Block-Structured Parameters: Lipschitz Geometry, Identifiability, and an Application to Gaussian Splatting
链接:https://arxiv.org/abs/2602.09415
作者:Joe-Mei Feng,Hsin-Hsiung Kao
类目:Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
关键词:develop an operator-theoretic, operator-theoretic framework, nonlinear inverse problems, global Lipschitz bounds, statistical concentration
备注:
点击查看摘要
Abstract:We develop an operator-theoretic framework for stability and statistical concentration in nonlinear inverse problems with block-structured parameters. Under a unified set of assumptions combining blockwise Lipschitz geometry, local identifiability, and sub-Gaussian noise, we establish deterministic stability inequalities, global Lipschitz bounds for least-squares misfit functionals, and nonasymptotic concentration estimates. These results yield high-probability parameter error bounds that are intrinsic to the forward operator and independent of any specific reconstruction algorithm. As a concrete instantiation, we verify that the Gaussian Splatting rendering operator satisfies the proposed assumptions and derive explicit constants governing its Lipschitz continuity and resolution-dependent observability. This leads to a fundamental stability--resolution tradeoff, showing that estimation error is inherently constrained by the ratio between image resolution and model complexity. Overall, the analysis characterizes operator-level limits for a broad class of high-dimensional nonlinear inverse problems arising in modern imaging and differentiable rendering.
87. 【2602.09413】LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging
链接:https://arxiv.org/abs/2602.09413
作者:Xinyu Wang,Ke Deng,Fei Dou,Jinbo Bi,Jin Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:combine multiple fine-tuned, multiple fine-tuned models, single multi-task model, training data, aims to combine
备注: 14 pages, 9 figures, 6 tables
点击查看摘要
Abstract:Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation, and show it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
88. 【2602.09411】K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge
链接:https://arxiv.org/abs/2602.09411
作者:Zhikai Li,Jiatong Li,Xuewen Liu,Wangbo Zhao,Pan Du,Kaicheng Zhou,Qingyi Gu,Yang You,Zhen Dong,Kurt Keutzer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generative models raises, visual generative models, rapid development, development of visual, visual generative
备注: ICLR 2026. Code is available at: [this https URL](https://github.com/zkkli/K-Sort-Eval)
点击查看摘要
Abstract:The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language model (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach lead to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provide the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
89. 【2602.09407】Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D
链接:https://arxiv.org/abs/2602.09407
作者:Yan Luo,Advaith Ravishankar,Serena Liu,Yutong Yang,Mengyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long wait times, volumetric imaging remains, imaging remains costly, understanding of anatomy, treatment planning
备注:
点击查看摘要
Abstract:A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundations models can solve this issue by reconstructing 3D data from 2D modalites. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natrual datasets, using voxel based metrics and point cloud distance metrics. Across medical datasets, voxel based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground truth medical 3D data, while alternative models are more prone to over-simplication of reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
90. 【2602.09378】Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation
链接:https://arxiv.org/abs/2602.09378
作者:Jun Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large pixel-wise labeled, leveraging unlabeled data, Semi-supervised learning relaxes, large pixel-wise, leveraging unlabeled
备注: Accepted by ESWA 2026
点击查看摘要
Abstract:Semi-supervised learning relaxes the need of large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method's state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on github upon acceptance.
91. 【2602.09355】Impact of domain adaptation in deep learning for medical image classifications
链接:https://arxiv.org/abs/2602.09355
作者:Yihang Wu,Ahmad Chaddad
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:quickly expanding area, Domain adaptation, quickly expanding, expanding area, area in machine
备注: Accepted in IEEE SMC 2025
点击查看摘要
Abstract:Domain adaptation (DA) is a quickly expanding area in machine learning that involves adjusting a model trained in one domain to perform well in another domain. While there have been notable progressions, the fundamental concept of numerous DA methodologies has persisted: aligning the data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve the model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application in four medical image datasets. We have considered various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 in a brain tumor (BT) data set results in an enhancement of 4.7\% in model performance. Similarly, the use of DA can reduce the impact of Gaussian noise, as it provides $\sim 3\%$ accuracy increase using ResNet34 on a BT dataset. Furthermore, simply introducing DA into FL framework shows limited potential (e.g., $\sim 0.3\%$ increase in performance) for skin cancer classification. In addition, the DA method can improve the interpretability of the models using the gradcam++ technique, which offers clinical values. Calibration analysis also demonstrates that using DA provides a lower expected calibration error (ECE) value $\sim 2\%$ compared to CNN alone on a multi-modality dataset.
92. 【2602.09337】Kyrtos: A methodology for automatic deep analysis of graphic charts with curves in technical documents
链接:https://arxiv.org/abs/2602.09337
作者:Michail S. Alexiou,Nikolaos G. Bourbakis
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:great potential due, valuable knowledge contained, Deep Understanding, Technical Documents, Understanding of Technical
备注:
点击查看摘要
Abstract:Deep Understanding of Technical Documents (DUTD) has become a very attractive field with great potential due to large amounts of accumulated documents and the valuable knowledge contained in them. In addition, the holistic understanding of technical documents depends on the accurate analysis of its particular modalities, such as graphics, tables, diagrams, text, etc. and their associations. In this paper, we introduce the Kyrtos methodology for the automatic recognition and analysis of charts with curves in graphics images of technical documents. The recognition processing part adopts a clustering based approach to recognize middle-points that delimit the line-segments that construct the illustrated curves. The analysis processing part parses the extracted line-segments of curves to capture behavioral features such as direction, trend and etc. These associations assist the conversion of recognized segments' relations into attributed graphs, for the preservation of the curves' structural characteristics. The graph relations are also are expressed into natural language (NL) text sentences, enriching the document's text and facilitating their conversion into Stochastic Petri-net (SPN) graphs, which depict the internal functionality represented in the chart image. Extensive evaluation results demonstrate the accuracy of Kyrtos' recognition and analysis methods by measuring the structural similarity between input chart curves and the approximations generated by Kyrtos for charts with multiple functions.
93. 【2602.09324】Deep Modeling and Interpretation for Bladder Cancer Classification
链接:https://arxiv.org/abs/2602.09324
作者:Ahmad Chaddad,Yihang Wu,Xianrui Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:convolutional neural network, demonstrated remarkable performance, Deep models based, bladder cancer, neural network
备注: Accepted in IEEE SMC 2025
点击查看摘要
Abstract:Deep models based on vision transformer (ViT) and convolutional neural network (CNN) have demonstrated remarkable performance on natural datasets. However, these models may not be similar in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transormer-based models), 2) calibration analysis to examine if these models are well calibrated for bladder cancer classification, and 3) we use GradCAM++ to evaluate the interpretability of these models for clinical diagnosis. We simulate $\sim 300$ experiments on a publicly multicenter bladder cancer dataset, and the experimental results demonstrate that the ConvNext series indicate limited generalization ability to classify bladder cancer images (e.g., $\sim 60\%$ accuracy). In addition, ViTs show better calibration effects compared to ConvNext and swin transformer series. We also involve test time augmentation to improve the models interpretability. Finally, no model provides a one-size-fits-all solution for a feasible interpretable model. ConvNext series are suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
94. 【2602.09318】GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification
链接:https://arxiv.org/abs/2602.09318
作者:Lin-Guo Gao,Suxing Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:conventional deep learning, early oncological diagnosis, deep learning architectures, encounter performance degradation, breast cancer histopathology
备注:
点击查看摘要
Abstract:Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic this http URL, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a "blackbox" nature, hindering their clinical integration. To mitigate these limitations, we propose GAFRNet, a robust and interpretable Graph Attention and FuzzyRule Network specifically engineered for histopathology image classification with scarce supervision. GAFRNet constructs a similarity-driven graph representation to model intersample relationships and employs a multihead graph attention mechanism to capture complex relational features across heterogeneous tissue this http URL, a differentiable fuzzy-rule module encodes intrinsic topological descriptorsincluding node degree, clustering coefficient, and label consistencyinto explicit, human-understandable diagnostic logic. This design establishes transparent "IF-THEN" mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
95. 【2602.09315】A Deep Multi-Modal Method for Patient Wound Healing Assessment
链接:https://arxiv.org/abs/2602.09315
作者:Subba Reddy Oota,Vijay Rowtula,Shahid Mohammed,Jeffrey Galitz,Minghsun Liu,Manish Gupta
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:wound care costs, high wound care, care costs, wound, Hospitalization
备注: 4 pages, 2 figures
点击查看摘要
Abstract:Hospitalization of patients is one of the major factors for high wound care costs. Most patients do not acquire a wound which needs immediate hospitalization. However, due to factors such as delay in treatment, patient's non-compliance or existing co-morbid conditions, an injury can deteriorate and ultimately lead to patient hospitalization. In this paper, we propose a deep multi-modal method to predict the patient's risk of hospitalization. Our goal is to predict the risk confidently by collectively using the wound variables and wound images of the patient. Existing works in this domain have mainly focused on healing trajectories based on distinct wound types. We developed a transfer learning-based wound assessment solution, which can predict both wound variables from wound images and their healing trajectories, which is our primary contribution. We argue that the development of a novel model can help in early detection of the complexities in the wound, which might affect the healing process and also reduce the time spent by a clinician to diagnose the wound.
96. 【2602.09284】X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging
链接:https://arxiv.org/abs/2602.09284
作者:Pranav Kulkarni,Junfeng Guo,Heng Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:High-quality medical imaging, High-quality medical, deep learning models, training deep learning, medical imaging datasets
备注:
点击查看摘要
Abstract:High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images, as static watermark patterns generated in fixed-scale images scale poorly dynamic and high-resolution scans with limited visual diversity and subtle anatomical structures, while preserving diagnostic quality. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy, robustness against dynamic scaling processes while preserving diagnostic quality and visual-distinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, achieving WSR of 100% and reducing probability of false positives in Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
97. 【2602.09268】Rethinking Global Text Conditioning in Diffusion Transformers
链接:https://arxiv.org/abs/2602.09268
作者:Nikita Starodubcev,Daniil Pakhomov,Zongze Wu,Ilya Drobyshevskiy,Yuchen Liu,Zhonghao Wang,Yuqian Zhou,Zhe Lin,Dmitry Baranchuk
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:transformers typically incorporate, typically incorporate textual, Diffusion transformers typically, incorporate textual information, modulation-based text conditioning
备注: Accepted at ICLR26
点击查看摘要
Abstract:Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective-serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
98. 【2602.09252】VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models
链接:https://arxiv.org/abs/2602.09252
作者:Ange Lou,Yamin Li,Qi Chang,Nan Xi,Luyuan Xie,Zichao Li,Tianyu Luan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
关键词:intraoperative guidance, Surgical image segmentation, essential for robot-assisted, robot-assisted surgery, surgery and intraoperative
备注:
点击查看摘要
Abstract:Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
99. 【2602.09216】owards Human-AI Accessibility Mapping in India: VLM-Guided Annotations and POI-Centric Analysis in Chandigarh
链接:https://arxiv.org/abs/2602.09216
作者:Varchita Lalwani,Utkarsh Agarwal,Michael Saugstad,Manish Kumar,Jon E. Froehlich,Anupam Sobti
类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:Google Street View, Street View, Google Street, web-based platform, city-scale by virtually
备注: Accepted at the First Workshop on AI for Urban Planning (AI4Up) at AAAI 2025
点击查看摘要
Abstract:Project Sidewalk is a web-based platform that enables crowdsourcing accessibility of sidewalks at city-scale by virtually walking through city streets using Google Street View. The tool has been used in 40 cities across the world, including the US, Mexico, Chile, and Europe. In this paper, we describe adaptation efforts to enable deployment in Chandigarh, India, including modifying annotation types, provided examples, and integrating VLM-based mission guidance, which adapts instructions based on a street scene and metadata analysis. Our evaluation with 3 annotators indicates the utility of AI-mission guidance with an average score of 4.66. Using this adapted Project Sidewalk tool, we conduct a Points of Interest (POI)-centric accessibility analysis for three sectors in Chandigarh with very different land uses, residential, commercial and institutional covering about 40 km of sidewalks. Across 40 km of roads audited in three sectors and around 230 POIs, we identified 1,644 of 2,913 locations where infrastructure improvements could enhance accessibility.
100. 【2602.09214】VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models
链接:https://arxiv.org/abs/2602.09214
作者:Chenyu Wang,Tianle Chen,H. M. Sabbir Ahmad,Kayhan Batmanghelich,Wenchao Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision-language models, behave safely, safely and reliably, vital for ensuring, ensuring that vision-language
备注:
点击查看摘要
Abstract:Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs, It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
101. 【2602.09209】Wearable environmental sensing to forecast how legged systems will interact with upcoming terrain
链接:https://arxiv.org/abs/2602.09209
作者:Michael D. Murray,James Tung,Richard W. Nuckols
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:environment is underexplored, environmental classification, classification during gait, ability to predict, contact a changing
备注: 19 pages excluding references and comments, 5 figures, 3 tables
点击查看摘要
Abstract:Computer-vision (CV) has been used for environmental classification during gait and is often used to inform control in assistive systems; however, the ability to predict how the foot will contact a changing environment is underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center-of-pressure (COP) and time-of-impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while performing the task of stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean-absolute-error (MAE) at 150, 100, and 50ms FH was 29.42mm, 26.82, and 23.72mm respectively. The TOI MAE was 21.14, 20.08, and 17.73ms for 150, 100, and 50ms respectively. While torso velocity had no effect on the error in either task, faster toe-swing speeds prior to foot-strike were found to improve the prediction accuracy in the COP case, however, was insignificant in the TOI case. Further, more anterior foot-strikes were found to reduce COP prediction accuracy but did not affect the TOI prediction accuracy. We also found that our lightweight model was capable at running at 60 FPS on either a consumer grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data was feasible using a lightweight model, which may have important implications for anticipatory control in assistive systems.
102. 【2602.09165】All-in-One Conditioning for Text-to-Image Synthesis
链接:https://arxiv.org/abs/2602.09165
作者:Hirunima Jayasekara,Chuong Huynh,Yixuan Ren,Christabel Acquaye,Abhinav Shrivastava
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:involving multiple objects, Accurate interpretation, complex prompts involving, prompts involving multiple, multiple objects
备注:
点击查看摘要
Abstract:Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Eventhough, prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
103. 【2602.09155】Decoding Future Risk: Deep Learning Analysis of Tubular Adenoma Whole-Slide Images
链接:https://arxiv.org/abs/2602.09155
作者:Ahmed Rahu,Brian Shula,Brandon Combs,Aqsa Sultana,Surendra P. Singh,Vijayan K. Asari,Derrick Forchetti
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:prophylactic initiatives aimed, removing precancerous polyps, remains a significant, cancer-related mortality, widespread implementation
备注: 20 pages, 5 figures
点击查看摘要
Abstract:Colorectal cancer (CRC) remains a significant cause of cancer-related mortality, despite the widespread implementation of prophylactic initiatives aimed at detecting and removing precancerous polyps. Although screening effectively reduces incidence, a notable portion of patients initially diagnosed with low-grade adenomatous polyps will still develop CRC later in life, even without the presence of known high-risk syndromes. Identifying which low-risk patients are at higher risk of progression is a critical unmet need for tailored surveillance and preventative therapeutic strategies. Traditional histological assessment of adenomas, while fundamental, may not fully capture subtle architectural or cytological features indicative of malignant potential. Advancements in digital pathology and machine learning provide an opportunity to analyze whole-slide images (WSIs) comprehensively and objectively. This study investigates whether machine learning algorithms, specifically convolutional neural networks (CNNs), can detect subtle histological features in WSIs of low-grade tubular adenomas that are predictive of a patient's long-term risk of developing colorectal cancer.
104. 【2602.09154】A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video
链接:https://arxiv.org/abs/2602.09154
作者:Andrea Filiberto Lucas,Dylan Seychell
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:growing volume, volume of video-based, video-based news content, content has heightened, extract on-screen information
备注: 7 pages, 5 figures. Accepted for publication at the 2026 IEEE Conference on Artificial Intelligence (CAI)
点击查看摘要
Abstract:The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
Comments:
7 pages, 5 figures. Accepted for publication at the 2026 IEEE Conference on Artificial Intelligence (CAI)
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
ACMclasses:
I.2.10; I.4.8
Cite as:
arXiv:2602.09154 [cs.CV]
(or
arXiv:2602.09154v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.09154
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
105. 【2602.09153】SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
链接:https://arxiv.org/abs/2602.09153
作者:Nicholas Pfaff,Thomas Cohn,Sergey Zakharov,Rick Cory,Russ Tedrake
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:real indoor spaces, existing environments fail, evaluating home robots, key tool, tool for training
备注: Project page: [this https URL](https://scenesmith.github.io/)
点击查看摘要
Abstract:Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages$\unicode{x2013}$from architectural layout to furniture placement to small object population$\unicode{x2013}$each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with 2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
106. 【2602.09146】SemanticMoments: Training-Free Motion Similarity via Third Moment Features
链接:https://arxiv.org/abs/2602.09146
作者:Saar Huberman,Kfir Goldberg,Or Patashnik,Sagie Benaim,Ron Mokady
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Retrieving videos based, Retrieving videos, Retrieving, videos based, motion
备注:
点击查看摘要
Abstract:Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
107. 【2602.09109】Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide
链接:https://arxiv.org/abs/2602.09109
作者:Hossam Amer,Rezaul Karim,Ali Pourranjbar,Weiwei Zhang,Walid Ahmed,Boxing Chen
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:rapid growth, wide range, range of methods, developed to distribute, memory across hardware
备注:
点击查看摘要
Abstract:With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade offs and how such insights can inform principled methodology for designing optimal distributed systems remain limited. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights to guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large scale model development.
108. 【2602.09084】Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
链接:https://arxiv.org/abs/2602.09084
作者:Ruijie Ye,Jiayi Zhang,Zhuoxin Liu,Zihao Zhu,Siyuan Yang,Li Li,Tianfu Fu,Franck Dernoncourt,Yue Zhao,Jiacheng Zhu,Ryan Rossi,Wenhao Chai,Zhengzhong Tu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:alter object faithfulness, study instruction-based image, Agent Banana, editors often over-edit, persistent challenges
备注: Project Website: [this http URL](http://agent-banana.github.io)
点击查看摘要
Abstract:We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
109. 【2602.09082】UI-Venus-1.5 Technical Report
链接:https://arxiv.org/abs/2602.09082
作者:Veuns-Team:Changlong Gao,Zhangxuan Gu,Yulin Liu,Xinyu Qiu,Shuheng Shen,Yue Wen,Tianyu Xia,Zhenyu Xu,Zhengwen Zeng,Beitong Zhou,Xingran Zhou,Weizhi Chen,Sunhao Dai,Jingya Dou,Yichen Gong,Yuan Guo,Zhenlin Guo,Feng Li,Qian Li,Jinzhen Lin,Yuqi Zhou,Linchao Zhu,Liang Chen,Zhenyu Guo,Changhua Meng,Weiqiang Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Online Reinforcement Learning, unified GUI Agent, GUI Agent designed, GUI Agent constructed, foundational GUI semantics
备注:
点击查看摘要
Abstract:GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains this http URL this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world this http URL proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application this http URL to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: this https URL Model: this https URL
110. 【2510.21112】LiDAR-based 3D Change Detection at City Scale
链接:https://arxiv.org/abs/2510.21112
作者:Hezam Albagami,Haitian Wang,Xinyu Wang,Muhammad Ibrahim,Zainy M. Malakan,Abdullah M. Alqamdi,Mohammed H. Alghamdi,Ajmal Mian
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:city maps enable, maps enable city, enable city planning, city maps, Digital Surface Model
备注:
点击查看摘要
Abstract:High-definition 3D city maps enable city planning and change detection, which is essential for municipal compliance, map maintenance, and asset monitoring, including both built structures and urban greenery. Conventional Digital Surface Model (DSM) and image differencing are sensitive to vertical bias and viewpoint mismatch, while original point cloud or voxel models require large memory, assume perfect alignment, and degrade thin structures. We propose an uncertainty-aware, object-centric method for city-scale LiDAR-based change detection. Our method aligns data from different time periods using multi-resolution Normal Distributions Transform (NDT) and a point-to-plane Iterative Closest Point (ICP) method, normalizes elevation, and computes a per-point level of detection from registration covariance and surface roughness to calibrate change decisions. Geometry-based associations are refined by semantic and instance segmentation and optimized using class-constrained bipartite assignment with augmented dummies to handle split-merge cases. Tiled processing bounds memory and preserves narrow ground changes, while instance-level decisions integrate overlap, displacement, and volumetric differences under local detection gating. We perform experiments on a Subiaco (Western Australia) dataset captured in 2023 and again in 2025. Our method achieves 95.3% accuracy, 90.8% mF1, and 82.9% mIoU, improving over the strongest baseline, Triplet KPConv, by 0.3, 0.6, and 1.1 points, respectively. The datasets are available on IEEE DataPort (2023: this https URL and 2025: this https URL). The source code is available at this https URL.
111. 【2510.04772】Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
链接:https://arxiv.org/abs/2510.04772
作者:Max Kirchner,Hanna Hoffmann,Alexander C. Jenke,Oliver L. Saldanha,Kevin Pfeiffer,Weam Kanjo,Julia Alekseenko,Claas de Boer,Santhi Raj Kolamuri,Lorenzo Mazza,Nicolas Padoy,Sophia Bano,Annika Reinke,Lena Maier-Hein,Danail Stoyanov,Jakob N. Kather,Fiona R. Kolbinger,Sebastian Bodenstedt,Stefanie Speidel
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:art in federated, surgical video classification, Purpose, video classification, federated learning
备注: A challenge report pre-print (31 pages), including 7 tables and 8 figures
点击查看摘要
Abstract:Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
112. 【2602.09050】SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy
链接:https://arxiv.org/abs/2602.09050
作者:Jiahao Qin
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Toggle, introduces severe spatiotemporal, Toggle Hugging Face, code, Code Toggle Papers
备注: 21 pages, 6 figures, 3 tables
点击查看摘要
Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional scanning enables rapid functional brain imaging but introduces severe spatiotemporal misalignment from coupled scan-direction-dependent domain shift and geometric distortion. Conventional registration methods rely on brightness constancy, an assumption violated under bidirectional scanning, leading to unreliable alignment. A unified scene-appearance separation framework is proposed to jointly address domain shift and spatial misalignment. The proposed architecture separates domain-invariant scene content from domain-specific appearance characteristics, enabling cross-domain reconstruction with geometric preservation. A scene consistency loss promotes geometric correspondence in the latent space, linking domain shift correction with spatial registration within a single framework. For in vivo mouse brain vasculature imaging, the proposed method achieves normalized cross-correlation (NCC) of 0.961 and structural similarity index (SSIM) of 0.894, substantially outperforming conventional methods. Ablation studies demonstrate that domain alignment loss is critical, with its removal causing 82% NCC reduction (0.961 to 0.175), while scene consistency and cycle consistency losses provide complementary regularization for optimal performance. The method achieves 11.2 ms inference time per frame (86 fps), substantially exceeding typical OR-PAM acquisition rates and enabling real-time processing. These results suggest that the proposed framework enables robust high-speed bidirectional OR-PAM for reliable quantitative and longitudinal functional imaging. The code will be publicly available at this https URL
Comments:
21 pages, 6 figures, 3 tables
Subjects:
Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
MSC classes:
68U10, 92C55
ACMclasses:
I.4.3; I.2.6; J.3
Cite as:
arXiv:2602.09050 [eess.IV]
(or
arXiv:2602.09050v1 [eess.IV] for this version)
https://doi.org/10.48550/arXiv.2602.09050
Focus to learn more
arXiv-issued DOI via DataCite
Submission history From: Jiahao Qin [view email] [v1]
Fri, 6 Feb 2026 21:01:27 UTC (19,772 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy, by Jiahao QinView PDFHTML (experimental)TeX Source
view license
Current browse context: eess.IV
prev
|
next
new
|
recent
| 2026-02
Change to browse by:
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
113. 【2508.08232】Mamba-FCS: Joint Spatio- Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing
链接:https://arxiv.org/abs/2508.08232
作者:Buddhi Wijenayake,Athulya Ratnayake,Praveen Sumanasekara,Roshan Godaliyadda,Parakrama Ekanayake,Vijitha Herath,Nichula Wasalathilaka
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:Convolutional Neural Networks, class-imbalanced land-cover transitions, imagery requires models, requires models balancing, balancing extensive spatial
备注: 19 pages, 13 Figures
点击查看摘要
Abstract:Semantic Change Detection (SCD) from remote sensing imagery requires models balancing extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. While Convolutional Neural Networks excel at local feature extraction but lack global context, Transformers provide global modeling at high computational costs. Recent Mamba architectures based on state-space models offer compelling solutions through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, a SCD framework built upon Visual State Space Model backbone incorporating, a Joint Spatio-Frequency Fusion block incorporating log-amplitude frequency domain features to enhance edge clarity and suppress illumination artifacts, a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined BCD and SCD tasks, and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics, 88.62% Overall Accuracy, 65.78% F_scd, and 25.50% SeK on SECOND, 96.25% Overall Accuracy, 89.27% F_scd, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm distinct contributions of each novel component, with qualitative assessments highlighting significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by proposed techniques, setting a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be publicly available upon publication.




