本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新533篇论文,其中:
- 自然语言处理76篇
- 信息检索6篇
- 计算机视觉98篇
自然语言处理
1. 【2602.06964】Learning a Generative Meta-Model of LLM Activations
链接:https://arxiv.org/abs/2602.06964
作者:Grace Luo,Jiahai Feng,Trevor Darrell,Alec Radford,Jacob Steinhardt
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Existing approaches, analyzing neural network, rely on strong, neural network activations, approaches for analyzing
备注:
点击查看摘要
Abstract:Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: this https URL.
2. 【2602.06960】InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
链接:https://arxiv.org/abs/2602.06960
作者:Yuchen Yan,Liang Jiang,Jin Jiang,Shuaicheng Li,Zujie Wen,Zhiqiang Zhang,Jun Zhou,Jian Shao,Yueting Zhuang,Yongliang Shen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:context length limits, Large reasoning models, degraded reasoning due, models achieve strong, achieve strong performance
备注: Project Page: [this https URL](https://zju-real.github.io/InftyThink-Plus) Code: [this https URL](https://github.com/ZJU-REAL/InftyThink-Plus)
点击查看摘要
Abstract:Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
3. 【2602.06953】DAWN: Dependency-Aware Fast Inference for Diffusion LLMs
链接:https://arxiv.org/abs/2602.06953
作者:Lizhuo Luo,Zhuoran Shi,Jiajun Luo,Zhi Wang,Shen Ren,Wenya Wang,Tianwei Zhang
类目:Computation and Language (cs.CL)
关键词:Diffusion large language, Diffusion large, large language models, large language, shown advantages
备注:
点击查看摘要
Abstract:Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality--speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled. Thus, the correct choice at one position constrains valid choices at others. Without modeling these inter-token dependencies, parallel strategies produce deteriorated outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and leverages two key motivations: (1) positions dependent on unmasked certain positions become more reliable, (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Given those findings, DAWN leverages a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speedups the inference by 1.80-8.06x over baselines while preserving the generation quality. Code is released at this https URL.
4. 【2602.06942】Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
链接:https://arxiv.org/abs/2602.06942
作者:Duygu Altinok
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:neural language modeling, morphologically rich languages, pivotal design choice, productive agglutination challenges, tokenizer training corpus
备注: Submitted to Cambridge NLP journal, all rights belong to them
点击查看摘要
Abstract:Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization; a "subwords manifest", that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER), syntactic (POS, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to boundary-level micro/macro F1, decoupled lemma atomicity vs. surface boundary hits, over/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, and affix-type coverage and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this "subwords manifest" delivers actionable guidance for building effective tokenizers in MRLs and establishes a reproducible foundation for future research.
5. 【2602.06941】Endogenous Resistance to Activation Steering in Language Models
链接:https://arxiv.org/abs/2602.06941
作者:Alex McKenzie,Keenan Pepper,Stijn Servaes,Martin Leitgab,Murat Cubuktepe,Mike Vaiana,Diogo de Lucena,Judd Rosenblatt,Michael S. A. Graziano
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, produce improved responses, steering remains active, Large language, resist task-misaligned activation
备注:
点击查看摘要
Abstract:Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at this http URL.
6. 【2602.06920】Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
链接:https://arxiv.org/abs/2602.06920
作者:Samir Abdaljalil,Parichit Sharma,Erchin Serpedin,Hasan Kurban
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:persistent challenge, difficult to maintain, large language models, factual consistency, consistency is difficult
备注:
点击查看摘要
Abstract:Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages, English, Arabic, Hindi, and Turkish, and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation\footnote{this https URL}.
7. 【2602.06869】Uncovering Cross-Objective Interference in Multi-Objective Alignment
链接:https://arxiv.org/abs/2602.06869
作者:Yining Lu,Meng Jiang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:persistent failure mode, persistent failure, failure mode, mode in multi-objective, large language models
备注:
点击查看摘要
Abstract:We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves at first order when its reward exhibits positive covariance with the scalarized score. We extend this analysis to clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between objective rewards and the training signal to effectively mitigate cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak--Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific model geometric properties.
Subjects:
Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:
arXiv:2602.06869 [cs.CL]
(or
arXiv:2602.06869v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.06869
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
8. 【2602.06854】SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
链接:https://arxiv.org/abs/2602.06854
作者:Mingqian Feng,Xiaodong Liu,Weiwei Yang,Jialin Song,Xuekai Zhu,Chenliang Xu,Jianfeng Gao
类目:Computation and Language (cs.CL)
关键词:real threat model, safety-aligned chatbots, special case, capture the real, real threat
备注: ICLR 2026, 37 pages, 13 tables, 7 figures
点击查看摘要
Abstract:Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA performs an average $80.1\%$ ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% over SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: this https URL.
9. 【2602.06843】he Representational Geometry of Number
链接:https://arxiv.org/abs/2602.06843
作者:Zhimin Hu,Lanhao Niu,Sashank Varma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:minimize task interference, support generalization, central question, question in cognitive, cognitive science
备注:
点击查看摘要
Abstract:A central question in cognitive science is whether conceptual representations converge onto a shared manifold to support generalization, or diverge into orthogonal subspaces to minimize task interference. While prior work has discovered evidence for both, a mechanistic account of how these properties coexist and transform across tasks remains elusive. We propose that representational sharing lies not in the concepts themselves, but in the geometric relations between them. Using number concepts as a testbed and language models as high-dimensional computational substrates, we show that number representations preserve a stable relational structure across tasks. Task-specific representations are embedded in distinct subspaces, with low-level features like magnitude and parity encoded along separable linear directions. Crucially, we find that these subspaces are largely transformable into one another via linear mappings, indicating that representations share relational structure despite being located in distinct subspaces. Together, these results provide a mechanistic lens of how language models balance the shared structure of number representation with functional flexibility. It suggests that understanding arises when task-specific transformations are applied to a shared underlying relational structure of conceptual representations.
10. 【2602.06799】Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations
链接:https://arxiv.org/abs/2602.06799
作者:Shamik Bhattacharya,Daniel Perkins,Yaren Dogan,Vineeth Konjeti,Sudarshan Srinivasan,Edmon Begoli
类目:Computation and Language (cs.CL)
关键词:poses persistent challenges, Ambiguity poses persistent, natural language understanding, Word Sense Disambiguation, large language models
备注: 9 pages, 6 figures, pending journal/workshop submission
点击查看摘要
Abstract:Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
11. 【2602.06795】Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
链接:https://arxiv.org/abs/2602.06795
作者:Kate Sanders,Nathaniel Weir,Sapana Chaudhary,Kaj Bostrom,Huzefa Rangwala
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, requiring expert knowledge, reasoning output verification, reliably identify errors
备注:
点击查看摘要
Abstract:An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or "rubrics", demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models' task accuracy on difficult domains over models trained by general LLMs-as-judges by +45%, and approach performance of models trained by verifiable rewards while using as little as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
12. 【2602.06763】R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging
链接:https://arxiv.org/abs/2602.06763
作者:Yanlin Lai,Mitt Huang,Hangyu Guo,Xiangfeng Wang,Haodong Li,Shaoxiong Zhan,Liang Zhao,Chengyuan Yao,Yinmin Zhang,Qi Han,Chun Yuan,Zheng Ge,Xiangyu Zhang,Daxin Jiang
类目:Computation and Language (cs.CL)
关键词:Reinforcement Learning, Human Feedback, Learning from Human, large language models, aligning large language
备注: Github: [this https URL](https://github.com/lyn22333/R-Align) Huggingface: [this https URL](https://huggingface.co/collections/lyn22333/r-align)
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome-label-only, leaving reasoning quality unchecked. We show that reasoning fidelity-the consistency between a GenRM's preference decision and reference decision rationales-is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr)-the fraction of label-correct decisions with rationales misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
13. 【2602.06724】able-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion
链接:https://arxiv.org/abs/2602.06724
作者:Tian Lan,Felix Henry,Bin Zhu,Qianghuai Jia,Junyang Ren,Qihang Pu,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo
类目:Computation and Language (cs.CL)
关键词:Current Information Seeking, Current Information, Information Seeking, tracking search states, massive search results
备注:
点击查看摘要
Abstract:Current Information Seeking (InfoSeeking) agents struggle to maintain focus and coherence during long-horizon exploration, as tracking search states, including planning procedure and massive search results, within one plain-text context is inherently fragile. To address this, we introduce \textbf{Table-as-Search (TaS)}, a structured planning framework that reformulates the InfoSeeking task as a Table Completion task. TaS maps each query into a structured table schema maintained in an external database, where rows represent search candidates and columns denote constraints or required information. This table precisely manages the search states: filled cells strictly record the history and search results, while empty cells serve as an explicit search plan. Crucially, TaS unifies three distinct InfoSeeking tasks: Deep Search, Wide Search, and the challenging DeepWide Search. Extensive experiments demonstrate that TaS significantly outperforms numerous state-of-the-art baselines across three kinds of benchmarks, including multi-agent framework and commercial systems. Furthermore, our analysis validates the TaS's superior robustness in long-horizon InfoSeeking, alongside its efficiency, scalability and flexibility. Code and datasets are publicly released at this https URL.
14. 【2602.06692】Evaluating Prompt Engineering Strategies for Sentiment Control in AI-Generated Texts
链接:https://arxiv.org/abs/2602.06692
作者:Kerstin Sahler,Sophie Jentzsch
类目:Computation and Language (cs.CL)
关键词:Large Language Models, emotion-adaptive Artificial Intelligence, Language Models, Artificial Intelligence, Large Language
备注: The definitive, peer-reviewed and edited version of this article is published in HHAI 2025 - Proceedings of the Fourth International Conference on Hybrid Human-Artificial Intelligence, Frontiers in Artificial Intelligence and Applications, Volume 408, ISBN 978-1-64368-611-0, pages 423 - 438, 2025
点击查看摘要
Abstract:The groundbreaking capabilities of Large Language Models (LLMs) offer new opportunities for enhancing human-computer interaction through emotion-adaptive Artificial Intelligence (AI). However, deliberately controlling the sentiment in these systems remains challenging. The present study investigates the potential of prompt engineering for controlling sentiment in LLM-generated text, providing a resource-sensitive and accessible alternative to existing methods. Using Ekman's six basic emotions (e.g., joy, disgust), we examine various prompting techniques, including Zero-Shot and Chain-of-Thought prompting using gpt-3.5-turbo, and compare it to fine-tuning. Our results indicate that prompt engineering effectively steers emotions in AI-generated texts, offering a practical and cost-effective alternative to fine-tuning, especially in data-constrained settings. In this regard, Few-Shot prompting with human-written examples was the most effective among other techniques, likely due to the additional task-specific guidance. The findings contribute valuable insights towards developing emotion-adaptive AI systems.
15. 【2602.06669】compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data
链接:https://arxiv.org/abs/2602.06669
作者:Lucie Termignon,Simonas Zilinskas,Hadrien Pélissier,Aurélien Barrot,Nicolas Chesnais,Elie Gavoty
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:show reduced performance, Large Language Models, human preference alignment, human preference data, Direct Preference Optimization
备注: 18 pages, 7 figures, preprint
点击查看摘要
Abstract:Large Language Models (LLMs) often show reduced performance, cultural alignment, and safety robustness in non-English languages, partly because English dominates both pre-training data and human preference alignment datasets. Training methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) require human preference data, which remains scarce and largely non-public for many languages beyond English. To address this gap, we introduce compar:IA, an open-source digital public service developed inside the French government and designed to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across a diverse set of language models, while maintaining low participation friction and privacy-preserving automated filtering. As of 2026-02-07, compar:IA has collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of the data in French. We release three complementary datasets -- conversations, votes, and reactions -- under open licenses, and present initial analyses, including a French-language model leaderboard and user interaction patterns. Beyond the French context, compar:IA is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and the study of human-AI interaction.
16. 【2602.06665】Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity
链接:https://arxiv.org/abs/2602.06665
作者:Bowen Zhang,Meiyi Wang,Harold Soh
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Post-training improves instruction-following, reduces generation diversity, large language models, Constrained Random Character, mode collapse
备注: 16 pages, 7 figures, 12 tables
点击查看摘要
Abstract:Post-training improves instruction-following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open-ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre-trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task -- Constrained Random Character(CRC) -- with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training-free method that restores selected layers in a post-trained model to their pre-trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open-ended question answering, and multi-step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.
17. 【2602.06650】Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought
链接:https://arxiv.org/abs/2602.06650
作者:Jianfeng Si,Lin Sun,Weihong Lin,Xiangzheng Zhang
类目:Computation and Language (cs.CL)
关键词:lack runtime controllabilityxf, Large Language Models, Large Language, face a fundamental, due to static
备注: 13 pages, 5 tables, 2 figures
点击查看摘要
Abstract:Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllabilityxf, making it difficult to tailor responses to diverse application needs. %As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
Comments:
13 pages, 5 tables, 2 figures
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.06650 [cs.CL]
(or
arXiv:2602.06650v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.06650
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2602.06647】Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features
链接:https://arxiv.org/abs/2602.06647
作者:Steffen Freisinger,Philipp Seeberger,Tobias Bocklet,Korbinian Riedhammer
类目:Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:Spoken content, spans multiple topics, makes automatic topic, downstream applications, spans multiple
备注: Accepted to IEEE ICASSP 2026
点击查看摘要
Abstract:Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
19. 【2602.06625】FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
链接:https://arxiv.org/abs/2602.06625
作者:Bo Yang,Lanfei Feng,Yunkui Chen,Yu Zhang,Xiao Xu,Shijian Li
类目:Computation and Language (cs.CL)
关键词:pointwise versus pairwise, domain-specific evaluation criteria, systematic biases driven, systems suffer, fundamental limitations
备注:
点击查看摘要
Abstract:Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
20. 【2602.06623】Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention
链接:https://arxiv.org/abs/2602.06623
作者:Himanshu Singh,Ziwei Xu,A. V. Subramanyam,Mohan Kankanhalli
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:Large Language Models, Large Language, seemingly harmless prompts, powerful text generators, Language Models
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence, or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns from underlying model representations, while preserving overall ability to generate safe fluent content. On the RealToxicityPrompts, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces toxicity of state-of-the-art detoxification systems by 8-20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
21. 【2602.06600】Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
链接:https://arxiv.org/abs/2602.06600
作者:Zhuoyuan Hao,Zhuo Li,Wu Li,Fangming Liu,Min Zhang,Jing Li
类目:Computation and Language (cs.CL)
关键词:Test-time compute allocation, mathematical problem solving, large reasoning models, Test-time compute, emph
备注:
点击查看摘要
Abstract:Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic ``thinking tokens'' and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain -- and often ignore -- the \emph{spontaneous} repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emph{Echo of Prompt (EOP)}, as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the \emph{Echo Likelihood Gap} $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical link that links early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emph{Echo-Distilled SFT (ED-SFT)} to instill an ``echo-then-reason'' pattern through supervised finetuning, and \emph{Echoic Prompting (EP)} to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer to answer-prefix attention in middle layers, consistent with an \emph{attention refocusing} mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at this https URL.
22. 【2602.06596】Personality as Relational Infrastructure: User Perceptions of Personality-Trait-Infused LLM Messaging
链接:https://arxiv.org/abs/2602.06596
作者:Dominik P. Hofer,David Haag,Rania Islambouli,Jan D. Smeddinck
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Digital behaviour change, Digital behaviour, systems increasingly rely, rely on repeated, increasingly rely
备注: Currently under review
点击查看摘要
Abstract:Digital behaviour change systems increasingly rely on repeated, system-initiated messages to support users in everyday contexts. LLMs enable these messages to be personalised consistently across interactions, yet it remains unclear whether such personalisation improves individual messages or instead shapes users' perceptions through patterns of exposure. We explore this question in the context of LLM-generated JITAIs, which are short, context-aware messages delivered at moments deemed appropriate to support behaviour change, using physical activity as an application domain. In a controlled retrospective study, 90 participants evaluated messages generated using four LLM strategies: baseline prompting, few-shot prompting, fine-tuned models, and retrieval augmented generation, each implemented with and without Big Five Personality Traits to produce personality-aligned communication across multiple scenarios. Using ordinal multilevel models with within-between decomposition, we distinguish trial-level effects, whether personality information improves evaluations of individual messages, from person-level exposure effects, whether participants receiving higher proportions of personality-informed messages exhibit systematically different overall perceptions. Results showed no trial-level associations, but participants who received higher proportions of BFPT-informed messages rated the messages as more personalised, appropriate, and reported less negative affect. We use Communication Accommodation Theory for post-hoc analysis. These results suggest that personality-based personalisation in behaviour change systems may operate primarily through aggregate exposure rather than per-message optimisation, with implications for how adaptive systems are designed and evaluated in sustained human-AI interaction. In-situ longitudinal studies are needed to validate these findings in real-world contexts.
23. 【2602.06584】Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning
链接:https://arxiv.org/abs/2602.06584
作者:Deqian Kong,Minglu Zhao,Aoyang Qin,Bo Pang,Chenxin Tao,David Hartmann,Edouardo Honig,Dehong Xu,Amit Kumar,Matt Sarte,Chuan Li,Jianwen Xie,Ying Nian Wu
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
关键词:single forward pass, latent thought vectors, forward pass, committing irrevocably, early errors
备注:
点击查看摘要
Abstract:Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.
24. 【2602.06570】Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making
链接:https://arxiv.org/abs/2602.06570
作者:Baichuan-M3 Team:Chengfeng Dou,Fan Yang,Fei Li,Jiyuan Jia,Qiang Ju,Shuai Wang,Tianpeng Li,Xiangrong Zeng,Yijie Zhou,Hongda Zhang,Jinyang Tai,Linzhuang Sun,Peidong Guo,Yichuan Mo,Xiaochuan Wang,Hengfu Cui,Zhishou Zhang
类目:Computation and Language (cs.CL)
关键词:clinical-grade decision support, medical-enhanced large language, large language model, language model engineered, question-answering to active
备注:
点击查看摘要
Abstract:We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety. The models are publicly available at this https URL.
25. 【2602.06566】SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
链接:https://arxiv.org/abs/2602.06566
作者:Niccolo Avogaro,Nayanika Debnath,Li Mi,Thomas Frick,Junling Wang,Zexue He,Hang Hua,Konrad Schindler,Mattia Rigotti
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:completely wrong answers, small perceptual mistakes, leading to long, recent successes, dynamically expanding
备注:
点击查看摘要
Abstract:Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
26. 【2602.06547】Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study
链接:https://arxiv.org/abs/2602.06547
作者:Yi Liu,Zhihao Chen,Yanjun Zhang,Gelei Deng,Yuekang Li,Jianting Ning,Leo Yu Zhang
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
关键词:Third-party agent skills, extend LLM-based agents, skills extend LLM-based, Third-party agent, users' machines
备注:
点击查看摘要
Abstract:Third-party agent skills extend LLM-based agents with instruction files and executable code that run on users' machines. Skills execute with user privileges and are distributed through community registries with minimal vetting, but no ground-truth dataset exists to characterize the resulting threats. We construct the first labeled dataset of malicious agent skills by behaviorally verifying 98,380 skills from two community registries, confirming 157 malicious skills with 632 vulnerabilities. These attacks are not incidental. Malicious skills average 4.03 vulnerabilities across a median of three kill chain phases, and the ecosystem has split into two archetypes: Data Thieves that exfiltrate credentials through supply chain techniques, and Agent Hijackers that subvert agent decision-making through instruction manipulation. A single actor accounts for 54.1\% of confirmed cases through templated brand impersonation. Shadow features, capabilities absent from public documentation, appear in 0\% of basic attacks but 100\% of advanced ones; several skills go further by exploiting the AI platform's own hook system and permission flags. Responsible disclosure led to 93.6\% removal within 30 days. We release the dataset and analysis pipeline to support future work on agent skill security.
27. 【2602.06546】MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew
链接:https://arxiv.org/abs/2602.06546
作者:Andy Rosenbaum,Assaf Siani,Ilan Kernerman
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Translation Quality Estimation, Machine Translation Quality, Quality Estimation, translation quality annotated, Machine Translation
备注: Accepted to LoResLM at EACL 2026
点击查看摘要
Abstract:We release this http URL-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. this http URL-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. this http URL-he and our experimental results enable future research on this under-resourced language pair.
28. 【2602.06540】AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research
链接:https://arxiv.org/abs/2602.06540
作者:Yishan Li,Wentong Chen,Yukun Yan,Mingwei Li,Sen Mei,Xiaorong Wang,Kunpeng Liu,Xin Cong,Shuo Wang,Zhong Zhang,Yaxi Lu,Zhenghao Liu,Yankai Lin,Zhiyuan Liu,Maosong Sun
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Generating deep research, requires large-scale information, reports requires large-scale, Generating deep, insight-driven analysis
备注:
点击查看摘要
Abstract:Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the initial outline. However, constructing a comprehensive outline itself demands strong reasoning ability, causing current deep research systems to rely almost exclusively on closed-source or online large models. This reliance raises practical barriers to deployment and introduces safety and privacy concerns for user-authored data. In this work, we present AgentCPM-Report, a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process and an 8B-parameter deep research agent. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines during report generation. Under this policy, the agent alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, jointly supporting information acquisition, knowledge refinement, and iterative outline evolution. To effectively equip small models with this capability, we introduce a Multi-Stage Agentic Training strategy, consisting of cold-start, atomic skill RL, and holistic pipeline RL. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems, with substantial gains in Insight.
29. 【2602.06533】LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
链接:https://arxiv.org/abs/2602.06533
作者:Brian Rabern,Philipp Mondorf,Barbara Plank
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, demonstrated notable performance, Large language, demonstrated notable, Large
备注: 13 pages, 5 figures
点击查看摘要
Abstract:Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) $\textit{formal symbolization}\unicode{x2014}$translating premises into first-order logic; (ii) $\textit{countermodel construction}\unicode{x2014}$formulating a finite structure in which all premises are true while the conclusion is false; and (iii) $\textit{validity assessment}\unicode{x2014}$deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
30. 【2602.06526】Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks
链接:https://arxiv.org/abs/2602.06526
作者:Minjeong Ban,Jeonghwan Choi,Hyangsuk Min,Nicole Hee-Yeon Kim,Minseok Kim,Jae-Gil Lee,Hwanjun Song
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Information retrieval, remains challenging due, unlabeled relevant chunks, challenging due, due to incomplete
备注: Accepted at ICLR 2026
点击查看摘要
Abstract:Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https: //github.com/DISL-Lab/DREAM-ICLR-26; and the BRIDGE dataset is available at this https URL.
31. 【2602.06506】Designing Computational Tools for Exploring Causal Relationships in Qualitative Data
链接:https://arxiv.org/abs/2602.06506
作者:Han Meng,Qiuyuan Lyu,Peinuan Qin,Yitian Yang,Renwen Zhang,Wen-Chieh Lin,Yi-Chieh Lee
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Exploring causal relationships, HCI and social, social science research, science research enables, qualitative data
备注: 19 pages, 5 figures, conditionally accepted by CHI26
点击查看摘要
Abstract:Exploring causal relationships for qualitative data analysis in HCI and social science research enables the understanding of user needs and theory building. However, current computational tools primarily characterize and categorize qualitative data; the few systems that analyze causal relationships either inadequately consider context, lack credibility, or produce overly complex outputs. We first conducted a formative study with 15 participants interested in using computational tools for exploring causal relationships in qualitative data to understand their needs and derive design guidelines. Based on these findings, we designed and implemented QualCausal, a system that extracts and illustrates causal relationships through interactive causal network construction and multi-view visualization. A feedback study (n = 15) revealed that participants valued our system for reducing the analytical burden and providing cognitive scaffolding, yet navigated how such systems fit within their established research paradigms, practices, and habits. We discuss broader implications for designing computational tools that support qualitative data analysis.
32. 【2602.06471】Revisiting the Shape Convention of Transformer Language Models
链接:https://arxiv.org/abs/2602.06471
作者:Feng-Ting Liao,Meng-Hsi Chen,Guan-Ting Yi,Da-shan Shiu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Dense Transformer language, Dense Transformer, FFN, hourglass FFN, lighter hourglass FFN
备注:
点击查看摘要
Abstract:Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformer, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass FFN can be more effectively utilized, such as by enlarging model hidden dimensions under fixed budgets. We confirm these through empirical validations across model scales: hourglass FFNs outperform conventional FFNs up to 400M and achieve comparable performance at larger scales to 1B parameters; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and the balance between attention and FFN towards efficient and expressive modern language models.
33. 【2602.06470】Improve Large Language Model Systems with User Logs
链接:https://arxiv.org/abs/2602.06470
作者:Changyue Wang,Weihang Su,Qingyao Ai,Yiqun Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Scaling training data, rising computational costs, long driven progress, Scaling training, large language models
备注:
点击查看摘要
Abstract:Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at this https URL .
34. 【2602.06462】Diffusion-State Policy Optimization for Masked Diffusion Language Models
链接:https://arxiv.org/abs/2602.06462
作者:Daisuke Oba,Hiroki Furuta,Naoaki Okazaki
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:language models generate, yields coarse credit, coarse credit assignment, multiple denoising steps, diffusion language models
备注:
点击查看摘要
Abstract:Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at this https URL .
35. 【2602.06454】RelayGen: Intra-Generation Model Switching for Efficient Reasoning
链接:https://arxiv.org/abs/2602.06454
作者:Jiwon Song,Yoongon Kim,Jae-Joon Kim
类目:Computation and Language (cs.CL)
关键词:substantial deployment cost, inference-time scaling incurs, scaling incurs substantial, incurs substantial deployment, multi-step reasoning trajectories
备注:
点击查看摘要
Abstract:Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present \textbf{RelayGen}, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2$\times$ end-to-end speedup with less than 2\% accuracy degradation, without requiring additional training or learned routing components.
36. 【2602.06449】Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning
链接:https://arxiv.org/abs/2602.06449
作者:Xinxin Lin,Guangxin Dai,Yi Zhong,Xiang Li,Xue Xiao,Yixin Zhang,Zhengdong Wu,Yongbo Zheng,Runchuan Zhu,Ming Zhao,Huizi Yu,Shuo Wu,Jun Zhao,Lingming Hu,Yumei Wang,Ping Yin,Joey W.Y. Chan,Ngan Yin Chan,Sijing Chen,Yun Kwok Wing,Lin Lu,Xin Ma,Lizhou Fan
类目:Computation and Language (cs.CL)
关键词:hold transformative potential, Large language models, Large language, psychiatry remains constrained, hold transformative
备注: 21 pages, 8 figures
点击查看摘要
Abstract:Large language models (LLMs) hold transformative potential for medical decision support yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on a unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
37. 【2602.06446】CORE: Comprehensive Ontological Relation Evaluation for Large Language Models
链接:https://arxiv.org/abs/2602.06446
作者:Satyam Dwivedi,Sanjukta Ghosh,Shivam Dwivedi,Nishi Kumari,Anil Thakur,Anurag Purushottam,Deepak Alok,Praveen Gatla,Manjuprasad B,Bipasha Patgiri
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, existing evaluations rarely, evaluations rarely assess
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
38. 【2602.06440】railBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking
链接:https://arxiv.org/abs/2602.06440
作者:Sung-Hoon Yoon,Ruizhi Qian,Minda Zhao,Weiyue Li,Mengyu Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:Large Language Models, Large Language, Language Models, making their safety, Large
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
39. 【2602.06430】Investigating the structure of emotions by analyzing similarity and association of emotion words
链接:https://arxiv.org/abs/2602.06430
作者:Fumitaka Iwaki,Tatsuji Takahashi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:natural language processing, attempted sentiment analysis, language processing, response variables, field of natural
备注: 5 figures, 8 tables
点击查看摘要
Abstract:In the field of natural language processing, some studies have attempted sentiment analysis on text by handling emotions as explanatory or response variables. One of the most popular emotion models used in this context is the wheel of emotion proposed by Plutchik. This model schematizes human emotions in a circular structure, and represents them in two or three dimensions. However, the validity of Plutchik's wheel of emotion has not been sufficiently examined. This study investigated the validity of the wheel by creating and analyzing a semantic networks of emotion words. Through our experiments, we collected data of similarity and association of ordered pairs of emotion words, and constructed networks using these data. We then analyzed the structure of the networks through community detection, and compared it with that of the wheel of emotion. The results showed that each network's structure was, for the most part, similar to that of the wheel of emotion, but locally different.
40. 【2602.06423】On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
链接:https://arxiv.org/abs/2602.06423
作者:Wenbo Shang,Yuxi Sun,Jing Ma,Xin Huang
类目:Computation and Language (cs.CL)
关键词:intricate human language, daily life, intricate human, Humor, human language
备注: Paper accepted as a conference paper at ICLR 2026
点击查看摘要
Abstract:Humor is a commonly used and intricate human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs), which is typically as funny caption generation for images, requiring visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands the creative space of them through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
41. 【2602.06412】Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
链接:https://arxiv.org/abs/2602.06412
作者:Daisuke Oba,Danushka Bollegala,Masahiro Kaneko,Naoaki Okazaki
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Masked Diffusion Language, Diffusion Language Models, Language Models generate, Masked Diffusion, Diffusion Language
备注: Accepted at ICLR 2026
点击查看摘要
Abstract:Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at this https URL .
42. 【2602.06384】FMBench: Adaptive Large Language Model Output Formatting
链接:https://arxiv.org/abs/2602.06384
作者:Yaoting Wang,Yun Zhou,Henghui Ding
类目:Computation and Language (cs.CL)
关键词:deploying large language, Producing outputs, system-integrated workflows, intent and format, essential for deploying
备注:
点击查看摘要
Abstract:Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: this https URL.
43. 【2602.06373】ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis
链接:https://arxiv.org/abs/2602.06373
作者:Tianqiang Yan,Sihan Shang,Yuheng Li,Song Qiu,Hao Peng,Wenjian Luo,Jue Xie,Lizhen Qu,Yuan Gao
类目:Computation and Language (cs.CL)
关键词:textbf, language model reliability, enhance language model, mechanisms remain opaque, texttt
备注: 17 pages, 3 figures
点击查看摘要
Abstract:While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, with existing analyses often yielding correlation-based insights that fail to generalize. To address this, we introduce \textbf{\texttt{ReBeCA}} (self-\textbf{\texttt{Re}}flection \textbf{\texttt{Be}}havior explained through \textbf{\texttt{C}}ausal \textbf{\texttt{A}}nalysis), a framework that unveils the interpretable behavioral hierarchy governing the self-reflection outcome. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) \textbf{Behavioral hierarchy:} Semantic behaviors of the model influence final self-reflection results hierarchically: directly or indirectly; (2) \textbf{Causation matters:} Generalizability in self-reflection effects is limited to just a few semantic behaviors; (3) \textbf{More $\mathbf{\neq}$ better:} The confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to $49.6\%$ structural likelihood gains, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm these causal relationships hold out-of-distribution ($p = .013, \eta^2_\mathrm{p} = .071$). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.
44. 【2602.06370】Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production
链接:https://arxiv.org/abs/2602.06370
作者:Alberto Andres Valdes Gonzalez
类目:Computation and Language (cs.CL)
关键词:Claude Sonnet, demonstrated strong capabilities, Large language models, generative language tasks, Large language
备注: 26 pages, 12 figures. Empirical benchmark comparing fine-tuned encoders and LLM prompting for text classification under cost and latency constraints
点击查看摘要
Abstract:Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
Comments:
26 pages, 12 figures. Empirical benchmark comparing fine-tuned encoders and LLM prompting for text classification under cost and latency constraints
Subjects:
Computation and Language (cs.CL)
ACMclasses:
I.2.7; I.5.1; I.2.6
Cite as:
arXiv:2602.06370 [cs.CL]
(or
arXiv:2602.06370v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.06370
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Alberto Andres Valdes Gonzalez [view email] [v1]
Fri, 6 Feb 2026 03:54:28 UTC (1,082 KB)
45. 【2602.06358】SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass
链接:https://arxiv.org/abs/2602.06358
作者:Yewei Liu,Xiyuan Wang,Yansheng Mao,Yoav Gelbery,Haggai Maron,Muhan Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Scalable Hyper In-context, Hyper In-context NEtwork, large language models, Scalable Hyper, Hyper In-context
备注:
点击查看摘要
Abstract:We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLM). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at this https URL
46. 【2602.06337】Can Post-Training Transform LLMs into Causal Reasoners?
链接:https://arxiv.org/abs/2602.06337
作者:Junqi Chen,Sirui Chen,Chaochao Lu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:challenging for non-experts, essential for decision-making, decision-making but remains, remains challenging, Causal inference
备注:
点击查看摘要
Abstract:Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at this https URL.
47. 【2602.06317】he Condensate Theorem: Transformers are O(n), Not $O(n^2)$
链接:https://arxiv.org/abs/2602.06317
作者:Jorge L. Ruiz Williams
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:learned topological property, Condensate Theorem, architectural constraint, present the Condensate, topological property
备注: 13 pages, 4 figures, 8 tables
点击查看摘要
Abstract:We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold -- and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation -- it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected 1,200x speedup at 1M tokens, reducing inference costs by 99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not intelligence.
48. 【2602.06307】Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions
链接:https://arxiv.org/abs/2602.06307
作者:Nemika Tyagi,Holly Hendrix,Nelvin Licona-Guevara,Justin Mackie,Phanos Kareen,Muhammad Imran,Megan Michelle Smith,Tatiana Gallego Hernande,Chitta Baral,Olga Kellert
类目:Computation and Language (cs.CL)
关键词:standard Universal Dependencies, Universal Dependencies, spoken CSW, written text, CSW
备注: 18 pages, 4 Figures
点击查看摘要
Abstract:Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions, causing parsers and large language models (LLMs) to fail despite strong performance on written data. These failures are compounded by rigid evaluation metrics that conflate genuine structural errors with acceptable variation. In this work, we present a systems-oriented approach to spoken CSW parsing. We introduce a linguistically grounded taxonomy of spoken CSW phenomena and SpokeBench, an expert-annotated gold benchmark designed to test spoken-language structure beyond standard UD assumptions. We further propose FLEX-UD, an ambiguity-aware evaluation metric, which reveals that existing parsing techniques perform poorly on spoken CSW by penalizing linguistically plausible analyses as errors. We then propose DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis. Experiments show that DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations further reveal qualitative improvements that are masked by standard metrics.
49. 【2602.06291】Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
链接:https://arxiv.org/abs/2602.06291
作者:Guijin Son,Donghun Yang,Hitesh Laxmichand Patel,Hyunwoo Ko,Amit Agarwal,Sunghee Ahn,Kyong-Ha Lee,Youngjae Yu
类目:Computation and Language (cs.CL)
关键词:consuming scarce expert, scarce expert time, generating plausible attempts, Recent progress, reasoning models suggests
备注: Preprint
点击查看摘要
Abstract:Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose \textbf{Consequence-Based Utility}, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances where the underlying solver often fails to solve.
50. 【2602.06275】RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution
链接:https://arxiv.org/abs/2602.06275
作者:Isaac Picov,Ritesh Goru
类目:Computation and Language (cs.CL)
关键词:Explaining closed-source LLM, closed-source LLM outputs, access prevents gradient-based, API access prevents, Explaining closed-source
备注:
点击查看摘要
Abstract:Explaining closed-source LLM outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce RoPE-LIME, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover's Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.
51. 【2602.06270】VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
链接:https://arxiv.org/abs/2602.06270
作者:Yancheng Wang,Osama Hanna,Ruiming Xie,Xianfeng Rui,Maohao Shen,Xuedong Zhang,Christian Fuegen,Jilong Wu,Debjyoti Paul,Arthur Guo,Zhihong Lei,Ozlem Kalinli,Qing He,Yingzhen Yang
类目:Computation and Language (cs.CL)
关键词:complex multimodal challenge, Emotion recognition, multimodal challenge, requiring comprehension, vocal expressivity
备注: Accepted to ICLR 2026
点击查看摘要
Abstract:Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
52. 【2602.06268】MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs
链接:https://arxiv.org/abs/2602.06268
作者:Junhyeok Lee,Han Jang,Kyu Sung Choi
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, Medical Prompt Injection, prompt injection
备注: 13 pages, 7 figures
点击查看摘要
Abstract:Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code and data are available at GitHub (code) and Hugging Face (data).
53. 【2602.06266】Is my model "mind blurting"? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA)
链接:https://arxiv.org/abs/2602.06266
作者:Quoc Tuan Pham,Mehdi Jafari,Flora Salim
类目:Computation and Language (cs.CL)
关键词:Recurrence Quantification Analysis, impractical and unreliable, compute is central, central to large, text is increasingly
备注:
点击查看摘要
Abstract:Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a brute proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thoughts (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing model's reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model's latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA captures signals not reflected by response length, but also substantially improves prediction of task complexity by 8\%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.
54. 【2602.06260】Can One-sided Arguments Lead to Response Change in Large Language Models?
链接:https://arxiv.org/abs/2602.06260
作者:Pedro Cisneros-Velarde
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:balanced answer, Large Language Models, express a balanced, answer, Language Models
备注:
点击查看摘要
Abstract:Polemic questions need more than one viewpoint to express a balanced answer. Large Language Models (LLMs) can provide a balanced answer, but also take a single aligned viewpoint or refuse to answer. In this paper, we study if such initial responses can be steered to a specific viewpoint in a simple and intuitive way: by only providing one-sided arguments supporting the viewpoint. Our systematic study has three dimensions: (i) which stance is induced in the LLM response, (ii) how the polemic question is formulated, (iii) how the arguments are shown. We construct a small dataset and remarkably find that opinion steering occurs across (i)-(iii) for diverse models, number of arguments, and topics. Switching to other arguments consistently decreases opinion steering.
55. 【2602.06256】Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions
链接:https://arxiv.org/abs/2602.06256
作者:Navita Goyal,Hal Daumé III
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:precisely controlling large, controlling large language, large language models, inference time, involves intervening
备注: EACL 2026 Main, Long Paper
点击查看摘要
Abstract:Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for precisely controlling large language models. While steering efficacy has been widely studied, evaluations of whether interventions alter only the intended property remain limited, especially with respect to unintended changes in behaviors related to the target property. We call this notion specificity. We propose a framework that distinguishes three dimensions of specificity: general (preserving fluency and unrelated abilities), control (preserving related control properties), and robustness (preserving control properties under distribution shifts). We study two safety-critical use cases: steering models to reduce overrefusal and faithfulness hallucinations, and show that while steering achieves high efficacy and largely maintains general and control specificity, it consistently fails to preserve robustness specificity. In the case of overrefusal steering, for example, all steering methods reduce overrefusal without harming general abilities and refusal on harmful queries; however, they substantially increase vulnerability to jailbreaks. Our work provides the first systematic evaluation of specificity in model steering, showing that standard efficacy and specificity checks are insufficient, because without robustness evaluation, steering methods may appear reliable even when they compromise model safety.
56. 【2602.06221】BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
链接:https://arxiv.org/abs/2602.06221
作者:Nishant Balepur,Bhavya Rajasekaran,Jane Oh,Michael Xie,Atrey Desai,Vipul Gupta,Steven James Moore,Eunsol Choi,Rachel Rudinger,Jordan Lee Boyd-Graber
类目:Computation and Language (cs.CL)
关键词:Multiple-choice question answering, rigorous quality control, lack rigorous quality, Multiple-choice question, benchmarks lack rigorous
备注: In-progress preprint
点击查看摘要
Abstract:Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination - items appearing exactly online; 2) shortcuts - cues in the choices that enable guessing; and 3) writing errors - structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 3) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors), but inadvertently add new flaws (i.e. implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
57. 【2602.06204】Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
链接:https://arxiv.org/abs/2602.06204
作者:Nan Chen,Soledad Villar,Soufiane Hayou
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:learning rate, optimal learning rate, learning, Low-Rank Adaptation, rate
备注:
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $\mu$A is inspired from the Maximal-Update Parametrization ($\mu$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision--language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
58. 【2602.06190】Generics in science communication: Misaligned interpretations across laypeople, scientists, and large language models
链接:https://arxiv.org/abs/2602.06190
作者:Uwe Peters,Andrea Bertazzoli,Jasmine M. DeJesus,Gisela J. van der Velden,Benjamin Chin-Yee
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:statins reduce cardiovascular, reduce cardiovascular events, unquantified statements, people or phenomena, statins reduce
备注:
点击查看摘要
Abstract:Scientists often use generics, that is, unquantified statements about whole categories of people or phenomena, when communicating research findings (e.g., "statins reduce cardiovascular events"). Large language models (LLMs), such as ChatGPT, frequently adopt the same style when summarizing scientific texts. However, generics can prompt overgeneralizations, especially when they are interpreted differently across audiences. In a study comparing laypeople, scientists, and two leading LLMs (ChatGPT-5 and DeepSeek), we found systematic differences in interpretation of generics. Compared to most scientists, laypeople judged scientific generics as more generalizable and credible, while LLMs rated them even higher. These mismatches highlight significant risks for science communication. Scientists may use generics and incorrectly assume laypeople share their interpretation, while LLMs may systematically overgeneralize scientific findings when summarizing research. Our findings underscore the need for greater attention to language choices in both human and LLM-mediated science communication.
59. 【2602.06184】PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining
链接:https://arxiv.org/abs/2602.06184
作者:Cheng Liang,Chaoyi Wu,Weike Zhao,Ya Zhang,Yanfeng Wang,Weidi Xie
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:CLIP-like vision-language models, Recent progress, greatly advanced medical, large-scale CLIP-like vision-language, vision-language models
备注:
点击查看摘要
Abstract:Recent progress in large-scale CLIP-like vision-language models(VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image--caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85\% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
60. 【2602.06181】Uncertainty Drives Social Bias Changes in Quantized Large Language Models
链接:https://arxiv.org/abs/2602.06181
作者:Stanley Z. Hua,Sanae Lotfi,Irene Y. Chen
类目:Computation and Language (cs.CL)
关键词:Post-training quantization reduces, aggregate metrics fail, large language models, Post-training quantization, fail to capture
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. We identify a phenomenon we term quantization-induced masked bias flipping, in which up to 21% of responses flip between biased and unbiased states after quantization, despite showing no change in aggregate bias scores. These flips are strongly driven by model uncertainty, where the responses with high uncertainty are 3-11x more likely to change than the confident ones. Quantization strength amplifies this effect, with 4-bit quantized models exhibiting 4-6x more behavioral changes than 8-bit quantized models. Critically, these changes create asymmetric impacts across demographic groups, where bias can worsen by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, requiring crucial post-quantization evaluation and interventions to ensure reliability in practice.
61. 【2602.06176】Large Language Model Reasoning Failures
链接:https://arxiv.org/abs/2602.06176
作者:Peiyang Song,Pengrui Han,Noah Goodman
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, achieving impressive results, exhibited remarkable reasoning
备注: Repository: [this https URL](https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures) . Published at TMLR 2026 with Survey Certification
点击查看摘要
Abstract:Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at this https URL, to provide an easy entry point to this area.
62. 【2602.06161】Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding
链接:https://arxiv.org/abs/2602.06161
作者:Yanzheng Xiang,Lan Wei,Yizhen Yao,Qinglin Zhu,Hanqi Yan,Chen Jin,Philip Alexander Teare,Dandan Zhang,Lin Gui,Amrutha Saseendran,Yulan He
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:accelerate diffusion language, diffusion language model, unmasking multiple tokens, language model inference, Parallel diffusion decoding
备注:
点击查看摘要
Abstract:Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.
63. 【2602.06154】MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models
链接:https://arxiv.org/abs/2602.06154
作者:Nurbek Tastan,Stefanos Laskaridis,Karthik Nandakumar,Samuel Horvath
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:scale large language, sparsely activating experts, language models efficiently, efficiently by sparsely, sparsely activating
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.
64. 【2602.06142】Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering
链接:https://arxiv.org/abs/2602.06142
作者:Amir H. Ashouri,Shayan Shirahmad Gale Bagi,Kavin Satheeskumar,Tejas Srikanth,Jonathan Zhao,Ibrahim Saidoun,Ziwen Wang,Bryan Chan,Tomasz S. Czajkowski
类目:Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
关键词:open problem due, phase ordering problem, vast optimization space, open problem, problem due
备注: Version 1- Submitted for a possible publication in 2026
点击查看摘要
Abstract:The phase ordering problem has been a long-standing challenge since the late 1970s, yet it remains an open problem due to having a vast optimization space and an unbounded nature, making it an open-ended problem without a finite solution, one can limit the scope by reducing the number and the length of optimizations. Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes. In the past 20 years, Machine Learning has been employed to construct performance models to improve the selection and ordering of compiler optimizations, however, the approaches are not baked into the compiler seamlessly and never materialized to be leveraged at a fine-grained scope of code segments. This paper presents Protean Compiler: An agile framework to enable LLVM with built-in phase-ordering capabilities at a fine-grained scope. The framework also comprises a complete library of more than 140 handcrafted static feature collection methods at varying scopes, and the experimental results showcase speedup gains of up to 4.1% on average and up to 15.7% on select Cbench applications wrt LLVM's O3 by just incurring a few extra seconds of build time on Cbench. Additionally, Protean compiler allows for an easy integration with third-party ML frameworks and other Large Language Models, and this two-step optimization shows a gain of 10.1% and 8.5% speedup wrt O3 on Cbench's Susan and Jpeg applications. Protean compiler is seamlessly integrated into LLVM and can be used as a new, enhanced, full-fledged compiler. We plan to release the project to the open-source community in the near future.
65. 【2602.06130】Self-Improving World Modelling with Latent Actions
链接:https://arxiv.org/abs/2602.06130
作者:Yifu Qiu,Zheng Zhao,Waylon Li,Yftah Ziser,Anna Korhonen,Shay B. Cohen,Edoardo M. Ponti
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Internal modelling, Forward World Modelling, Inverse Dynamics Modelling, essential to reasoning, reasoning and planning
备注:
点击查看摘要
Abstract:Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and an Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model's log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
66. 【2602.06056】Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space
链接:https://arxiv.org/abs/2602.06056
作者:Zihang Wang,Siyue Zhang,Yilun Zhao,Jingyi Yang,Tingyu Song,Anh Tuan Luu,Chen Zhao
类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language Models, Large Diffusion Language, Large Language Models, Diffusion Language Models, Language Models
备注:
点击查看摘要
Abstract:Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. This progress naturally raises a critical yet unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models? To answer this, we present the first systematic study of converting Multimodal dLLMs into embedding models. We evaluate state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval. Our results show that Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. The stronger diffusion-based model, LaViDa, lags by only 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval tasks, whereas the other diffusion-based model, MMaDA, exhibits substantially larger performance gaps, exceeding 20 points across all tasks. Further analysis reveals insufficient image-text alignment in diffusion-based models, accounting for the observed limitations in their embedding performance.
67. 【2602.06055】Quantifying and Attributing Polarization to Annotator Groups
链接:https://arxiv.org/abs/2602.06055
作者:Dimitris Tsirmpas,John Pavlopoulos
类目:Computation and Language (cs.CL)
关键词:group size imbalances, well-suited for inter-group, size imbalances, imbalances and restricted, restricted to single-annotation
备注: 28 pages, 6 tables, 7 figures, 1 algorithm
点击查看摘要
Abstract:Current annotation agreement metrics are not well-suited for inter-group analysis, are sensitive to group size imbalances and restricted to single-annotation settings. These restrictions render them insufficient for many subjective tasks such as toxicity and hate-speech detection. For this reason, we introduce a quantifiable metric, paired with a statistical significance test, that attributes polarization to various annotator groups. Our metric enables direct comparisons between heavily imbalanced sociodemographic and ideological subgroups across different datasets and tasks, while also enabling analysis on multi-label settings. We apply this metric to three datasets on hate speech, and one on toxicity detection, discovering that: (1) Polarization is strongly and persistently attributed to annotator race, especially on the hate speech task. (2) Religious annotators do not fundamentally disagree with each other, but do with other annotators, a trend that is gradually diminished and then reversed with irreligious annotators. (3) Less educated annotators are more subjective, while educated ones tend to broadly agree more between themselves. Overall, our results reflect current findings around annotation patterns for various subgroups. Finally, we estimate the minimum number of annotators needed to obtain robust results, and provide an open-source Python library that implements our metric.
68. 【2602.06054】What Is Novel? A Knowledge-Driven Framework for Bias-Aware Literature Originality Evaluation
链接:https://arxiv.org/abs/2602.06054
作者:Abeer Mostafa,Thi Huyen Nguyen,Zahra Ahmadi
类目:Computation and Language (cs.CL)
关键词:highly subjective aspect, Assessing research novelty, Assessing research, typically based, core yet highly
备注:
点击查看摘要
Abstract:Assessing research novelty is a core yet highly subjective aspect of peer review, typically based on implicit judgment and incomplete comparison to prior work. We introduce a literature-aware novelty assessment framework that explicitly learns how humans judge novelty from peer-review reports and grounds these judgments in structured comparison to existing research. Using nearly 80K novelty-annotated reviews from top-tier AI conferences, we fine-tune a large language model to capture reviewer-aligned novelty evaluation behavior. For a given manuscript, the system extracts structured representations of its ideas, methods, and claims, retrieves semantically related papers, and constructs a similarity graph that enables fine-grained, concept-level comparison to prior work. Conditioning on this structured evidence, the model produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency relative to existing approaches.
69. 【2602.06053】PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models
链接:https://arxiv.org/abs/2602.06053
作者:Rajarshi Roy,Jonathan Raiman,Sang-gil Lee,Teodor-Dumitru Ene,Robert Kirby,Sungwon Kim,Jaehyeon Kim,Bryan Catanzaro
类目:Computation and Language (cs.CL)
关键词:Recent advances, duplex speech models, Recent, speech, models
备注:
点击查看摘要
Abstract:Recent advances in duplex speech models have enabled natural, low-latency speech-to-speech interactions. However, existing models are restricted to a fixed role and voice, limiting their ability to support structured, role-driven real-world applications and personalized interactions. In this work, we introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts, combining role conditioning with text prompts and voice cloning with speech samples. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations, generated with open-source large language models (LLM) and text-to-speech (TTS) models. To evaluate role conditioning in real-world settings, we extend the Full-Duplex-Bench benchmark beyond a single assistant role to multi-role customer service scenarios. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness, surpassing state-of-the-art duplex speech models and hybrid large language model-based speech systems in role adherence, speaker similarity, latency, and naturalness.
70. 【2602.06052】Rethinking Memory Mechanisms of Foundation Agents in the Second Half
链接:https://arxiv.org/abs/2602.06052
作者:Wei-Chieh Huang,Weizhi Zhang,Yueqing Liang,Yuanchen Bei,Yankai Chen,Tao Feng,Xinyu Pan,Zhen Tan,Yu Wang,Tianxin Wei,Shanglin Wu,Ruiyao Xu,Liangwei Yang,Rui Yang,Wooseong Yang,Chin-Yuan Yeh,Hanrong Zhang,Haozhen Zhang,Siqi Zhu,Henry Peng Zou,Wanjia Zhao,Song Wang,Wujiang Xu,Zixuan Ke,Zheng Hui,Dawei Li,Yaozu Wu,Langzhou He,Chen Wang,Xiongxiao Xu,Baixiang Huang,Juntao Tan,Shelby Heinecke,Huan Wang,Caiming Xiong,Ahmed A. Metwally,Jun Yan,Chen-Yu Lee,Hanqing Zeng,Yinglong Xia,Xiaokai Wei,Ali Payani,Yu Wang,Haitong Ma,Wenya Wang,Chengguang Wang,Yu Zhang,Xin Wang,Yongfeng Zhang,Jiaxuan You,Hanghang Tong,Xiao Luo,Yizhou Sun,Wei Wang,Julian McAuley,James Zou,Jiawei Han,Philip S. Yu,Kai Shu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:prioritizing model innovations, emphasizing problem definition, rigorous real-world evaluation, research of artificial, artificial intelligence
备注:
点击查看摘要
Abstract:The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the "second half," the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.
71. 【2602.06051】CAST: Character-and-Scene Episodic Memory for Agents
链接:https://arxiv.org/abs/2602.06051
作者:Kexin Ma,Bojun Li,Yuhua Tang,Ruochun Jin,Liting Sun
类目:Computation and Language (cs.CL)
关键词:coherent events grounded, recall coherent events, central component, component of human, coherent events
备注:
点击查看摘要
Abstract:Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where. However, most agent memory systems only emphasize semantic recall and treat experience as structures such as key-value, vector, or graph, which makes them struggle to represent and retrieve coherent events. To address this challenge, we propose a Character-and-Scene based memory architecture(CAST) inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles that summarize the events of a character to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, which yields a robust dual memory design. Experiments demonstrate that CAST has averagely improved 8.11% F1 and 10.21% J(LLM-as-a-Judge) than baselines on various datasets, especially on open and time-sensitive conversational questions.
72. 【2602.06050】Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
链接:https://arxiv.org/abs/2602.06050
作者:Jongha Kim,Byungoh Ko,Jeehye Na,Jinsung Yoon,Hyunwoo J. Kim
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language Models, Large Vision Language, Language Models, Large Vision, Vision Language
备注: WACV 2026
点击查看摘要
Abstract:Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at this https URL.
73. 【2602.06049】Recontextualizing Famous Quotes for Brand Slogan Generation
链接:https://arxiv.org/abs/2602.06049
作者:Ziao Yang,Zizhang Chen,Lei Zhang,Hongfu Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:shaping public perception, conveying brand identity, public perception, concise and memorable, memorable catchphrases
备注:
点击查看摘要
Abstract:Slogans are concise and memorable catchphrases that play a crucial role in advertising by conveying brand identity and shaping public perception. However, advertising fatigue reduces the effectiveness of repeated slogans, creating a growing demand for novel, creative, and insightful slogan generation. While recent work leverages large language models (LLMs) for this task, existing approaches often produce stylistically redundant outputs that lack a clear brand persona and appear overtly machine-generated. We argue that effective slogans should balance novelty with familiarity and propose a new paradigm that recontextualizes persona-related famous quotes for slogan generation. Well-known quotes naturally align with slogan-length text, employ rich rhetorical devices, and offer depth and insight, making them a powerful resource for creative generation. Technically, we introduce a modular framework that decomposes slogan generation into interpretable subtasks, including quote matching, structural decomposition, vocabulary replacement, and remix generation. Extensive automatic and human evaluations demonstrate marginal improvements in diversity, novelty, emotional impact, and human preference over three state-of-the-art LLM baselines.
74. 【2602.06699】Quantum Attention by Overlap Interference: Predicting Sequences from Classical and Many-Body Quantum Data
链接:https://arxiv.org/abs/2602.06699
作者:Alessio Pecilli,Matteo Rosati
类目:Quantum Physics (quant-ph); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, predicts future elements, forming overlap-weighted combinations, variational quantum implementation, implementation of self-attention
备注: 4 + 1 pages, 2 figures
点击查看摘要
Abstract:We propose a variational quantum implementation of self-attention (QSA), the core operation in transformers and large language models, which predicts future elements of a sequence by forming overlap-weighted combinations of past data. At variance with previous approaches, our QSA realizes the required nonlinearity through interference of state overlaps and returns a Renyi-1/2 cross-entropy loss directly as the expectation value of an observable, avoiding the need to decode amplitude-encoded predictions into classical logits. Furthermore, QSA naturally accommodates a constrained, trainable data-embedding that ties quantum state overlaps to data-level similarities. We find a gate complexity dominant scaling O(T d^2) for QSA, versus O(T^2 d) classically, suggesting an advantage in the practical regime where the sequence length T dominates the embedding size d. In simulations, we show that our QSA-based quantum transformer learns sequence prediction on classical data and on many-body transverse-field Ising quantum trajectories, establishing trainable attention as a practical primitive for quantum dynamical modeling.
75. 【2602.06180】STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs
链接:https://arxiv.org/abs/2602.06180
作者:Kaiyuan Zhang,Mohan Shi,Eray Eren,Natarajan Balaji Shankar,Zilai Wang,Abeer Alwan
类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
关键词:Neural audio codecs, token-based language models, Neural audio, integrated into token-based, token-based language
备注: ICASSP 2026
点击查看摘要
Abstract:Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.
76. 【2602.06065】Deep networks learn to parse uniform-depth context-free languages from local statistics
链接:https://arxiv.org/abs/2602.06065
作者:Jack T. Parley,Francesco Cagnetta,Matthieu Wyart
类目:Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, learned from sentences, cognitive science, science and machine, Language Models
备注:
点击查看摘要
Abstract:Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
信息检索
1. 【2602.06935】On the Efficiency of Sequentially Aware Recommender Systems: Cotten4Rec
链接:https://arxiv.org/abs/2602.06935
作者:Shankar Veludandi,Gulrukh Kurdistan,Uzma Mushtaque
类目:Information Retrieval (cs.IR)
关键词:historical behaviors, predict a user, user next interaction, interaction by modeling, modeling their historical
备注:
点击查看摘要
Abstract:Sequential recommendation (SR) models predict a user's next interaction by modeling their historical behaviors. Transformer-based SR methods, notably BERT4Rec, effectively capture these patterns but incur significant computational overhead due to extensive intermediate computations associated with Softmax-based attention. We propose Cotten4Rec, a novel SR model utilizing linear-time cosine similarity attention, implemented through a single optimized compute unified device architecture (CUDA) kernel. By minimizing intermediate buffers and kernel-launch overhead, Cotten4Rec substantially reduces resource usage compared to BERT4Rec and the linear-attention baseline, LinRec, especially for datasets with moderate sequence lengths and vocabulary sizes. Evaluations across three benchmark datasets confirm that Cotten4Rec achieves considerable reductions in memory and runtime with minimal compromise in recommendation accuracy, demonstrating Cotten4Rec's viability as an efficient alternative for practical, large-scale sequential recommendation scenarios where computational resources are critical.
2. 【2602.06654】Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan
链接:https://arxiv.org/abs/2602.06654
作者:Boyu Chen,Tai Guo,Weiyu Cui,Yuqing Li,Xingxing Wang,Chuan Shi,Cheng Yang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:enable precise retrieval, meet diverse user, Multimodal retrieval models, precise retrieval, food delivery
备注:
点击查看摘要
Abstract:Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the training process, while other modalities are neglected. In addition, inconsistent training speeds across modalities can easily result in the one-epoch problem. To address these challenges, we propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage, enabling it to effectively attend to and utilize multimodal features, and allowing flexible control over the training process at each stage to avoid the one-epoch problem. Furthermore, to better utilize the semantic IDs that compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks to help the model understand the associations between SIDs, queries, and item features, thereby improving overall performance. Extensive experiments on large-scale real-world Meituan data demonstrate that our method achieves improvements of 3.80%, 2.64%, and 2.17% on R@5, R@10, and R@20, and 5.10%, 4.22%, and 2.09% on N@5, N@10, and N@20 compared to mainstream baselines. Online A/B testing on the Meituan platform shows that our approach achieves a 1.12% increase in revenue and a 1.02% increase in click-through rate, validating the effectiveness and superiority of our method in practical applications.
3. 【2602.06622】R2LED: Equipping Retrieval and Refinement in Lifelong User Modeling with Semantic IDs for CTR Prediction
链接:https://arxiv.org/abs/2602.06622
作者:Qidong Liu,Gengnan Wang,Zhichen Liu,Moranxin Wang,Zijian Zhang,Xiao Han,Ni Zhang,Tao Qin,Chen Li
类目:Information Retrieval (cs.IR)
关键词:long-term behavior sequences, CTR prediction, sequences for CTR, Lifelong user modeling, leverages users' long-term
备注:
点击查看摘要
Abstract:Lifelong user modeling, which leverages users' long-term behavior sequences for CTR prediction, has been widely applied in personalized services. Existing methods generally adopted a two-stage "retrieval-refinement" strategy to balance effectiveness and efficiency. However, they still suffer from (i) noisy retrieval due to skewed data distribution and (ii) lack of semantic understanding in refinement. While semantic enhancement, e.g., LLMs modeling or semantic embeddings, offers potential solutions to these two challenges, these approaches face impractical inference costs or insufficient representation granularity. Obsorbing multi-granularity and lightness merits of semantic identity (SID), we propose a novel paradigm that equips retrieval and refinement in Lifelong User Modeling with SEmantic IDs (R2LED) to address these issues. First, we introduce a Multi-route Mixed Retrieval for the retrieval stage. On the one hand, it captures users' interests from various granularities by several parallel recall routes. On the other hand, a mixed retrieval mechanism is proposed to efficiently retrieve candidates from both collaborative and semantic views, reducing noise. Then, for refinement, we design a Bi-level Fusion Refinement, including a target-aware cross-attention for route-level fusion and a gate mechanism for SID-level fusion. It can bridge the gap between semantic and collaborative spaces, exerting the merits of SID. The comprehensive experimental results on two public datasets demonstrate the superiority of our method in both performance and efficiency. To facilitate the reproduction, we have released the code online this https URL.
4. 【2602.06563】okenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders
链接:https://arxiv.org/abs/2602.06563
作者:Yuchen Jiang,Jie Zhu,Xintian Han,Hui Lu,Kunmin Bai,Mingyu Yang,Shikang Wu,Ruihao Zhang,Wenlin Zhao,Shipeng Bai,Sijin Zhou,Huizhi Yang,Tianyi Liu,Wenda Liu,Ziyan Gong,Haoran Ding,Zheng Chai,Deping Xie,Zhe Chen,Yuchao Zheng,Peng Xu
类目:Information Retrieval (cs.IR)
关键词:gradually gained attention, validate scaling laws, scaling laws, recent years, gained attention
备注:
点击查看摘要
Abstract:In recent years, the study of scaling laws for large recommendation models has gradually gained attention. Works such as Wukong, HiFormer, and DHEN have attempted to increase the complexity of interaction structures in ranking models and validate scaling laws between performance and parameters/FLOPs by stacking multiple layers. However, their experimental scale remains relatively limited. Our previous work introduced the TokenMixer architecture, an efficient variant of the standard Transformer where the self-attention mechanism is replaced by a simple reshape operation, and the feed-forward network is adapted to a pertoken FFN. The effectiveness of this architecture was demonstrated in the ranking stage by the model presented in the RankMixer paper. However, this foundational TokenMixer architecture itself has several design limitations. In this paper, we propose TokenMixer-Large, which systematically addresses these core issues: sub-optimal residual design, insufficient gradient updates in deep models, incomplete MoE sparsification, and limited exploration of scalability. By leveraging a mixing-and-reverting operation, inter-layer residuals, the auxiliary loss and a novel Sparse-Pertoken MoE architecture, TokenMixer-Large successfully scales its parameters to 7-billion and 15-billion on online traffic and offline experiments, respectively. Currently deployed in multiple scenarios at ByteDance, TokenMixer -Large has achieved significant offline and online performance gains.
5. 【2602.06431】A methodology for analyzing financial needs hierarchy from social discussions using LLM
链接:https://arxiv.org/abs/2602.06431
作者:Abhishek Jangra,Sachin Thukral,Arnab Chatterjee,Jayasree Raveendran
类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:large-scale textual data, analyze large-scale textual, social media discourse, employing generative, textual data
备注: 15 pages, 5 figures, 4 tables
点击查看摘要
Abstract:This study examines the hierarchical structure of financial needs as articulated in social media discourse, employing generative AI techniques to analyze large-scale textual data. While human needs encompass a broad spectrum from fundamental survival to psychological fulfillment financial needs are particularly critical, influencing both individual well-being and day-to-day decision-making. Our research advances the understanding of financial behavior by utilizing large language models (LLMs) to extract and analyze expressions of financial needs from social media posts. We hypothesize that financial needs are organized hierarchically, progressing from short-term essentials to long-term aspirations, consistent with theoretical frameworks established in the behavioral sciences. Through computational analysis, we demonstrate the feasibility of identifying these needs and validate the presence of a hierarchical structure within them. In addition to confirming this structure, our findings provide novel insights into the content and themes of financial discussions online. By inferring underlying needs from naturally occurring language, this approach offers a scalable and data-driven alternative to conventional survey methodologies, enabling a more dynamic and nuanced understanding of financial behavior in real-world contexts.
6. 【2602.06393】MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
链接:https://arxiv.org/abs/2602.06393
作者:Geonmo Gu,Byeongho Heo,Jaemyung Yu,Jaehui Hwang,Taekyung Kim,Sangmin Lee,HeeJae Jun,Yoohoon Kang,Sangdoo Yun,Dongyoon Han
类目:Information Retrieval (cs.IR)
关键词:Large Language Models, Multimodal Large Language, Large Language, Language Models, traditionally employed contrastive
备注: 22 pages
点击查看摘要
Abstract:Universal Multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite its empirical success, they are primarily built on a "single-turn" formulation where each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple, related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments exhibit MuCo with a newly curated 5M multimodal multi-turn dataset (M3T), which yields state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks, while markedly enhancing both training efficiency and representation coherence across modalities. Code and M3T are available at this https URL
计算机视觉
1. 【2602.06965】MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
链接:https://arxiv.org/abs/2602.06965
作者:Ankan Deria,Komal Kumar,Adinath Madhavrao Dukre,Eran Segal,Salman Khan,Imran Razzak
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, medicine remains limited, Multimodal large, large language models, modality alignment
备注: 21 pages, 6 figures and 4 tables
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at this https URL
2. 【2602.06959】CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
链接:https://arxiv.org/abs/2602.06959
作者:Kaiyi Huang,Yukun Huang,Yu Li,Jianhong Bai,Xintao Wang,Zinan Lin,Xuefei Ning,Jiwen Yu,Pengfei Wan,Yu Wang,Xihui Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:constructing physical sets, production requires control, live-action shooting remains, shooting remains costly, remains costly due
备注: Project website: [this https URL](https://karine-huang.github.io/CineScene/)
点击查看摘要
Abstract:Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.
3. 【2602.06949】DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
链接:https://arxiv.org/abs/2602.06949
作者:Shenyuan Gao,William Liang,Kaiyuan Zheng,Ayaan Malik,Seonghyeon Ye,Sihyun Yu,Wei-Cheng Tseng,Yuzhu Dong,Kaichun Mo,Chen-Hsuan Lin,Qianli Ma,Seungjun Nah,Loic Magne,Jiannan Xiang,Yuqi Xie,Ruijie Zheng,Dantong Niu,You Liang Tan,K.R. Zentner,George Kurian,Suneel Indupuru,Pooya Jannaty,Jinwei Gu,Jun Zhang,Jitendra Malik,Pieter Abbeel,Ming-Yu Liu,Yuke Zhu,Joel Jang,Linxi "Jim" Fan
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:agents at scale, simulate the outcomes, varied environments, environments will revolutionize, revolutionize the development
备注: Project page: [this https URL](https://dreamdojo-world.github.io/)
点击查看摘要
Abstract:Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
4. 【2602.06938】Reliable Mislabel Detection for Video Capsule Endoscopy Data
链接:https://arxiv.org/abs/2602.06938
作者:Julia Werner,Julius Oexle,Oliver Bause,Maxime Le Floch,Franz Brinkmann,Hannah Tolle,Jochen Hampe,Oliver Bringmann
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:deep neural networks, neural networks relies, networks relies strongly, accurately annotated datasets, access to large
备注:
点击查看摘要
Abstract:The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging since annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries can often be ambiguous or difficult to define which further complicates machine learning-based classification. In this paper, we want to address this problem and introduce a framework for mislabel detection in medical datasets. This is validated on the two largest, publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of lowresolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and results in an improved anomaly detection performance after cleaning the datasets compared to current baselines.
5. 【2602.06914】Seeing Beyond Redundancy: Task Complexity's Role in Vision Token Specialization in VLLMs
链接:https://arxiv.org/abs/2602.06914
作者:Darryl Hannan,John Cooper,Dylan White,Yijing Watkins
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision large language, Vision capabilities, vision large, visual, large language models
备注: 25 pages
点击查看摘要
Abstract:Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. Then, we explore fine-tuning VLLMs on a number of complex visual tasks to better understand how redundancy and compression change based upon the complexity of the data that a model is trained on. We find that there is a connection between task complexity and visual compression, implying that having a sufficient ratio of high complexity visual data is crucial for altering the way that VLLMs distribute their visual representation and consequently improving their performance on complex visual tasks. We hope that this work will provide valuable insights for training the next generation of VLLMs.
6. 【2602.06912】PANC: Prior-Aware Normalized Cut for Object Segmentation
链接:https://arxiv.org/abs/2602.06912
作者:Juan Gutiérrez,Victor Gutiérrez-Garcia,José Luis Blanco-Murillo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:pipelines naively seek, segmentation pipelines naively, Fully unsupervised segmentation, pipelines naively, naively seek
备注:
点击查看摘要
Abstract:Fully unsupervised segmentation pipelines naively seek the most salient object, should this be present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. From the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). Contrarily, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.06912 [cs.CV]
(or
arXiv:2602.06912v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.06912
Focus to learn more
arXiv-issued DOI via DataCite</p>
7. 【2602.06886】Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
链接:https://arxiv.org/abs/2602.06886
作者:Yuxuan Yao,Yuxuan Chen,Hui Li,Kaihui Cheng,Qipeng Guo,Yuwei Sun,Zilong Dong,Jingdong Wang,Siyu Zhu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Diffusion Transformers, Multimodal Diffusion, Diffusion Transformers, bidirectional information flow, maintain separate text
备注: 18 pages
点击查看摘要
Abstract:Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs--SD3, SD3.5, and FLUX.1 by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text--image generation quality.
8. 【2602.06883】Vision Transformer Finetuning Benefits from Non-Smooth Components
链接:https://arxiv.org/abs/2602.06883
作者:Ambroise Odonnat,Laetitia Chapel,Romain Tavenard,Ievgen Redko
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
关键词:training stability, context of generalization, adversarial robustness, extensively studied, transformer architecture
备注:
点击查看摘要
Abstract:The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at this https URL.
9. 【2602.06879】NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices
链接:https://arxiv.org/abs/2602.06879
作者:Ruchika Chavhan,Malcolm Chadwick,Alberto Gil Couto Pimentel Ramos,Luca Morreale,Mehdi Noroozi,Abhinav Mehrotra
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:continue to improve, increasing scale, scale has widened, diffusion models continue, preserve generation quality
备注:
点击查看摘要
Abstract:While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) A model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) A ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) A novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512 x 512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.
10. 【2602.06871】RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing
链接:https://arxiv.org/abs/2602.06871
作者:Mohammadreza Salehi,Mehdi Noroozi,Luca Morreale,Ruchika Chavhan,Malcolm Chadwick,Alberto Gil Ramos,Abhinav Mehrotra
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling intuitive natural-language, intuitive natural-language control, Instructional video editing, Instructional video, text prompts
备注:
点击查看摘要
Abstract:Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model's prediction at t-1. To leverage videos' temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in: this https URL
11. 【2602.06862】Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing
链接:https://arxiv.org/abs/2602.06862
作者:Meng Lou,Stanley Yu,Yizhou Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Adapting pre-trained vision, achieve performance comparable, Adapting pre-trained, remains challenging, pre-trained vision models
备注:
点击查看摘要
Abstract:Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: this https URL.
12. 【2602.06850】Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping
链接:https://arxiv.org/abs/2602.06850
作者:Chao Zhou,Tianyi Wei,Yiling Chen,Wenbo Zhou,Nenghai Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:specific user requirements, models excel, subject appearances, excel at prompt-based, lack the fine-grained
备注:
点击查看摘要
Abstract:While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional ``concatenate-and-attend'' strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0$\times$ inference speedup and a 5.1$\times$ VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-conditioned generation.
13. 【2602.06830】GaussianPOP: Principled Simplification Framework for Compact 3D Gaussian Splatting via Error Quantification
链接:https://arxiv.org/abs/2602.06830
作者:Soonbin Lee,Yeong-Gyu Kim,Simon Sasse,Tomas M. Borges,Yago Sanchez,Eun-Seok Ryu,Thomas Schierl,Cornelius Hellge
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting simplification, identify redundant Gaussians, Gaussian Splatting, weights or sensitivity, Splatting simplification methods
备注:
点击查看摘要
Abstract:Existing 3D Gaussian Splatting simplification methods commonly use importance scores, such as blending weights or sensitivity, to identify redundant Gaussians. However, these scores are not driven by visual error metrics, often leading to suboptimal trade-offs between compactness and rendering fidelity. We present GaussianPOP, a principled simplification framework based on analytical Gaussian error quantification. Our key contribution is a novel error criterion, derived directly from the 3DGS rendering equation, that precisely measures each Gaussian's contribution to the rendered image. By introducing a highly efficient algorithm, our framework enables practical error calculation in a single forward pass. The framework is both accurate and flexible, supporting on-training pruning as well as post-training simplification via iterative error re-quantification for improved stability. Experimental results show that our method consistently outperforms existing state-of-the-art pruning methods across both application scenarios, achieving a superior trade-off between model compactness and high rendering quality.
14. 【2602.06825】AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models
链接:https://arxiv.org/abs/2602.06825
作者:Yuming Li,Qingyu Li,Chengyu Bai,Xiangyang Luo,Zeyue Xue,Wenyu Qin,Meng Wang,Yikai Wang,Shanghang Zhang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:static sampling strategies, Reinforcement learning, human feedback, shows promise, flow models
备注:
点击查看摘要
Abstract:Reinforcement learning from human feedback (RLHF) shows promise for aligning diffusion and flow models, yet policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. These methods treat all prompts and denoising steps uniformly, ignoring substantial variations in sample learning value as well as the dynamic nature of critical exploration moments. To address this issue, we conduct a detailed analysis of the internal attention dynamics during GRPO training and uncover a key insight: attention entropy can serve as a powerful dual-signal proxy. First, across different samples, the relative change in attention entropy ({\Delta}Entropy), which reflects the divergence between the current policy and the base policy, acts as a robust indicator of sample learning value. Second, during the denoising process, the peaks of absolute attention entropy (Entropy(t)), which quantify attention dispersion, effectively identify critical timesteps where high-value exploration occurs. Building on this observation, we propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy. At the global level, AEGPO uses {\Delta}Entropy to dynamically allocate rollout budgets, prioritizing prompts with higher learning value. At the local level, it exploits the peaks of Entropy(t) to guide exploration selectively at critical high-dispersion timesteps rather than uniformly across all denoising steps. By focusing computation on the most informative samples and the most critical moments, AEGPO enables more efficient and effective policy optimization. Experiments on text-to-image generation tasks demonstrate that AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.06825 [cs.LG]
(or
arXiv:2602.06825v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2602.06825
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Li Yuming [view email] [v1]
Fri, 6 Feb 2026 16:09:50 UTC (36,188 KB)
15. 【2602.06806】RAIGen: Rare Attribute Identification in Text-to-Image Generative Models
链接:https://arxiv.org/abs/2602.06806
作者:Silpa Vadakkeeveetil Sreelatha,Dan Wang,Serge Belongie,Muhammad Awais,Anjan Dutta
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:amplify training-data biases, models achieve impressive, achieve impressive generation, impressive generation quality, skewing coverage
备注:
点击查看摘要
Abstract:Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for un-supervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.
16. 【2602.06805】A Unified Formula for Affine Transformations between Calibrated Cameras
链接:https://arxiv.org/abs/2602.06805
作者:Levente Hajder
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:affine transformation mapping, transformation mapping local, mapping local image, local image patches, technical note
备注:
点击查看摘要
Abstract:In this technical note, we derive a closed-form expression for the affine transformation mapping local image patches between two calibrated views. We show that the transformation is a function of the relative camera pose, the image coordinates, and the local surface normal.
17. 【2602.06786】Machine Learning for Detection and Severity Estimation of Sweetpotato Weevil Damage in Field and Lab Conditions
链接:https://arxiv.org/abs/2602.06786
作者:Doreen M. Chelangat,Sudi Murindanyi,Bruce Mugizi,Paul Musana,Benard Yada,Milton A. Otema,Florence Osaru,Andrew Katumba,Joyce Nakatumba-Nabende
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:destructive pests impacting, Cylas spp., impacting sweetpotato production, sub-Saharan Africa, pests impacting sweetpotato
备注:
点击查看摘要
Abstract:Sweetpotato weevils (Cylas spp.) are considered among the most destructive pests impacting sweetpotato production, particularly in sub-Saharan Africa. Traditional methods for assessing weevil damage, predominantly relying on manual scoring, are labour-intensive, subjective, and often yield inconsistent results. These challenges significantly hinder breeding programs aimed at developing resilient sweetpotato varieties. This study introduces a computer vision-based approach for the automated evaluation of weevil damage in both field and laboratory contexts. In the field settings, we collected data to train classification models to predict root-damage severity levels, achieving a test accuracy of 71.43%. Additionally, we established a laboratory dataset and designed an object detection pipeline employing YOLO12, a leading real-time detection model. This methodology incorporated a two-stage laboratory pipeline that combined root segmentation with a tiling strategy to improve the detectability of small objects. The resulting model demonstrated a mean average precision of 77.7% in identifying minute weevil feeding holes. Our findings indicate that computer vision technologies can provide efficient, objective, and scalable assessment tools that align seamlessly with contemporary breeding workflows. These advancements represent a significant improvement in enhancing phenotyping efficiency within sweetpotato breeding programs and play a crucial role in mitigating the detrimental effects of weevils on food security.
18. 【2602.06778】Revisiting Emotions Representation for Recognition in the Wild
链接:https://arxiv.org/abs/2602.06778
作者:Joao Baptista Cardia Neto,Claudio Ferrari,Stefano Berretti
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:Facial emotion recognition, single-label classification problem, Facial emotion, single-label classification, cast emotion recognition
备注:
点击查看摘要
Abstract:Facial emotion recognition has been typically cast as a single-label classification problem of one out of six prototypical emotions. However, that is an oversimplification that is unsuitable for representing the multifaceted spectrum of spontaneous emotional states, which are most often the result of a combination of multiple emotions contributing at different intensities. Building on this, a promising direction that was explored recently is to cast emotion recognition as a distribution learning problem. Still, such approaches are limited in that research datasets are typically annotated with a single emotion class. In this paper, we contribute a novel approach to describe complex emotional states as probability distributions over a set of emotion classes. To do so, we propose a solution to automatically re-label existing datasets by exploiting the result of a study in which a large set of both basic and compound emotions is mapped to probability distributions in the Valence-Arousal-Dominance (VAD) space. In this way, given a face image annotated with VAD values, we can estimate the likelihood of it belonging to each of the distributions, so that emotional states can be described as a mixture of emotions, enriching their description, while also accounting for the ambiguous nature of their perception. In a preliminary set of experiments, we illustrate the advantages of this solution and a new possible direction of investigation. Data annotations are available at this https URL.
19. 【2602.06748】Gold Exploration using Representations from a Multispectral Autoencoder
链接:https://arxiv.org/abs/2602.06748
作者:Argyro Tsandalidou,Konstantinos Dogeas,Eleftheria Tetoula Tsonga,Elisavet Parselia,Georgios Tsimiklis,George Arvanitakis
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:prospectivity mapping due, large-scale prospectivity mapping, typically limited availability, Satellite imagery, prospectivity mapping
备注: Presented in Eurips2025, 1st Workshop: Advances in Representation Learning for Earth Observation
点击查看摘要
Abstract:Satellite imagery is employed for large-scale prospectivity mapping due to the high cost and typically limited availability of on-site mineral exploration data. In this work, we present a proof-of-concept framework that leverages generative representations learned from multispectral Sentinel-2 imagery to identify gold-bearing regions from space. An autoencoder foundation model, called Isometric, which is pretrained on the large-scale FalconSpace-S2 v1.0 dataset, produces information-dense spectral-spatial representations that serve as inputs to a lightweight XGBoost classifier. We compare this representation-based approach with a raw spectral input baseline using a dataset of 63 Sentinel-2 images from known gold and non-gold locations. The proposed method improves patch-level accuracy from 0.51 to 0.68 and image-level accuracy from 0.55 to 0.73, demonstrating that generative embeddings capture transferable mineralogical patterns even with limited labeled data. These results highlight the potential of foundation-model representations to make mineral exploration more efficient, scalable, and globally applicable.
20. 【2602.06743】Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening
链接:https://arxiv.org/abs/2602.06743
作者:Dong Chen,Zizhuang Wei,Jialei Xu,Xinyang Sun,Zonglin He,Meiru An,Huili Peng,Yong Hu,Kenneth MC Cheung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Adolescent Idiopathic Scoliosis, Adolescent Idiopathic, Idiopathic Scoliosis, prevalent spinal deformity, early detection
备注:
点击查看摘要
Abstract:Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our method establishes a new state-of-the-art, demonstrating a significant performance gap on a realistic, non-repeating subject benchmark. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.
21. 【2602.06695】Diffeomorphism-Equivariant Neural Networks
链接:https://arxiv.org/abs/2602.06695
作者:Josephine Elisabeth Oettinger,Zakhar Shumaylov,Johannes Bostelmann,Jan Lellmann,Carola-Bibiane Schönlieb
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Incorporating group symmetries, modern deep learning, Incorporating group, deep learning, overcoming the efficiency
备注:
点击查看摘要
Abstract:Incorporating group symmetries via equivariance into neural networks has emerged as a robust approach for overcoming the efficiency and data demands of modern deep learning. While most existing approaches, such as group convolutions and averaging-based methods, focus on compact, finite, or low-dimensional groups with linear actions, this work explores how equivariance can be extended to infinite-dimensional groups. We propose a strategy designed to induce diffeomorphism equivariance in pre-trained neural networks via energy-based canonicalisation. Formulating equivariance as an optimisation problem allows us to access the rich toolbox of already established differentiable image registration methods. Empirical results on segmentation and classification tasks confirm that our approach achieves approximate equivariance and generalises to unseen transformations without relying on extensive data augmentation or retraining.
22. 【2602.06676】Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction
链接:https://arxiv.org/abs/2602.06676
作者:Bo Du,Xiaochen Ma,Xuekang Zhu,Zhe Yang,Chaogun Niu,Jian Liu,Ji-Zhe Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Fake Image Detection, image forensic subdomains, real-world forensic scenarios, Fake Image, Image Detection
备注:
点击查看摘要
Abstract:Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, by discovering the ``heterogeneous phenomenon'', which is the intrinsic distinctness of artifacts across subdomains, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space driven by such phenomenon. The core challenge for developing a practical monolithic FID model thus boils down to the ``unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at:https: //github.com/scu-zjz/SICA_OpenMMSec.
23. 【2602.06674】CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis
链接:https://arxiv.org/abs/2602.06674
作者:Yonghao Si,Xingyuan Zeng,Zhao Chen,Libin Zheng,Caleb Chen Cao,Lei Chen,Jian Yin
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:advancing machine learning, High-quality annotated datasets, crucial for advancing, advancing machine, machine learning
备注:
点击查看摘要
Abstract:High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
24. 【2602.06663】PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
链接:https://arxiv.org/abs/2602.06663
作者:Junxian Li,Kai Liu,Leyang Chen,Weida Wang,Zhixin Wang,Jiaqi Xu,Fan Li,Renjing Pei,Linghe Kong,Yulun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unified multimodal models, Unified multimodal, shown impressive capabilities, supporting multimodal reasoning, generating natural images
备注: The main part of our paper: PlanViz Code is at: [this https URL](https://github.com/lijunxian111/PlanViz) Supplementary material is at: [this https URL](https://github.com/lijunxian111/PlanViz/releases/tag/v1)
点击查看摘要
Abstract:Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remain underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs have these capabilities to finish these tasks or not. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks which frequently involve in daily life and require planning steps. Specifically, three new sub-tasks are designed: route planning, work diagramming, and webUI displaying. We address challenges in data quality ensuring by curating human-annotated questions and reference images, and a quality control process. For challenges of comprehensive and exact evaluation, a task-adaptive score, PlanScore, is proposed. The score helps understanding the correctness, visual quality and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.
25. 【2602.06652】Same Answer, Different Representations: Hidden instability in VLMs
链接:https://arxiv.org/abs/2602.06652
作者:Farooq Ahmad Wani,Alessandro Suglia,Rohit Saxena,Aryo Pradipta Gema,Wai-Chung Kwan,Fazl Barez,Maria Sofia Bucarelli,Fabrizio Silvestri,Pasquale Minervini
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:stable multimodal processing, predictions reflect stable, stable predictions reflect, reflect stable multimodal, Vision Language Models
备注:
点击查看摘要
Abstract:The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
26. 【2602.06619】CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
链接:https://arxiv.org/abs/2602.06619
作者:Yuxin He,An Li,Cheng Xue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:intelligent operating rooms, context-aware decision support, limited annotated clinical, large domain gaps, Surgical phase recognition
备注:
点击查看摘要
Abstract:Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
27. 【2602.06613】DAVE: Distribution-aware Attribution via ViT Gradient Decomposition
链接:https://arxiv.org/abs/2602.06613
作者:Adam Wróbel,Siddhartha Gairola,Jacek Tabor,Bernt Schiele,Bartosz Zieliński,Dawid Rymarczyk
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:Vision Transformers, models remains challenging, computer vision, high-resolution attribution maps, remains challenging
备注: work under review. Code will be released upon acceptance
点击查看摘要
Abstract:Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE \textit{(\underline{D}istribution-aware \underline{A}ttribution via \underline{V}iT Gradient D\underline{E}composition)}, a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input--output mapping. It separates these from architecture-induced artifacts and other sources of instability.
28. 【2602.06592】ProtoQuant: Quantization of Prototypical Parts For General and Fine-Grained Image Classification
链接:https://arxiv.org/abs/2602.06592
作者:Mikołaj Janusz,Adam Wróbel,Bartosz Zieliński,Dawid Rymarczyk
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Prototypical parts-based models, parts-based models offer, require computationally expensive, Prototypical parts-based, expensive backbone finetuning
备注: Work under review. Code will be released upon acceptance
点击查看摘要
Abstract:Prototypical parts-based models offer a "this looks like that" paradigm for intrinsic interpretability, yet they typically struggle with ImageNet-scale generalization and often require computationally expensive backbone finetuning. Furthermore, existing methods frequently suffer from "prototype drift," where learned prototypes lack tangible grounding in the training distribution and change their activation under small perturbations. We present ProtoQuant, a novel architecture that achieves prototype stability and grounded interpretability through latent vector quantization. By constraining prototypes to a discrete learned codebook within the latent space, we ensure they remain faithful representations of the training data without the need to update the backbone. This design allows ProtoQuant to function as an efficient, interpretable head that scales to large-scale datasets. We evaluate ProtoQuant on ImageNet and several fine-grained benchmarks (CUB-200, Cars-196). Our results demonstrate that ProtoQuant achieves competitive classification accuracy while generalizing to ImageNet and comparable interpretability metrics to other prototypical-parts-based methods.
29. 【2602.06590】An Integer Linear Programming Approach to Geometrically Consistent Partial-Partial Shape Matching
链接:https://arxiv.org/abs/2602.06590
作者:Viktoria Ehm,Paul Roetzer,Florian Bernard,Daniel Cremers
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision, task of establishing, shape matching, establishing correspondences, long-standing challenge
备注:
点击查看摘要
Abstract:The task of establishing correspondences between two 3D shapes is a long-standing challenge in computer vision. While numerous studies address full-full and partial-full 3D shape matching, only a limited number of works have explored the partial-partial setting, very likely due to its unique challenges: we must compute accurate correspondences while at the same time find the unknown overlapping region. Nevertheless, partial-partial 3D shape matching reflects the most realistic setting, as in many real-world cases, such as 3D scanning, shapes are only partially observable. In this work, we introduce the first integer linear programming approach specifically designed to address the distinctive challenges of partial-partial shape matching. Our method leverages geometric consistency as a strong prior, enabling both robust estimation of the overlapping region and computation of neighbourhood-preserving correspondences. We empirically demonstrate that our approach achieves high-quality matching results both in terms of matching error and smoothness. Moreover, we show that our method is more scalable than previous formalisms.
30. 【2602.06575】hink Proprioceptively: Embodied Visual Reasoning for VLA Manipulation
链接:https://arxiv.org/abs/2602.06575
作者:Fangyuan Wang,Peng Zhou,Jiaming Qi,Shipeng Lyu,David Navarro-Alarcon,Guodong Guo
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:models typically inject, shaping instruction understanding, late conditioning signal, typically inject proprioception, prevents robot state
备注:
点击查看摘要
Abstract:Vision-language-action (VLA) models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended throughout the policy. We introduce ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets embodied state participate in subsequent visual reasoning and token selection, biasing computation toward action-critical evidence while suppressing redundant visual tokens. In a systematic ablation over proprioception encoding, state entry point, and action-head conditioning, we find that text tokenization is more effective than learned projectors, and that retaining roughly 15% of visual tokens can match the performance of using the full token set. Across CALVIN, LIBERO, and real-world manipulation, ThinkProprio matches or improves over strong baselines while reducing end-to-end inference latency over 50%.
31. 【2602.06566】SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
链接:https://arxiv.org/abs/2602.06566
作者:Niccolo Avogaro,Nayanika Debnath,Li Mi,Thomas Frick,Junling Wang,Zexue He,Hang Hua,Konrad Schindler,Mattia Rigotti
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:completely wrong answers, small perceptual mistakes, leading to long, recent successes, dynamically expanding
备注:
点击查看摘要
Abstract:Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
32. 【2602.06556】LIBERO-X: Robustness Litmus for Vision-Language-Action Models
链接:https://arxiv.org/abs/2602.06556
作者:Guodong Wang,Chenkai Zhang,Qingjie Liu,Jinjin Zhang,Jiancheng Cai,Junjie Liu,Xinmin Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:alignment of perception, perception with language-driven, language-driven manipulation tasks, VLA, language-driven manipulation
备注: 19 pages, 14 figures and 8 tables
点击查看摘要
Abstract:Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
33. 【2602.06548】NECromancer: Breathing Life into Skeletons via BVH Animation
链接:https://arxiv.org/abs/2602.06548
作者:Mingxi Xu,Qi Wang,Zhengyu Wen,Phong Dao Thien,Zhengyu Li,Ning Zhang,Xiaoyu He,Wei Zhao,Kehong Gong,Mingyuan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Unified BVH Universe, limiting their applicability, arbitrary BVH skeletons, existing approaches, approaches are restricted
备注:
点击查看摘要
Abstract:Motion tokenization is a key component of generalizable motion models, yet most existing approaches are restricted to species-specific skeletons, limiting their applicability across diverse morphologies. We propose NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons. NEC consists of three components: (1) an Ontology-aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files, including joint semantics, rest-pose offsets, and skeletal topology, into skeletal embeddings; (2) a Topology-Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology-invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large-scale dataset aggregating BVH motions across heterogeneous skeletons. Experiments show that NEC achieves high-fidelity reconstruction under substantial compression and effectively disentangles motion from skeletal structure. The resulting token space supports cross-species motion transfer, composition, denoising, generation with token-based models, and text-motion retrieval, establishing a unified framework for motion analysis and synthesis across diverse morphologies. Demo page: this https URL
34. 【2602.06530】Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance
链接:https://arxiv.org/abs/2602.06530
作者:Haipeng Li,Rongxuan Peng,Anwei Luo,Shunquan Tan,Changsheng Chen,Anastasia Antsiferova
类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
关键词:technologies poses significant, poses significant challenges, AI-Generated Content, AIGC detectors, technologies poses
备注: 17 pages, 11 figures
点击查看摘要
Abstract:The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attack, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attack without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
35. 【2602.06529】AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion
链接:https://arxiv.org/abs/2602.06529
作者:Mingyu Dou,Shi Qiu,Ming Hu,Yifan Chen,Huping Ye,Xiaohan Liao,Zhe Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Remote sensing change, Remote sensing, sensing change detection, change detection plays, urban planning
备注:
点击查看摘要
Abstract:Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89\% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at this https URL.
36. 【2602.06523】MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices
链接:https://arxiv.org/abs/2602.06523
作者:Mridankan Mandal
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Human Activity Recognition, Human Activity, Activity Recognition, wearables requires models, resource constrained wearables
备注:
点击查看摘要
Abstract:Human Activity Recognition (HAR) on resource constrained wearables requires models that balance accuracy against strict memory and computational budgets. State of the art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task dependent component contributions where bidirectionality benefits episodic event detection, but provides marginal gains on periodic locomotion. INT8 post training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory constrained edge devices.
37. 【2602.06521】DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving
链接:https://arxiv.org/abs/2602.06521
作者:Feiyang jia,Lin Liu,Ziying Song,Caiyan Jia,Hangjun Ye,Xiaoshuai Hao,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:recently attracted increasing, attracted increasing interest, autonomous driving, interest in unifying, driving has recently
备注: 20 pages, 7 tables, 12 figures
点击查看摘要
Abstract:End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reducing reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, facilitating the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and 0.16 3-second average collision rate on nuScenes. Code and models will be released in this https URL.
38. 【2602.06507】FloorplanVLM: A Vision-Language Model for Floorplan Vectorization
链接:https://arxiv.org/abs/2602.06507
作者:Yuanqing Liu,Ziming Yang,Yulong Li,Yue Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Converting raster floorplans, engineering-grade vector graphics, Converting raster, engineering-grade vector, vector graphics
备注:
点击查看摘要
Abstract:Converting raster floorplans into engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. To address this, we present FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task. Unlike pixel-based methods that rely on fragile heuristics or query-based transformers that generate fragmented rooms, our model directly outputs structured JSON sequences representing the global topology. This 'pixels-to-sequence' paradigm enables the precise and holistic constraint satisfaction of complex geometries, such as slanted walls and curved arcs. To support this data-hungry approach, we introduce a scalable data engine: we construct a large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K) to balance geometric diversity and pixel-level precision. We then employ a progressive training strategy, using Supervised Fine-Tuning (SFT) for structural grounding and quality annealing, followed by Group Relative Policy Optimization (GRPO) for strict geometric alignment. To standardize evaluation on complex layouts, we establish and open-source FPBench-2K. Evaluated on this rigorous benchmark, FloorplanVLM demonstrates exceptional structural validity, achieving $\textbf{92.52%}$ external-wall IoU and robust generalization across non-Manhattan architectures.
39. 【2602.06504】MultiGraspNet: A Multitask 3D Vision Model for Multi-gripper Robotic Grasping
链接:https://arxiv.org/abs/2602.06504
作者:Stephany Ortuno-Chanelo,Paolo Rabino,Enrico Civitelli,Tatiana Tommasi,Raffaello Camoriano
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:draining industrial tasks, grasping automate critical, automate critical, Vision-based models, draining industrial
备注:
点击查看摘要
Abstract:Vision-based models for robotic grasping automate critical, repetitive, and draining industrial tasks. Existing approaches are typically limited in two ways: they either target a single gripper and are potentially applied on costly dual-arm setups, or rely on custom hybrid grippers that require ad-hoc learning procedures with logic that cannot be transferred across tasks, restricting their general applicability. In this work, we present MultiGraspNet, a novel multitask 3D deep learning method that predicts feasible poses simultaneously for parallel and vacuum grippers within a unified framework, enabling a single robot to handle multiple end effectors. The model is trained on the richly annotated GraspNet-1Billion and SuctionNet-1Billion datasets, which have been aligned for the purpose, and generates graspability masks quantifying the suitability of each scene point for successful grasps. By sharing early-stage features while maintaining gripper-specific refiners, MultiGraspNet effectively leverages complementary information across grasping modalities, enhancing robustness and adaptability in cluttered scenes. We characterize MultiGraspNet's performance with an extensive experimental analysis, demonstrating its competitiveness with single-task models on relevant benchmarks. We run real-world experiments on a single-arm multi-gripper robotic setup showing that our approach outperforms the vacuum baseline, grasping 16% percent more seen objects and 32% more of the novel ones, while obtaining competitive results for the parallel task.
40. 【2602.06503】Forest canopy height estimation from satellite RGB imagery using large-scale airborne LiDAR-derived training data and monocular depth estimation
链接:https://arxiv.org/abs/2602.06503
作者:Yongkang Lai,Xihan Mu,Tim R. McVicar,Dasheng Fan,Donghui Xie,Shanxin Guo,Wenli Huang,Tianjie Zhao,Guangjian Yan
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Ecosystem Dynamics Investigation, Global Ecosystem Dynamics, height mapping plays, forest canopy height, canopy height
备注:
点击查看摘要
Abstract:Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km2 of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km2) and the United States (approximately 116 km2). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.
41. 【2602.06494】DreamHome-Pano: Design-Aware and Conflict-Free Panoramic Interior Generation
链接:https://arxiv.org/abs/2602.06494
作者:Lulu Chen,Yijiang Hu,Yuanqing Liu,Yulong Li,Yue Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:personalized spaces frequently, spaces frequently necessitates, specific stylistic preferences, modern interior design, personalized spaces
备注:
点击查看摘要
Abstract:In modern interior design, the generation of personalized spaces frequently necessitates a delicate balance between rigid architectural structural constraints and specific stylistic preferences. However, existing multi-condition generative frameworks often struggle to harmonize these inputs, leading to "condition conflicts" where stylistic attributes inadvertently compromise the geometric precision of the layout. To address this challenge, we present DreamHome-Pano, a controllable panoramic generation framework designed for high-fidelity interior synthesis. Our approach introduces a Prompt-LLM that serves as a semantic bridge, effectively translating layout constraints and style references into professional descriptive prompts to achieve precise cross-modal alignment. To safeguard architectural integrity during the generative process, we develop a Conflict-Free Control architecture that incorporates structural-aware geometric priors and a multi-condition decoupling strategy, effectively suppressing stylistic interference from eroding the spatial layout. Furthermore, we establish a comprehensive panoramic interior benchmark alongside a multi-stage training pipeline, encompassing progressive Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Experimental results demonstrate that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust and professional-grade solution for panoramic interior visualization.
42. 【2602.06488】Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction
链接:https://arxiv.org/abs/2602.06488
作者:Zizhan Guo,Yi Feng,Mengtan Zhang,Haoran Zhang,Wei Ye,Rui Fan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision-centric autonomous driving, single image, remains a fundamental, autonomous driving, occluded regions
备注:
点击查看摘要
Abstract:Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
43. 【2602.06484】Instance-Free Domain Adaptive Object Detection
链接:https://arxiv.org/abs/2602.06484
作者:Hengfu Yu,Jinhong Deng,Lixin Duan,Wen Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Adaptive Object Detection, Domain Adaptive Object, made significant strides, Adaptive Object, Object Detection
备注: 14 pages, 12 figures
点击查看摘要
Abstract:While Domain Adaptive Object Detection (DAOD) has made significant strides, most methods rely on unlabeled target data that is assumed to contain sufficient foreground instances. However, in many practical scenarios (e.g., wildlife monitoring, lesion detection), collecting target domain data with objects of interest is prohibitively costly, whereas background-only data is abundant. This common practical constraint introduces a significant technical challenge: the difficulty of achieving domain alignment when target instances are unavailable, forcing adaptation to rely solely on the target background information. We formulate this challenge as the novel problem of Instance-Free Domain Adaptive Object Detection. To tackle this, we propose the Relational and Structural Consistency Network (RSCN) which pioneers an alignment strategy based on background feature prototypes while simultaneously encouraging consistency in the relationship between the source foreground features and the background features within each domain, enabling robust adaptation even without target instances. To facilitate research, we further curate three specialized benchmarks, including simulative auto-driving detection, wildlife detection, and lung nodule detection. Extensive experiments show that RSCN significantly outperforms existing DAOD methods across all three benchmarks in the instance-free scenario. The code and benchmarks will be released soon.
44. 【2602.06478】Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention
链接:https://arxiv.org/abs/2602.06478
作者:Xiaosong Jia,Yihang Sun,Junqi You,Songbur Wong,Zichen Zou,Junchi Yan,Zuxuan Wu,Yu-Gang Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Feedforward models, recently advanced, advanced by transformer-based, transformer-based methods, NVS
备注: Accepted at ICLR 2026
点击查看摘要
Abstract:Feedforward models for novel view synthesis (NVS) have recently advanced by transformer-based methods like LVSM, using attention among all input and target views. In this work, we argue that its full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed. Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with KV-cache, thanks to its decoupled designs.
45. 【2602.06474】LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection
链接:https://arxiv.org/abs/2602.06474
作者:Xu Zhang,Zhe Chen,Jing Zhang,Dacheng Tao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Grounding DINO excel, GLIP and Grounding, Grounding DINO, DINO excel, Foundation object detectors
备注:
点击查看摘要
Abstract:Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection, where object boundaries are often ambiguous, and LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.
46. 【2602.06452】Exploring Specular Reflection Inconsistency for Generalizable Face Forgery Detection
链接:https://arxiv.org/abs/2602.06452
作者:Hongyan Fei,Zexi Jia,Chuanwei Huang,Jinchao Zhang,Jie Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieve unprecedented quality, Detecting deepfakes, quality and resolution, specular reflection, increasingly challenging
备注:
点击查看摘要
Abstract:Detecting deepfakes has become increasingly challenging as forgery faces synthesized by AI-generated methods, particularly diffusion models, achieve unprecedented quality and resolution. Existing forgery detection approaches relying on spatial and frequency features demonstrate limited efficacy against high-quality, entirely synthesized forgeries. In this paper, we propose a novel detection method grounded in the observation that facial attributes governed by complex physical laws and multiple parameters are inherently difficult to replicate. Specifically, we focus on illumination, particularly the specular reflection component in the Phong illumination model, which poses the greatest replication challenge due to its parametric complexity and nonlinear formulation. We introduce a fast and accurate face texture estimation method based on Retinex theory to enable precise specular reflection separation. Furthermore, drawing from the mathematical formulation of specular reflection, we posit that forgery evidence manifests not only in the specular reflection itself but also in its relationship with corresponding face texture and direct light. To address this issue, we design the Specular-Reflection-Inconsistency-Network (SRI-Net), incorporating a two-stage cross-attention mechanism to capture these correlations and integrate specular reflection related features with image features for robust forgery detection. Experimental results demonstrate that our method achieves superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly those containing diffusion-generated forgery faces.
47. 【2602.06450】What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
链接:https://arxiv.org/abs/2602.06450
作者:Xingsong Ye,Yongkun Du,JiaXin Zhang,Chen Li,Jing LYU,Zhineng Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Scene Text Recognition, training effective Scene, effective Scene Text, Text Recognition, Scene Text
备注:
点击查看摘要
Abstract:Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance by only seeing 9% of real data labels.
48. 【2602.06442】ChatUMM: Robust Context Tracking for Conversational Interleaved Generation
链接:https://arxiv.org/abs/2602.06442
作者:Wenxun Dai,Zhiyuan Zhao,Yule Zhong,Yiji Cheng,Jianwei Zhang,Linqing Wang,Shiyi Zhang,Yunlong Lin,Runze He,Fellix Song,Wayne Zhuang,Yong Liu,Haoji Zhang,Yansong Tang,Qinglin Lu,Chunyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, single-turn interaction paradigm, interaction paradigm, effectively functioning, achieved remarkable
备注: ChatUMM Project
点击查看摘要
Abstract:Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via ``distractor'' turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
49. 【2602.06427】Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters
链接:https://arxiv.org/abs/2602.06427
作者:Yuxiang Zhao,Yirong Yang,Yanqing Zhu,Yanfen Shen,Chiyu Wang,Zhining Gu,Pei Shi,Wei Guo,Mu Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:holds significant promise, navigation holds significant, Embodied navigation holds, last-mile delivery, holds significant
备注:
点击查看摘要
Abstract:Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
50. 【2602.06425】POPL-KF: A Pose-Only Geometric Representation-Based Kalman Filter for Point-Line-Based Visual-Inertial Odometry
链接:https://arxiv.org/abs/2602.06425
作者:Aiping Wang,Zhaolong Yang,Shuwen Chen,Hai Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Mainstream Visual-inertial odometry, Mainstream Visual-inertial, Visual-inertial odometry, motion estimation, VIO
备注:
点击查看摘要
Abstract:Mainstream Visual-inertial odometry (VIO) systems rely on point features for motion estimation and localization. However, their performance degrades in challenging scenarios. Moreover, the localization accuracy of multi-state constraint Kalman filter (MSCKF)-based VIO systems suffers from linearization errors associated with feature 3D coordinates and delayed measurement updates. To improve the performance of VIO in challenging scenes, we first propose a pose-only geometric representation for line features. Building on this, we develop POPL-KF, a Kalman filter-based VIO system that employs a pose-only geometric representation for both point and line features. POPL-KF mitigates linearization errors by explicitly eliminating both point and line feature coordinates from the measurement equations, while enabling immediate update of visual measurements. We also design a unified base-frames selection algorithm for both point and line features to ensure optimal constraints on camera poses within the pose-only measurement model. To further improve line feature quality, a line feature filter based on image grid segmentation and bidirectional optical flow consistency is proposed. Our system is evaluated on public datasets and real-world experiments, demonstrating that POPL-KF outperforms the state-of-the-art (SOTA) filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS), while maintaining real-time performance.
51. 【2602.06422】Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO
链接:https://arxiv.org/abs/2602.06422
作者:Yunze Tong,Mushui Liu,Canyu Zhao,Wanggui He,Shiyi Zhang,Hongwei Zhang,Peng Zhang,Jinlong Liu,Ju Huang,Jiamang Wang,Hao Jiang,Pipei Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Flow Matching models, Flow Matching, Deploying GRPO, Matching models, GRPO on Flow
备注: 18 pages, in submission
点击查看摘要
Abstract:Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points-steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend-and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at this https URL.
52. 【2602.06419】Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors
链接:https://arxiv.org/abs/2602.06419
作者:Soham Pahari,Sandeep C. Kumain
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:three-dimensional objects emerges, Human visual attention, objects emerges, top-down semantic recognition, Human visual
备注:
点击查看摘要
Abstract:Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
53. 【2602.06406】Point Virtual Transformer
链接:https://arxiv.org/abs/2602.06406
作者:Veerain Sood,Bnalin,Gaurav Pandey
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:reliable geometric cues, detect far-field objects, far-field objects due, long ranges, geometric cues
备注: 8 pages, 4 figures
点击查看摘要
Abstract:LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
54. 【2602.06405】A neuromorphic model of the insect visual system for natural image processing
链接:https://arxiv.org/abs/2602.06405
作者:Adam D. Hines,Karin Nordström,Andrew B. Barron
类目:Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
关键词:complex behaviors including, behaviors including associative, understanding biological visual, including associative learning, supports complex behaviors
备注: 21 pages, 7 figures, under review
点击查看摘要
Abstract:Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.
55. 【2602.06402】MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
链接:https://arxiv.org/abs/2602.06402
作者:Wenjie Wang,Wei Wu,Ying Liu,Yuan Zhao,Xiaole Lv,Liang Diao,Zengjian Fan,Wenfeng Xie,Ziling Lin,De Shi,Lin Huang,Kaihe Xu,Hong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:field-level exact matching, requiring strict field-level, strict field-level exact, domain-specific terminology, complex layouts
备注: 20 pages, 8 figures. Technical report
点击查看摘要
Abstract:Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
56. 【2602.06400】FusionOcc: Student's t-Distribution Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
链接:https://arxiv.org/abs/2602.06400
作者:Zhenxing Ming,Julie Stephany Berrio,Mao Shan,Stewart Worrall
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:semantic occupancy prediction, enables autonomous vehicles, prediction enables autonomous, occupancy prediction enables, semantic occupancy
备注:
点击查看摘要
Abstract:3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive fine-grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision-making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, the intermediate representations used by existing methods for 3D semantic occupancy prediction rely heavily on 3D voxel volumes or a set of 3D Gaussians, hindering the model's ability to efficiently and effectively capture fine-grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object-centric multi-sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi-stage multi-sensor fusion, Student's t-distribution, and the T-Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieved state-of-the-art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes-C dataset to demonstrate the robustness of the proposed method in different camera and lidar corruption scenarios. The code will be available at: this https URL
57. 【2602.06391】POINTS-GUI-G: GUI-Grounding Journey
链接:https://arxiv.org/abs/2602.06391
作者:Zhongyin Zhao,Yuan Liu,Yikun Liu,Haicheng Wang,Le Tian,Xiao Zhou,Yangxiu You,Zilin Yu,Yang Yu,Jie Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:repetitive digital workflows, hold immense potential, automating complex tasks, flight booking, digital workflows
备注:
点击查看摘要
Abstract:The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source datasets format alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
58. 【2602.06369】Revisiting Salient Object Detection from an Observer-Centric Perspective
链接:https://arxiv.org/abs/2602.06369
作者:Fuxi Zhang,Yifan Wang,Hengrun Zhao,Zhuohan Sun,Changxing Xia,Lijun Wang,Huchuan Lu,Yangrui Shao,Chen Yang,Long Teng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Salient object detection, inherently a subjective, priors may perceive, object detection, subjective problem
备注:
点击查看摘要
Abstract:Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single groundtruth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also the observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset named OC-SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like "Perceive-Reflect-Adjust" process. Extensive experiments on our proposed OC-SODBench have justified the effectiveness of our contribution. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly "salient." Code and dataset are publicly available at: this https URL
59. 【2602.06363】Robust Pedestrian Detection with Uncertain Modality
链接:https://arxiv.org/abs/2602.06363
作者:Qian Bie,Xiao Wang,Bin Yang,Zhixi Yu,Jun Chen,Xin Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:URL captures rich, http URL captures, http URL, cross-modal pedestrian detection, RGB and thermal-infrared
备注: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file
点击查看摘要
Abstract:Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24h-surveillance this http URL captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. While the near-infrared (NIR) captures texture under low-light conditions, which effectively alleviates performance issues of RGB and detail loss in TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB-NIR-TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenge existing CMPD methods that fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement to ensure the reliability of information within the modality. Furthermore, we design a Modality-Aware Interaction (MAI) module to adaptively activate or deactivate its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities.
60. 【2602.06355】Di3PO -- Diptych Diffusion DPO for Targeted Improvements in Image
链接:https://arxiv.org/abs/2602.06355
作者:Sanjana Reddy(1),Ishaan Malhi(2),Sally Ma(2),Praneet Dutta(2) ((1) Google, (2) Google DeepMind)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:computationally expensive generation, expensive generation steps, Existing methods, rely on computationally, generation steps
备注:
点击查看摘要
Abstract:Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that either lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce "Di3PO", a novel method for constructing positive and negative pairs that isolates specific regions targeted for improvement during preference tuning, while keeping the surrounding context in the image stable. We demonstrate the efficacy of our approach by applying it to the challenging task of text rendering in diffusion models, showcasing improvements over baseline methods of SFT and DPO.
61. 【2602.06351】rifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
链接:https://arxiv.org/abs/2602.06351
作者:Longhui Ma,Di Zhao,Siwei Wang,Zhao Lv,Miao Wang
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:maps natural language, natural language instructions, correct interface elements, grounding maps natural, GUI agents
备注: 17 pages, 10 figures
点击查看摘要
Abstract:GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.
62. 【2602.06346】FlowConsist: Make Your Flow Consistent with Real Trajectory
链接:https://arxiv.org/abs/2602.06346
作者:Tianyi Zhang,Chengcheng Liu,Jinwei Chen,Chun-Le Guo,Chongyi Li,Ming-Ming Cheng,Bo Li,Peng-Tao Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:directly predict ODE, ODE path integrals, predict ODE path, iterative sampling process, ODE path
备注:
点击查看摘要
Abstract:Fast flow models accelerate the iterative sampling process by learning to directly predict ODE path integrals, enabling one-step or few-step generation. However, we argue that current fast-flow training paradigms suffer from two fundamental issues. First, conditional velocities constructed from randomly paired noise-data samples introduce systematic trajectory drift, preventing models from following a consistent ODE path. Second, the model's approximation errors accumulate over time steps, leading to severe deviations across long time intervals. To address these issues, we propose FlowConsist, a training framework designed to enforce trajectory consistency in fast flows. We propose a principled alternative that replaces conditional velocities with the marginal velocities predicted by the model itself, aligning optimization with the true trajectory. To further address error accumulation over time steps, we introduce a trajectory rectification strategy that aligns the marginal distributions of generated and real samples at every time step along the trajectory. Our method establishes a new state-of-the-art on ImageNet 256$\times$256, achieving an FID of 1.52 with only 1 sampling step.
63. 【2602.06343】Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering
链接:https://arxiv.org/abs/2602.06343
作者:Weiquan Wang,Feifei Shao,Lin Li,Zhen Wang,Jun Xiao,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:monocular videos typically, videos typically degrades, typically degrades catastrophically, catastrophically under occlusions, dynamic humans
备注:
点击查看摘要
Abstract:High-fidelity rendering of dynamic humans from monocular videos typically degrades catastrophically under occlusions. Existing solutions incorporate external priors-either hallucinating missing content via generative models, which induces severe temporal flickering, or imposing rigid geometric heuristics that fail to capture diverse appearances. To this end, we reformulate the task as a Maximum A Posteriori estimation problem under heteroscedastic observation noise. In this paper, we propose U-4DGS, a framework integrating a Probabilistic Deformation Network and a Double Rasterization pipeline. This architecture renders pixel-aligned uncertainty maps that act as an adaptive gradient modulator, automatically attenuating artifacts from unreliable observations. Furthermore, to prevent geometric drift in regions lacking reliable visual cues, we enforce Confidence-Aware Regularizations, which leverage the learned uncertainty to selectively propagate spatial-temporal validity. Extensive experiments on ZJU-MoCap and OcMotion demonstrate that U-4DGS achieves SOTA rendering fidelity and robustness.
64. 【2602.06335】SPDA-SAM: A Self-prompted Depth-Aware Segment Anything Model for Instance Segmentation
链接:https://arxiv.org/abs/2602.06335
作者:Yihan Shang,Wei Wang,Chao Huang,Xinghui Dong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Segment Anything Model, demonstrated strong generalizability, instance segmentation tasks, demonstrated strong, strong generalizability
备注:
点击查看摘要
Abstract:Recently, Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance is severely dependent on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self-prompted Depth-Aware SAM (SPDA-SAM) for instance segmentation. Specifically, we design a Semantic-Spatial Self-prompt Module (SSSPM) which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse-to-Fine RGB-D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image and the depth map estimated from it are fused. In particular, the structural information in the depth map is used to provide coarse-grained guidance to feature fusion, while local variations in depth are encoded in order to fuse fine-grained feature representations. To our knowledge, SAM has not been explored in such self-prompted and depth-aware manners. Experimental results demonstrate that our SPDA-SAM outperforms its state-of-the-art counterparts across twelve different data sets. These promising results should be due to the guidance of the self-prompts and the compensation for the spatial information loss by the coarse-to-fine RGB-D fusion operation.
65. 【2602.06333】aming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation
链接:https://arxiv.org/abs/2602.06333
作者:Gensheng Pei,Xiruo Jiang,Yazhou Yao,Xiangbo Shu,Fumin Shen,Byeungwoo Jeon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:revolutionized Open-Vocabulary Segmentation, promptable concept segmentation, grounds pixel predictions, textit, Open-Vocabulary Segmentation
备注:
点击查看摘要
Abstract:The recent introduction of \texttt{SAM3} has revolutionized Open-Vocabulary Segmentation (OVS) through \textit{promptable concept segmentation}, which grounds pixel predictions in flexible concept prompts. However, this reliance on pre-defined concepts makes the model vulnerable: when visual distributions shift (\textit{data drift}) or conditional label distributions evolve (\textit{concept drift}) in the target domain, the alignment between visual evidence and prompts breaks down. In this work, we present \textsc{ConceptBank}, a parameter-free calibration framework to restore this alignment on the fly. Instead of adhering to static prompts, we construct a dataset-specific concept bank from the target statistics. Our approach (\textit{i}) anchors target-domain evidence via class-wise visual prototypes, (\textit{ii}) mines representative supports to suppress outliers under data drift, and (\textit{iii}) fuses candidate concepts to rectify concept drift. We demonstrate that \textsc{ConceptBank} effectively adapts \texttt{SAM3} to distribution drifts, including challenging natural-scene and remote-sensing scenarios, establishing a new baseline for robustness and efficiency in OVS. Code and model are available at this https URL.
66. 【2602.06330】Halt the Hallucination: Decoupling Signal and Semantic OOD Detection Based on Cascaded Early Rejection
链接:https://arxiv.org/abs/2602.06330
作者:Ningkang Peng,Chuanjie Cheng,Jingyang Mao,Xiaoqian Peng,Feng Xing,Bo Zhang,Chao Tan,Zhichao Zheng,Peiheng Li,Yanhui Gu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low-level statistical noise, execute full-scale inference, http URL, Structural Energy Sieve, Efficient and robust
备注:
点击查看摘要
Abstract:Efficient and robust Out-of-Distribution (OOD) detection is paramount for safety-critical this http URL, existing methods still execute full-scale inference on low-level statistical noise. This computational mismatch not only incurs resource waste but also induces semantic hallucination, where deep networks forcefully interpret physical anomalies as high-confidence semantic this http URL address this, we propose the Cascaded Early Rejection (CER) framework, which realizes hierarchical filtering for anomaly detection via a coarse-to-fine this http URL comprises two core modules: 1)Structural Energy Sieve (SES), which establishes a non-parametric barrier at the network entry using the Laplacian operator to efficiently intercept physical signal anomalies; and 2) the Semantically-aware Hyperspherical Energy (SHE) detector, which decouples feature magnitude from direction in intermediate layers to identify fine-grained semantic deviations. Experimental results demonstrate that CER not only reduces computational overhead by 32% but also achieves a significant performance leap on the CIFAR-100 benchmark:the average FPR95 drastically decreases from 33.58% to 22.84%, and AUROC improves to 93.97%. Crucially, in real-world scenarios simulating sensor failures, CER exhibits performance far exceeding state-of-the-art methods. As a universal plugin, CER can be seamlessly integrated into various SOTA models to provide performance gains.
67. 【2602.06328】Adaptive and Balanced Re-initialization for Long-timescale Continual Test-time Domain Adaptation
链接:https://arxiv.org/abs/2602.06328
作者:Yanshuo Wang,Jinguang Tong,Jun Lan,Weiqiang Wang,Huijia Zhu,Haoxing Chen,Xuesong Li,Jie Hong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Continual test-time domain, test-time domain adaptation, Continual test-time, aims to adjust, test-time domain
备注: Accepted in ICASSP 2026
点击查看摘要
Abstract:Continual test-time domain adaptation (CTTA) aims to adjust models so that they can perform well over time across non-stationary environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: Can the model adapt to continually changing environments over a long time? In this work, we explore facilitating better CTTA in the long run using a re-initialization (or reset) based method. First, we observe that the long-term performance is associated with the trajectory pattern in label flip. Based on this observed correlation, we propose a simple yet effective policy, Adaptive-and-Balanced Re-initialization (ABR), towards preserving the model's long-term performance. In particular, ABR performs weight re-initialization using adaptive intervals. The adaptive interval is determined based on the change in label flip. The proposed method is validated on extensive CTTA benchmarks, achieving superior performance.
68. 【2602.06300】Accelerating Vision Transformers on Brain Processing Unit
链接:https://arxiv.org/abs/2602.06300
作者:Jinchi Tang,Yan Guo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Brain Processing Units, specialized neural processing, Processing Units, neural processing hardware, Brain Processing
备注:
点击查看摘要
Abstract:With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) have emerged as dedicated platforms for CNN acceleration, offering optimized INT8 computation capabilities for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, due to the architectural mismatch between CNN-optimized hardware and Vision Transformer computation characteristics--namely, that linear layers in Transformers operate on three-dimensional data while BPU acceleration is designed for four-dimensional convolution operations-it is difficult or even impossible to leverage BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8* inference speedup. Our finetuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.
69. 【2602.06288】Unsupervised MRI-US Multimodal Image Registration with Multilevel Correlation Pyramidal Optimization
链接:https://arxiv.org/abs/2602.06288
作者:Jiazheng Wang,Zeyu Liu,Min Liu,Xiang Chen,Hang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Surgical navigation based, multimodal image registration, critical anatomical structures, providing intraoperative guidance, intraoperative multimodal images
备注: first-place method of ReMIND2Reg Learn2Reg 2025 (in MICCAI 2025)
点击查看摘要
Abstract:Surgical navigation based on multimodal image registration has played a significant role in providing intraoperative guidance to surgeons by showing the relative position of the target area to critical anatomical structures during surgery. However, due to the differences between multimodal images and intraoperative image deformation caused by tissue displacement and removal during the surgery, effective registration of preoperative and intraoperative multimodal images faces significant challenges. To address the multimodal image registration challenges in Learn2Reg 2025, an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO) is designed to solve these problems. First, the features of each modality are extracted based on the modality independent neighborhood descriptor, and the multimodal images is mapped to the feature space. Second, a multilevel pyramidal fusion optimization mechanism is designed to achieve global optimization and local detail complementation of the displacement field through dense correlation analysis and weight-balanced coupled convex optimization for input features at different scales. Our method focuses on the ReMIND2Reg task in Learn2Reg 2025. Based on the results, our method achieved the first place in the validation phase and test phase of ReMIND2Reg. The MCPO is also validated on the Resect dataset, achieving an average TRE of 1.798 mm. This demonstrates the broad applicability of our method in preoperative-to-intraoperative image registration. The code is avaliable at this https URL.
70. 【2602.06285】MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training
链接:https://arxiv.org/abs/2602.06285
作者:Lucia Gordon,Serge Belongie,Christian Igel,Nico Lang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Earth observation data, geospatial machine learning, Recent research, Earth observation, learning on Earth
备注:
点击查看摘要
Abstract:Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at this http URL.
71. 【2602.06282】An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes
链接:https://arxiv.org/abs/2602.06282
作者:Marilyn Lionts,Arnhildur Tomasdottir,Viktor I. Agustsson,Yuankai Huo,Hans T. Bjornsson,Lotta M. Ellingsen
类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
关键词:including neurodevelopmental delay, fetal fingertip pads, distinct developmental disorders, share overlapping clinical, persistent fetal fingertip
备注:
点击查看摘要
Abstract:Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance across three binary classification tasks. Across the three classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.
72. 【2602.06251】ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
链接:https://arxiv.org/abs/2602.06251
作者:Aman Anand,Amir Eskandari,Elyas Rahsno,Farhana Zulkernine
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:shown remarkable success, skeleton-based action recognition, leveraging data augmentations, existing SSL methods, Self-supervised learning
备注:
点击查看摘要
Abstract:Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking to learn a full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion, and another that masks low-degree joints and high-motion frames. These masking strategies ensure a more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7-4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
73. 【2602.06226】ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
链接:https://arxiv.org/abs/2602.06226
作者:Yuantao Chen,Jiahao Chang,Chongjie Ye,Chaoran Zhang,Zhaojie Fang,Chenghong Li,Xiaoguang Han
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:videos capturing daily, capturing daily hand-object, daily hand-object interactions, hand-object interactions presents, monocular videos capturing
备注: 14 pages, 7 figures, Page: [this https URL](https://tao-11-chen.github.io/project_pages/ForeHOI/)
点击查看摘要
Abstract:The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of optimization-based methods. The information exchanges between the 2D and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup. Code and data are available at: this https URL.
74. 【2602.06218】Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings
链接:https://arxiv.org/abs/2602.06218
作者:Grégoire Dhimoïla,Thomas Fel,Victor Boutin,Agustin Picard
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:shared embedding space, remains poorly understood, embedding space remains, space remains poorly, align images
备注: Published as a conference paper at ICLR 2026
点击查看摘要
Abstract:Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
75. 【2602.06214】Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models
链接:https://arxiv.org/abs/2602.06214
作者:Jorge Daniel Rodríguez-Vidal,Gabriel Villalonga,Diego Porres,Antonio M. López Peña
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:Autonomous Driving, directly output throttle, steer and brake, output throttle, models that predict
备注: 8 pages, 3 figures
点击查看摘要
Abstract:End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM \texttt{navhard} our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.
76. 【2602.06211】DroneKey++: A Size Prior-free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images
链接:https://arxiv.org/abs/2602.06211
作者:Seo-Bin Hwang,Yeong-Jun Cho
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surveillance systems, essential for security, security and surveillance, Accurate, pose estimation
备注: 8 page, 5 figures, 6 tables, Accepted to ICRA 2026 (to appear)
点击查看摘要
Abstract:Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
77. 【2602.06203】AnyThermal: Towards Learning Universal Representations for Thermal Perception
链接:https://arxiv.org/abs/2602.06203
作者:Parv Maheshwari,Jay Karhade,Yogesh Chawla,Isaiah Adu,Florian Heisen,Andrew Porco,Andrew Jong,Yifei Liu,Santosh Pitla,Sebastian Scherer,Wenshan Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:cross-modal place recognition, captures robust task-agnostic, monocular depth estimation, robust task-agnostic thermal, place recognition
备注: Accepted at IEEE ICRA (International Conference on Robotics Automation) 2026
点击查看摘要
Abstract:We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset - a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.
78. 【2602.06195】DeDPO: Debiased Direct Preference Optimization for Diffusion Models
链接:https://arxiv.org/abs/2602.06195
作者:Khiem Pham,Quang Nguyen,Tung Nguyen,Jingsen Zhu,Michele Santacatterina,Dimitris Metaxas,Ramin Zabih
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Direct Preference Optimization, facilitating off-policy training, explicit reward modeling, Direct Preference, Preference Optimization
备注:
点击查看摘要
Abstract:Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, We propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to the variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
79. 【2602.06184】PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining
链接:https://arxiv.org/abs/2602.06184
作者:Cheng Liang,Chaoyi Wu,Weike Zhao,Ya Zhang,Yanfeng Wang,Weidi Xie
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:CLIP-like vision-language models, Recent progress, greatly advanced medical, large-scale CLIP-like vision-language, vision-language models
备注:
点击查看摘要
Abstract:Recent progress in large-scale CLIP-like vision-language models(VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image--caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85\% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
80. 【2602.06179】Unsupervised Anomaly Detection of Diseases in the Female Pelvis for Real-Time MR Imaging
链接:https://arxiv.org/abs/2602.06179
作者:Anika Knupfer,Johanna P. Müller,Jordina A. Verdera,Martin Fenske,Claudius S. Mathy,Smiti Tripathy,Sebastian Arndt,Matthias May,Michael Uder,Matthias W. Beckmann,Stefanie Burghaus,Jana Hutter
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:global health burden, reproductive age represent, major global health, diagnosis frequently delayed, frequently delayed due
备注: 17 pages, 8 figures
点击查看摘要
Abstract:Pelvic diseases in women of reproductive age represent a major global health burden, with diagnosis frequently delayed due to high anatomical variability, complicating MRI interpretation. Existing AI approaches are largely disease-specific and lack real-time compatibility, limiting generalizability and clinical integration. To address these challenges, we establish a benchmark framework for disease- and parameter-agnostic, real-time-compatible unsupervised anomaly detection in pelvic MRI. The method uses a residual variational autoencoder trained exclusively on healthy sagittal T2-weighted scans acquired across diverse imaging protocols to model normal pelvic anatomy. During inference, reconstruction error heatmaps indicate deviations from learned healthy structure, enabling detection of pathological regions without labeled abnormal data. The model is trained on 294 healthy scans and augmented with diffusion-generated synthetic data to improve robustness. Quantitative evaluation on the publicly available Uterine Myoma MRI Dataset yields an average area-under-the-curve (AUC) value of 0.736, with 0.828 sensitivity and 0.692 specificity. Additional inter-observer clinical evaluation extends analysis to endometrial cancer, endometriosis, and adenomyosis, revealing the influence of anatomical heterogeneity and inter-observer variability on performance interpretation. With a reconstruction time of approximately 92.6 frames per second, the proposed framework establishes a baseline for unsupervised anomaly detection in the female pelvis and supports future integration into real-time MRI. Code is available upon request (this https URL), prospective data sets are available for academic collaboration.
81. 【2602.06166】M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning
链接:https://arxiv.org/abs/2602.06166
作者:Bangji Yang,Ruihan Guo,Jiajun Fan,Chaoran Cheng,Ge Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved impressive fidelity, prompts involving multiple, involving multiple constraints, Generative models, achieved impressive
备注:
点击查看摘要
Abstract:Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce \textbf{M3 (Multi-Modal, Multi-Agent, Multi-Round)}, a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
82. 【2602.06163】MetaSSP: Enhancing Semi-supervised Implicit 3D Reconstruction through Meta-adaptive EMA and SDF-aware Pseudo-label Evaluation
链接:https://arxiv.org/abs/2602.06163
作者:Luoxi Zhang,Chun Xie,Itaru Kitahara
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:reconstruction achieve high-quality, Implicit SDF-based methods, achieve high-quality surfaces, Implicit SDF-based, large labeled datasets
备注:
点击查看摘要
Abstract:Implicit SDF-based methods for single-view 3D reconstruction achieve high-quality surfaces but require large labeled datasets, limiting their scalability. We propose MetaSSP, a novel semi-supervised framework that exploits abundant unlabeled images. Our approach introduces gradient-based parameter importance estimation to regularize adaptive EMA updates and an SDF-aware pseudo-label weighting mechanism combining augmentation consistency with SDF variance. Beginning with a 10% supervised warm-up, the unified pipeline jointly refines labeled and unlabeled data. On the Pix3D benchmark, our method reduces Chamfer Distance by approximately 20.61% and increases IoU by around 24.09% compared to existing semi-supervised baselines, setting a new state of the art.
83. 【2602.06159】Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving
链接:https://arxiv.org/abs/2602.06159
作者:Xuyang Chen,Conglang Zhang,Chuanheng Fu,Zihao Yang,Kaixuan Zhou,Yizhi Zhang,Jianan He,Yanfeng Zhang,Mingwei Sun,Zengmao Wang,Zhen Dong,Xiaoxiao Long,Liqiu Meng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Controllable Video Diffusion, video generation typically, driving video generation, autonomous driving video, generation typically rely
备注: Project website [this https URL](https://albertchen98.github.io/DwD-project/)
点击查看摘要
Abstract:Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Module (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: this https URL
84. 【2602.06158】MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes
链接:https://arxiv.org/abs/2602.06158
作者:Luoxi Zhang,Chun Xie,Itaru Kitahara
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:limited dataset availability, due to noise, challenging due, object diversity, dataset availability
备注: 6 pages. Published in IEEE International Conference on Image Processing (ICIP) 2025
点击查看摘要
Abstract:Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. To address these challenges, we propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB and geometric prior to enhance reconstruction accuracy. The geometric prior is generated by sampling and clustering ground-truth object data, producing class-level features that dynamically adjust during training to improve geometric understanding. Additionally, we introduce a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limitations of traditional linear decoders in processing complex multimodal inputs. Extensive experiments on the Pix3D dataset demonstrate that MGP-KAD achieves state-of-the-art (SOTA) performance, significantly improving geometric integrity, smoothness, and detail preservation. Our work provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes.
85. 【2602.06139】EgoAVU: Egocentric Audio-Visual Understanding
链接:https://arxiv.org/abs/2602.06139
作者:Ashish Seth,Xinhao Mei,Changsheng Zhao,Varun Nagaraja,Ernie Chang,Gregory P. Meyer,Gael Le Lan,Yunyang Xiong,Vikas Chandra,Yangyang Shi,Dinesh Manocha,Zhipeng Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Understanding egocentric videos, Understanding egocentric, embodied intelligence, plays a vital, vital role
备注:
点击查看摘要
Abstract:Understanding egocentric videos plays a vital role for embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they bias heavily toward visual signals, often neglecting audio cues or failing to correspond audio with the visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.
86. 【2602.06136】mpora: Characterising the Time-Contingent Utility of Online Test-Time Adaptation
链接:https://arxiv.org/abs/2602.06136
作者:Sudarshan Sreeram,Young D. Kwon,Cecilia Mascolo
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:improving generalisation, Test-time adaptation, offers a compelling, machine learning, domain shifts
备注: Preprint. Under review. Code available upon acceptance
点击查看摘要
Abstract:Test-time adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisation on-the-fly with only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduce Tempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1) discrete utility for asynchronous streams with hard deadlines, (2) continuous utility for interactive settings where value decays with latency, and (3) amortised utility for budget-constrained deployments. Applying Tempora to seven TTA methods on ImageNet-C across 240 temporal evaluations reveals rank instability: conventional rankings do not predict rankings under temporal pressure; ETA, a state-of-the-art method in the conventional setting, falls short in 41.2% of evaluations. The highest-utility method varies with corruption type and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings invert, offering practitioners a lens for method selection and researchers a target for deployable adaptation.
87. 【2602.06122】From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors
链接:https://arxiv.org/abs/2602.06122
作者:Ding-Jiun Huang,Yuanhao Wang,Shao-Ji Yuan,Albert Mosella-Montoro,Francisco Vicente Carrasco,Cheng Zhang,Fernando De la Torre
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Creating high-fidelity, immersive applications, yield poor, crucial for immersive, prevalence of low-quality
备注: Accepted to 3DV 2026. Project Page: [this https URL](https://humansensinglab.github.io/super-head/)
点击查看摘要
Abstract:Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
88. 【2602.06090】SVRepair: Structured Visual Reasoning for Automated Program Repair
链接:https://arxiv.org/abs/2602.06090
作者:Xiaoxuan Tang,Jincheng Wang,Liwei Luo,Jingxuan Xu,Sheng Zhou,Dajun Chen,Wei Jiang,Yong Li
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large language models, recently shown strong, shown strong potential, existing approaches remain, approaches remain unimodal
备注: 16 pages, 3 figures
点击查看摘要
Abstract:Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR), yet most existing approaches remain unimodal and fail to leverage the rich diagnostic signals contained in visual artifacts such as screenshots and control-flow graphs. In practice, many bug reports convey critical information visually (e.g., layout breakage or missing widgets), but directly using such dense visual inputs often causes context loss and noise, making it difficult for MLLMs to ground visual observations into precise fault localization and executable patches. To bridge this semantic gap, we propose \textbf{SVRepair}, a multimodal APR framework with structured visual representation. SVRepair first fine-tunes a vision-language model, \textbf{Structured Visual Representation (SVR)}, to uniformly transform heterogeneous visual artifacts into a \emph{semantic scene graph} that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. Building on the graph, SVRepair drives a coding agent to localize faults and synthesize patches, and further introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-centered regions to suppress irrelevant context and reduce hallucinations. Extensive experiments across multiple benchmarks demonstrate state-of-the-art performance: SVRepair achieves \textbf{36.47\%} accuracy on SWE-Bench M, \textbf{38.02\%} on MMCode, and \textbf{95.12\%} on CodeVision, validating the effectiveness of SVRepair for multimodal program repair.
89. 【2602.06056】Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space
链接:https://arxiv.org/abs/2602.06056
作者:Zihang Wang,Siyue Zhang,Yilun Zhao,Jingyi Yang,Tingyu Song,Anh Tuan Luu,Chen Zhao
类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language Models, Large Diffusion Language, Large Language Models, Diffusion Language Models, Language Models
备注:
点击查看摘要
Abstract:Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. This progress naturally raises a critical yet unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models? To answer this, we present the first systematic study of converting Multimodal dLLMs into embedding models. We evaluate state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval. Our results show that Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. The stronger diffusion-based model, LaViDa, lags by only 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval tasks, whereas the other diffusion-based model, MMaDA, exhibits substantially larger performance gaps, exceeding 20 points across all tasks. Further analysis reveals insufficient image-text alignment in diffusion-based models, accounting for the observed limitations in their embedding performance.
90. 【2602.06050】Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
链接:https://arxiv.org/abs/2602.06050
作者:Jongha Kim,Byungoh Ko,Jeehye Na,Jinsung Yoon,Hyunwoo J. Kim
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language Models, Large Vision Language, Language Models, Large Vision, Vision Language
备注: WACV 2026
点击查看摘要
Abstract:Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at this https URL.
91. 【2602.05847】OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
链接:https://arxiv.org/abs/2602.05847
作者:Zhangquan Chen,Jiale Tao,Ruihuang Li,Yihao Hu,Ruitao Chen,Zhantao Yang,Xinlei Yu,Haodong Jing,Manyuan Zhang,Shuai Shao,Biao Wang,Qinglin Lu,Ruqi Huang
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:audio-visual understanding tasks, face substantial challenges, existing omnivideo models, existing omnivideo, understanding tasks
备注: 19 pages, 12 figures
点击查看摘要
Abstract:While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
92. 【2602.05243】CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers
链接:https://arxiv.org/abs/2602.05243
作者:Boxiang Zhang,Baijian Yang
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Transformers achieve, incur high compute, Vision Transformers, Transformers achieve strong, Transformers achieve
备注:
点击查看摘要
Abstract:Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose \textbf{CORP}, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8\% Top-1 accuracy after pruning 50\% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
93. 【2410.09771】EUGens: Efficient, Unified, and General Dense Layers
链接:https://arxiv.org/abs/2410.09771
作者:Sang Min Kim,Byeongchan Kim,Arijit Sehanobish,Somnath Basu Roy Chowdhury,Rahul Kidambi,Dongseok Shim,Avinava Dubey,Snigdha Chaturvedi,Min-hwan Oh,Krzysztof Choromanski
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:scaling machine learning, machine learning models, textbf, resource-constrained environments, Fully-connected feedforward layers
备注: Neurips 2025
点击查看摘要
Abstract:Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, \textbf{E}fficient, \textbf{U}nified and \textbf{Gen}eral dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to \textbf{the first} unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EuGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to \textbf{27}\%) and memory efficiency (up to \textbf{30}\%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.
94. 【2602.06761】Orientation-Robust Latent Motion Trajectory Learning for Annotation-free Cardiac Phase Detection in Fetal Echocardiography
链接:https://arxiv.org/abs/2602.06761
作者:Yingyu Yang,Qianye Yang,Can Peng,Elena D'Alberti,Olga Patey,Aris T. Papageorghiou,J.Alison Noble
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:facilitating pregnancy management, optimized delivery planning, timely postnatal interventions, congenital heart disease, detecting congenital heart
备注: Preprint, Submitted to a journal
点击查看摘要
Abstract:Fetal echocardiography is essential for detecting congenital heart disease (CHD), facilitating pregnancy management, optimized delivery planning, and timely postnatal interventions. Among standard imaging planes, the four-chamber (4CH) view provides comprehensive information for CHD diagnosis, where clinicians carefully inspect the end-diastolic (ED) and end-systolic (ES) phases to evaluate cardiac structure and motion. Automated detection of these cardiac phases is thus a critical component toward fully automated CHD analysis. Yet, in the absence of fetal electrocardiography (ECG), manual identification of ED and ES frames remains a labor-intensive bottleneck. We present ORBIT (Orientation-Robust Beat Inference from Trajectories), a self-supervised framework that identifies cardiac phases without manual annotations under various fetal heart orientation. ORBIT employs registration as self-supervision task and learns a latent motion trajectory of cardiac deformation, whose turning points capture transitions between cardiac relaxation and contraction, enabling accurate and orientation-robust localization of ED and ES frames across diverse fetal positions. Trained exclusively on normal fetal echocardiography videos, ORBIT achieves consistent performance on both normal (MAE = 1.9 frames for ED and 1.6 for ES) and CHD cases (MAE = 2.4 frames for ED and 2.1 for ES), outperforming existing annotation-free approaches constrained by fixed orientation assumptions. These results highlight the potential of ORBIT to facilitate robust cardiac phase detection directly from 4CH fetal echocardiography.
95. 【2602.06350】AS-Mamba: Asymmetric Self-Guided Mamba Decoupled Iterative Network for Metal Artifact Reduction
链接:https://arxiv.org/abs/2602.06350
作者:Bowen Ning,Zekun Zhou,Xinyi Zhong,Zhongzhen Wang,HongXin Wu,HaiTao Wang,Liu Shi,Qiegen Liu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:degrades Computed Tomography, significantly degrades Computed, Computed Tomography, degrades Computed, artifact significantly degrades
备注: 10 pages,10 figures
点击查看摘要
Abstract:Metal artifact significantly degrades Computed Tomography (CT) image quality, impeding accurate clinical diagnosis. However, existing deep learning approaches, such as CNN and Transformer, often fail to explicitly capture the directional geometric features of artifacts, leading to compromised structural restoration. To address these limitations, we propose the Asymmetric Self-Guided Mamba (AS-Mamba) for metal artifact reduction. Specifically, the linear propagation of metal-induced streak artifacts aligns well with the sequential modeling capability of State Space Models (SSMs). Consequently, the Mamba architecture is leveraged to explicitly capture and suppress these directional artifacts. Simultaneously, a frequency domain correction mechanism is incorporated to rectify the global amplitude spectrum, thereby mitigating intensity inhomogeneity caused by beam hardening. Furthermore, to bridge the distribution gap across diverse clinical scenarios, we introduce a self-guided contrastive regularization strategy. Extensive experiments on public andclinical dental CBCT datasets demonstrate that AS-Mamba achieves superior performance in suppressing directional streaks and preserving structural details, validating the effectiveness of integrating physical geometric priors into deep network design.
96. 【2602.06292】Zero-shot Multi-Contrast Brain MRI Registration by Intensity Randomizing T1-weighted MRI (LUMIR25)
链接:https://arxiv.org/abs/2602.06292
作者:Hengjie Liu,Yimeng Dou,Di Xu,Xinyi Fu,Dan Ruan,Ke Sheng
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:summarize the methods, methods and results, test set, MRI, high-field MRI
备注: Submitted to and reviewed by Learn2Reg MICCAI 2025
点击查看摘要
Abstract:In this paper, we summarize the methods and results of our submission to the LUMIR25 challenge in Learn2Reg 2025, which achieved 1st place overall on the test set. Extended from LUMIR24, this year's task focuses on zero-shot registration under domain shifts (high-field MRI, pathological brains, and various MRI contrasts), while the training data comprise only in-domain T1-weighted brain MRI. We start with a meticulous analysis of LUMIR24 winners to identify the main contributors to good monomodal registration performance. To achieve good generalization with diverse contrasts from a model trained with T1-weighted MRI only, we employ three simple but effective strategies: (i) a multimodal loss based on the modality-independent neighborhood descriptor (MIND), (ii) intensity randomization for appearance augmentation, and (iii) lightweight instance-specific optimization (ISO) on feature encoders at inference time. On the validation set, our approach achieves reasonable T1-T2 registration accuracy while maintaining good deformation regularity.
97. 【2602.06101】ALIEN: Analytic Latent Watermarking for Controllable Generation
链接:https://arxiv.org/abs/2602.06101
作者:Liangqi Lei,Keke Gai,Jing Yu,Qi Wu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:safeguarding intellectual property, underline, reducing misuse, technical alternative, alternative to safeguarding
备注:
点击查看摘要
Abstract:Watermarking is a technical alternative to safeguarding intellectual property and reducing misuse. Existing methods focus on optimizing watermarked latent variables to balance watermark robustness and fidelity, as Latent diffusion models (LDMs) are considered a powerful tool for generative tasks. However, reliance on computationally intensive heuristic optimization for iterative signal refinement results in high training overhead and local optima this http URL address these issues, we propose an \underline{A}na\underline{l}ytical Watermark\underline{i}ng Framework for Controllabl\underline{e} Generatio\underline{n} (ALIEN). We develop the first analytical derivation of the time-dependent modulation coefficient that guides the diffusion of watermark residuals to achieve controllable watermark embedding this http URL results show that ALIEN-Q outperforms the state-of-the-art by 33.1\% across 5 quality metrics, and ALIEN-R demonstrates 14.0\% improved robustness against generative variant and stability threats compared to the state-of-the-art across 15 distinct conditions. Code can be available at this https URL.
98. 【2602.06044】COSMOS: Coherent Supergaussian Modeling with Spatial Priors for Sparse-View 3D Splatting
链接:https://arxiv.org/abs/2602.06044
作者:Chaeyoung Jeong,Kwangsu Kim
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:real time rendering, high-quality real time, Gaussian Splatting, enabling high-quality real, providing explicit
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a promising approach for 3D reconstruction, providing explicit, point-based representations and enabling high-quality real time rendering. However, when trained with sparse input views, 3DGS suffers from overfitting and structural degradation, leading to poor generalization on novel views. This limitation arises from its optimization relying solely on photometric loss without incorporating any 3D structure priors. To address this issue, we propose Coherent supergaussian Modeling with Spatial Priors (COSMOS). Inspired by the concept of superpoints from 3D segmentation, COSMOS introduces 3D structure priors by newly defining supergaussian groupings of Gaussians based on local geometric cues and appearance features. To this end, COSMOS applies inter group global self-attention across supergaussian groups and sparse local attention among individual Gaussians, enabling the integration of global and local spatial information. These structure-aware features are then used for predicting Gaussian attributes, facilitating more consistent 3D reconstructions. Furthermore, by leveraging supergaussian-based grouping, COSMOS enforces an intra-group positional regularization to maintain structural coherence and suppress floaters, thereby enhancing training stability under sparse-view conditions. Our experiments on Blender and DTU show that COSMOS surpasses state-of-the-art methods in sparse-view settings without any external depth supervision.




