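This blog post presents the daily list of the latest papers retrieved from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.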

Statistics

A total of 545 papers were updated today, including:

  • Natural language processing: 83
  • Information retrieval: 16
  • Computer vision: 106

Natural Language Processing

1. 【2604.16278】Learning to Reason with Insight for Informal Theorem Proving

Link: https://arxiv.org/abs/2604.16278

Authors: Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, natural language processing, automated theorem-proving approaches, theorem-proving approaches depend, informal theorem proving

Abstract:Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose DeepInsightTheorem, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.

2. 【2604.16275】No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Link: https://arxiv.org/abs/2604.16275

Authors: Hitesh Mehta, Arjit Saxena, Garima Chhikara, Rohit Kumar

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, politeness, paper explores, response of Large

Abstract:This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensitive (11.5% range), whereas GPT is more robust to adversarial tone. These results indicate that politeness is a quantifiable computational variable that affects LLM behaviour, though its impact is language- and model-dependent rather than universal. To support reproducibility and future work, we additionally release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts across three languages and five politeness categories, and provide a formal supplementary analysis of six falsifiable hypotheses derived from politeness theory, empirically assessed against the dataset.

3. 【2604.16272】VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Link: https://arxiv.org/abs/2604.16272

Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: meet professional requirements, AI-assisted video creation, editing, increasingly practical, professional requirements

Abstract:As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.

4. 【2604.16270】From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

Link: https://arxiv.org/abs/2604.16270

Authors: Van-Truong Le

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Vietnam legal texts, complexity of Vietnam, Large Language Models, legal texts presents, Vietnam legal

Comments: 7 pages, 2 figures. Accepted at the FISU Joint Conference on Artificial Intelligence (FJCAI 2026), Vietnam

Abstract:The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints Incorrect Example and Misinterpretation as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.

5. 【2604.16262】SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

Link: https://arxiv.org/abs/2604.16262

Authors: Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Understanding, improved Natural Language, substantially improved Natural, Language Understanding, Recent advances

Comments: 6 pages, 5 Tables, 1 figure, Accepted to SemEval 2026

Abstract:Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.

6. 【2604.16256】Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Link: https://arxiv.org/abs/2604.16256

Authors: Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: recently attracted significant, attracted significant attention, significant attention due, diverse downstream tasks, vision-language models

Abstract:Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at this https URL.

7. 【2604.16242】Detecting and Suppressing Reward Hacking with Gradient Fingerprints

Link: https://arxiv.org/abs/2604.16242

Authors: Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, Greg Durrett, Xi Ye

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Reinforcement learning, reward hacking, typically optimizes, optimizes for outcome, imposing constraints

Abstract:Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: this https URL.

8. 【2604.16241】BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Link: https://arxiv.org/abs/2604.16241

Authors: Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron, David Robinson, Emmanuel Chemla, Sara Keen, Gagan Narula, Mathieu Laurière, Matthieu Geist, Olivier Pietquin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, shown strong performance, Large language, handle specialized animal-related, models handle specialized

Comments: 28 pages, 3 figures

Abstract:Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.

9. 【2604.16235】Optimizing Korean-Centric LLMs via Token Pruning

Link: https://arxiv.org/abs/2604.16235

Authors: Hoyeol Kim, Hyeonwoo Kim

Categories: Computation and Language (cs.CL)

Keywords: multilingual large language, large language models, multilingual large, target application, paper presents

Comments: 5 pages

Abstract:This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in inference latency.
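As a rough illustration of the token pruning idea described in this abstract, the Python sketch below drops vocabulary entries outside a target language set and remaps the surviving token ids. It is not the authors' implementation: the function names (`prune_vocabulary`, `is_english_or_korean`), the toy vocabulary, and the Unicode-range filter are assumptions made purely for illustration.

```python
def prune_vocabulary(vocab, embeddings, keep_token):
    """Drop tokens rejected by keep_token and remap the surviving ids.

    vocab: dict mapping token string -> old id
    embeddings: list of embedding rows indexed by old id
    keep_token: hypothetical predicate deciding which tokens survive
    """
    new_vocab, new_embeddings, old_to_new = {}, [], {}
    # Walk tokens in old-id order so that relative ordering is preserved.
    for token, old_id in sorted(vocab.items(), key=lambda item: item[1]):
        if keep_token(token):
            new_id = len(new_embeddings)
            new_vocab[token] = new_id
            old_to_new[old_id] = new_id
            new_embeddings.append(embeddings[old_id])
    return new_vocab, new_embeddings, old_to_new


def is_english_or_korean(token):
    # Assumed filter: keep ASCII characters and Hangul syllables (U+AC00..U+D7A3).
    return all(ch.isascii() or "\uac00" <= ch <= "\ud7a3" for ch in token)


# Toy data: four tokens with two-dimensional embedding rows.
vocab = {"hello": 0, "안녕": 1, "你好": 2, "world": 3}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

new_vocab, new_embeddings, old_to_new = prune_vocabulary(
    vocab, embeddings, is_english_or_korean
)
```

In this toy setup the Chinese token and its embedding row are removed, which mirrors the English-Korean (EnKo) configuration only in spirit; a real tokenizer would also need its merge rules and output projection adjusted.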

10. 【2604.16217】Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Link: https://arxiv.org/abs/2604.16217

Authors: Yanli Wang, Peng Kuang, Xiaoyu Han, Kaidi Xu, Haohan Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, deployment mismatch, output-level uncertainty signals, token probabilities

Abstract:Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration--deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity--efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribution shift.
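The "standard split conformal pipeline" mentioned in this abstract can be sketched in a few lines of Python. This is a generic illustration, not the paper's code: the calibration scores below are made-up placeholders standing in for whatever nonconformity score is used (the paper proposes Layer-Wise Information scores), and the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal threshold for a list of nonconformity scores."""
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every candidate is kept.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose score does not exceed the threshold."""
    return [answer for answer, score in candidate_scores.items() if score <= threshold]


# Placeholder calibration scores (assumed values, 11 held-out examples).
calibration = [0.10, 0.35, 0.20, 0.80, 0.55, 0.40, 0.15, 0.60, 0.25, 0.70, 0.45]
tau = conformal_threshold(calibration, alpha=0.25)

# Placeholder scores for three candidate answers to one test question.
answers = prediction_set({"A": 0.12, "B": 0.58, "C": 0.91}, tau)
```

Under exchangeability between calibration and test data, sets built this way contain the correct answer with probability at least 1 - alpha; the quality of the score only affects how small the sets are.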

11. 【2604.16171】JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

Link: https://arxiv.org/abs/2604.16171

Authors: Alexandra Dragomir, Ioana Pintilie, Antonio Barbalau, Marius Dragoi, Florin Brad, Cristian Daniel Paduraru, Alexandru Tifrea, Elena Burceanu, Radu Tudor Ionescu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, low-rank update matrix, continual learning

Abstract:Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.
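For readers unfamiliar with JumpReLU, the sketch below shows the standard activation only: a value passes through unchanged if it exceeds a threshold and is zeroed otherwise. How JumpLoRA applies this gate inside the LoRA blocks is not specified in the abstract, so the input values and the fixed threshold here are assumptions for illustration (in the usual JumpReLU formulation the threshold is typically a learnable parameter).

```python
def jump_relu(values, threshold):
    """Standard JumpReLU: keep a value only if it is strictly above the threshold."""
    return [v if v > threshold else 0.0 for v in values]


# Hypothetical pre-gate values; the threshold is fixed here for simplicity.
pre_gate = [0.05, 0.40, -0.30, 0.90]
gated = jump_relu(pre_gate, threshold=0.25)

# Fraction of entries switched off by the gate.
sparsity = gated.count(0.0) / len(gated)
```

Unlike plain ReLU, values between zero and the threshold are also suppressed, which is what induces additional sparsity.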

12. 【2604.16158】AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Link: https://arxiv.org/abs/2604.16158

Authors: Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, solve complex tasks, Large language, increasingly rely, complex tasks

Comments: 14 pages, 8 figures, 1 table

Abstract:Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.

13. 【2604.16146】On the Rejection Criterion for Proxy-based Test-time Alignment

Link: https://arxiv.org/abs/2604.16146

Authors: Ayoub Hammal, Pierre Zweigenbaum, Caio Corro

Categories: Computation and Language (cs.CL)

Keywords: Recent works proposed, proposed test-time alignment, test-time alignment methods, small aligned model, works proposed test-time

Comments: ACL 2026 Main

Abstract:Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
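The nudging baseline described in this abstract (defer to the small aligned model when the large base model is unconfident) can be sketched as follows. This illustrates only that baseline confidence criterion, not the conservative confidence bet the paper proposes. Measuring confidence as the top-1 probability, decoding greedily, and the 0.5 threshold are all assumptions, and `nudged_next_token` is a hypothetical name.

```python
def nudged_next_token(base_probs, proxy_probs, threshold=0.5):
    """Pick the next token, deferring to the proxy when the base model is unsure.

    base_probs / proxy_probs: dicts mapping candidate token -> probability.
    Returns the chosen token and which model produced it.
    """
    base_token = max(base_probs, key=base_probs.get)
    if base_probs[base_token] >= threshold:
        # The base model is confident enough: accept its proposal.
        return base_token, "base"
    # Otherwise reject the proposal and let the small aligned model decide.
    proxy_token = max(proxy_probs, key=proxy_probs.get)
    return proxy_token, "proxy"


confident = nudged_next_token({"yes": 0.9, "no": 0.1}, {"no": 0.7, "yes": 0.3})
unsure = nudged_next_token(
    {"yes": 0.40, "no": 0.35, "maybe": 0.25}, {"no": 0.7, "yes": 0.3}
)
```

The second call shows the weakness the abstract points at: a base model can spread probability over several acceptable phrasings and still be "unconfident" by this criterion.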

14. 【2604.16138】Sentiment Analysis of German Sign Language Fairy Tales

Link: https://arxiv.org/abs/2604.16138

Authors: Fabrizio Nunnari, Siddhant Jain, Patrick Gebhard

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: German fairy tales, German sign language, fairy tales, present a dataset, German fairy

Abstract:We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy-tale text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 Krippendorff's alpha. Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of the hips, elbows, and shoulders also contributes considerably to discriminating the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.
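The majority-voting step over four LLM annotators can be illustrated with a short Python sketch. The abstract does not say how ties are resolved, so falling back to "neutral" on a tie is an assumption made here, as are the function name and the example labels.

```python
from collections import Counter


def majority_vote(labels, tie_label="neutral"):
    """Return the most frequent label, or tie_label when the top two counts match."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Assumed tie-break rule; not taken from the paper.
        return tie_label
    return counts[0][0]


# Hypothetical labels from four models for two text segments.
clear_case = majority_vote(["positive", "positive", "neutral", "negative"])
tied_case = majority_vote(["positive", "positive", "negative", "negative"])
```

With an even number of annotators, ties such as the second case are possible, so some explicit rule is needed in practice.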

15. 【2604.16132】Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

Link: https://arxiv.org/abs/2604.16132

Authors: Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner, Michel Cukier, Joseph B. Richardson Jr

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: public health issue, pressing public health, community firearm violence, survivors' lived experiences, Firearm violence

Comments: Accepted to Findings of the Association for Computational Linguistics (2026)

Abstract:Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.

16. 【2604.16058】LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning

Link: https://arxiv.org/abs/2604.16058

Authors: Mahir Labib Dihan, Abir Muhtasim

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, made distinguishing AI-generated, proliferation of Large, code quality assurance

Abstract:The rapid proliferation of Large Language Models (LLMs) in software development has made distinguishing AI-generated code from human-written code a critical challenge with implications for academic integrity, code quality assurance, and software security. We present LLMSniffer, a detection framework that fine-tunes GraphCodeBERT using a two-stage supervised contrastive learning pipeline augmented with comment removal preprocessing and an MLP classifier. Evaluated on two benchmark datasets - GPTSniffer and Whodunit - LLMSniffer achieves substantial improvements over prior baselines: accuracy increases from 70% to 78% on GPTSniffer (F1: 68% to 78%) and from 91% to 94.65% on Whodunit (F1: 91% to 94.64%). t-SNE visualizations confirm that contrastive fine-tuning yields well-separated, compact embeddings. We release our model checkpoints, datasets, codes and a live interactive demo to facilitate further research.

17. 【2604.16042】Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

Link: https://arxiv.org/abs/2604.16042

Authors: Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, achieved strong performance, opaque internal mechanisms, internal mechanisms hinder

Comments: Accepted to the Main Conference of ACL 2026. 14 pages, 4 figures, 1 table

Abstract:While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: this https URL.

18. 【2604.16037】Stochasticity in Tokenisation Improves Robustness

Link: https://arxiv.org/abs/2604.16037

Authors: Sophie Steger, Rui Li, Sofiane Ennadir, Anya Sims, Arno Solin, Franz Pernkopf, Martin Trapp

Categories: Computation and Language (cs.CL)

Keywords: large language models, widespread adoption, adoption of large, large language, increased concerns

Abstract:The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerabilities in perturbations of tokenisation of the input indicate that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.
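To make "uniformly sampled stochastic tokenisations" concrete, the sketch below enumerates every way to split a short string into tokens from a toy vocabulary and then picks one segmentation uniformly at random. This is a brute-force illustration under assumed names and data (`all_tokenisations`, `sample_tokenisation`, the seven-token vocabulary), not the sampling procedure used in the paper; exhaustive enumeration is only feasible for short inputs.

```python
import random


def all_tokenisations(text, vocab):
    """Enumerate every segmentation of text into tokens drawn from vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Extend this prefix with every segmentation of the remainder.
            for rest in all_tokenisations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def sample_tokenisation(text, vocab, rng):
    """Draw one segmentation uniformly at random from all valid ones."""
    return rng.choice(all_tokenisations(text, vocab))


# Toy vocabulary (assumed); the single-token split plays the role of the canonical one.
vocab = {"u", "n", "d", "o", "un", "do", "undo"}
segmentations = all_tokenisations("undo", vocab)
sampled = sample_tokenisation("undo", vocab, random.Random(0))
```

Every segmentation decodes to the same string, so a model trained only on the canonical split sees the others as unfamiliar inputs, which is the brittleness the abstract describes.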

19. 【2604.16029】Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

Link: https://arxiv.org/abs/2604.16029

Authors: Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Parallel reasoning enhances, enhances Large Reasoning, reasoning enhances Large, Large Reasoning Models, Parallel reasoning

Comments: 9 pages, 7 figures

Abstract:Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at this https URL.

20. 【2604.16027】Where does output diversity collapse in post-training?

Link: https://arxiv.org/abs/2604.16027

Authors: Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Post-trained language models, Post-trained language, language models produce, base counterparts, diversity

备注

点击查看摘要

Abstract:Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.

21. 【2604.16004】AgentV-RL: Scaling Reward Modeling with Agentic Verifier

链接https://arxiv.org/abs/2604.16004

作者:Jiazheng Zhang,Ziche Fu,Zhiheng Xi,Wenqing Jing,Mingxu Chai,Wei He,Guoqiang Zhang,Chenghao Fan,Chenxin An,Wenxiang Chen,Zhicheng Liu,Haojie Pan,Dingwei Zhu,Tao Gui,Qi Zhang,Xuanjing Huang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:enhance LLM reasoning, enhance LLM, test-time scaling, LLM reasoning, demonstrated to enhance

备注: ACL 2026

点击查看摘要

Abstract:Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.

22. 【2604.15998】SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

链接https://arxiv.org/abs/2604.15998

作者:Ke Xiong,Qian Wu,Wangjie Gan,Yuke Li,Xuhong Zhang

类目:Computation and Language (cs.CL)

关键词:Hierarchical Text Classification, involves mapping texts, Text Classification, predefined tree-structured label, Few-shot Hierarchical Text

备注: 5 pages, 3 figures, ICASSP 2026

点击查看摘要

Abstract:Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck, the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model's perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Our code is available at this https URL.

23. 【2604.15972】Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

链接https://arxiv.org/abs/2604.15972

作者:Haoyu Bian,Chaoning Zhang,Jiaquan Zhang,Xingyao Li,Yuanfang Guo,Wei Dong,Yang Yang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:multi-role collaboration, LLM-driven multi-agent frameworks, frameworks address complex, address complex reasoning

备注: 13 pages, 4 figures. Submitted to CAAI Transactions on Intelligence Technology

点击查看摘要

Abstract:LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.

24. 【2604.15958】A Case Study on the Impact of Anonymization Along the RAG Pipeline

链接https://arxiv.org/abs/2604.15958

作者:Andreea-Elena Bodea,Stephen Meisenbacher,Florian Matthes

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:end user requesting, exposing private information, create privacy concerns, Retrieval-Augmented Generation, requesting the response

备注: 7 pages, 1 figure, 6 tables. Accepted to IWSPA 2026

点击查看摘要

Abstract:Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.

25. 【2604.15945】RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

链接https://arxiv.org/abs/2604.15945

作者:Fabian Ridder,Laurin Lessel,Malte Schilling

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, input to Large, Large Language, external information, domain-specific knowledge

备注: accepted at IJCNN 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.

26. 【2604.15937】Polarization by Default: Auditing Recommendation Bias in LLM-Based Content Curation

链接https://arxiv.org/abs/2604.15937

作者:Nicolò Pagan,Christopher Barrie,Chris Andrew Bail,Petter Törnberg

类目:Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

关键词:Large Language Models, Large Language, Language Models, remains poorly understood, tasks remains poorly

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remains poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (general, popular, engaging, informative, controversial, neutral). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. On Twitter/X, where author demographics can be inferred from profile bios, political leaning bias is the clearest demographic signal: left-leaning authors are systematically over-represented despite right-leaning authors forming the pool plurality in the dataset, and this pattern largely persists across prompts.

27. 【2604.15929】MUSCAT: MUltilingual, SCientific ConversATion Benchmark

链接https://arxiv.org/abs/2604.15929

作者:Supriti Sinhamahapatra,Thai-Binh Nguyen,Yiğit Oğuz,Enes Ugan,Jan Niehues,Alexander Waibel

类目:Computation and Language (cs.CL)

关键词:facilitate seamless communication, facilitate seamless, seamless communication, communication between individuals, individuals speaking

备注

点击查看摘要

Abstract:The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: Handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate current Automatic Speech Recognition (ASR) systems, whether they are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework, beyond Word Error Rate (WER) enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available in this https URL. Keywords: multilingual, speech recognition, audio segmentation, speaker diarization

28. 【2604.15882】JFinTEB: Japanese Financial Text Embedding Benchmark

链接https://arxiv.org/abs/2604.15882

作者:Masahiro Suzuki,Hiroki Sakaji

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Japanese financial text, evaluating Japanese financial, Japanese financial, financial text, comprehensive benchmark specifically

备注: 5 pages. Accepted at SIGIR 2026 Resource Track

点击查看摘要

Abstract:We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at this https URL to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and establishes a foundation for advancing domain-specific embedding research.

29. 【2604.15877】Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

链接https://arxiv.org/abs/2604.15877

作者:Xing Zhang,Guanghui Wang,Yanwei Cui,Wei Qiu,Ziyuan Li,Bing Zhu,Peiyang He

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:LLM agents scale, efficiently managing accumulated, managing accumulated experience, multi-session deployments, LLM agents

备注

点击查看摘要

Abstract:As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge -- extracting reusable knowledge from interaction traces -- yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5--20$\times$ for episodic memory, 50--500$\times$ for procedural skills, 1,000$\times$+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level -- none supports adaptive cross-level compression, a gap we term the "missing diagonal". We further show that specialization alone is insufficient -- both communities independently solve shared sub-problems without exchanging solutions -- that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full-spectrum agent learning systems.

30. 【2604.15873】How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

链接https://arxiv.org/abs/2604.15873

作者:Judith Sieker,Sina Zarrieß

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, increasingly studied, studied as repositories, linguistic knowledge

备注: Accepted at ACL 2026 (findings)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.

31. 【2604.15866】DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

链接https://arxiv.org/abs/2604.15866

作者:Siun Kim,Hyung-Jin Yoon

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:advanced information extraction, named entity recognition, Large language models, Large language, few-shot named entity

备注: 9 pages, 3 figures; Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.

32. 【2604.15847】CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

链接https://arxiv.org/abs/2604.15847

作者:Junyi Li,Yongqiang Chen,Ningning Ding

类目:Computation and Language (cs.CL)

关键词:Large Language Models, gained increasing attention, selectively remove unwanted, remove unwanted privacy, Large Reasoning Models

备注: Accepted by ACL 2026 Main Conference

点击查看摘要

Abstract:Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.

33. 【2604.15842】Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

链接https://arxiv.org/abs/2604.15842

作者:Tanja Baeumel,Josef van Genabith,Simon Ostermann

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, tasks remain underexplored, handling reasoning-intensive tasks, reasoning-intensive tasks remain

备注: MathNLP 2025

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.

34. 【2604.15841】Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language

链接https://arxiv.org/abs/2604.15841

作者:Dianqing Lin,Tian Lan,Jiali Zhu,Jiang Li,Wei Chen,Xu Liu,Aruukhan,Xiangdong Su,Hongxu Hou,Guanglai Gao

类目:Computation and Language (cs.CL)

关键词:remains largely unexplored, Chinese internet context, achieved remarkable success, large language models, representative subcultural language

备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP tasks involving Chouxiang Language across six tasks. Experimental results show that, current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. In addition, we further discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.

35. 【2604.15840】CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

链接https://arxiv.org/abs/2604.15840

作者:Shidong Yang,Ziyu Ma,Tongwen Huang,Yiming Hu,Yong Wang,Xiangxiang Chu

类目:Computation and Language (cs.CL)

关键词:agent evolving behavior, Reinforcement learning, enables LLM agents, static data distribution, LLM agents

备注: Accepted to ACL 2026

点击查看摘要

Abstract:Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.

36. 【2604.15839】Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

链接https://arxiv.org/abs/2604.15839

作者:Chengwu Liu,Yichun Yin,Ye Yuan,Jiaxuan Xie,Botao Li,Siqi Li,Jianhao Shen,Yan Xu,Lifeng Shang,Ming Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

关键词:Hard Mode, human competitors face, Hard Mode benchmarks, Hard Mode statements, Easy Mode

备注: ACL 2026 Main Conference

点击查看摘要

Abstract:Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.

37. 【2604.15827】UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

链接https://arxiv.org/abs/2604.15827

作者:Tobias Schimanski,Stefanie Lewandowski,Christian Woerle,Nicola Reichenau,Yauheni Huryn,Markus Leippold

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:concerned with identifying, Conventional information retrieval, query, Conventional, information retrieval

备注

点击查看摘要

Abstract:Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.

38. 【2604.15804】Qwen3.5-Omni Technical Report

链接https://arxiv.org/abs/2604.15804

作者:Qwen Team

类目:Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:Qwen-Omni model family, latest advancement, model family, Qwen-Omni model, audio-visual

备注

点击查看摘要

Abstract:In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.

39. 【2604.15802】CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents

Link: https://arxiv.org/abs/2604.15802

Authors: Hyunseok Park, Jihyeon Kim, Jongeun Kim, Dongsik Yoon

Categories: Computation and Language (cs.CL)

Keywords: causing unnecessary information, Retrieval-Augmented Generation, systems lose retrieval, lose retrieval accuracy, Large Language Models

Abstract:Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.

40. 【2604.15800】From Intention to Text: AI-Supported Goal Setting in Academic Writing

Link: https://arxiv.org/abs/2604.15800

Authors: Yueling Fan, Richard Lee Davis, Olga Viberg

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: voice-based writing assistant, writing assistant designed, study presents WriteFlow, assistant designed, academic writing

Comments: Accepted at AIED 2026

Abstract:This study presents WriteFlow, an AI voice-based writing assistant designed to support reflective academic writing through goal-oriented interaction. Academic writing involves iterative reflection and evolving goal regulation, yet prior research and a formative study with 17 participants show that writers often struggle to articulate and manage changing goals. While commonly used AI writing tools emphasize efficiency, they offer limited support for metacognition and writer agency. WriteFlow frames AI interaction as a dialogic space for ongoing goal articulation, monitoring, and negotiation grounded in writers' intentions. Findings from a Wizard-of-Oz study with 12 expert users show that WriteFlow scaffolds metacognitive regulation and reflection-in-action by supporting iterative goal refinement, maintaining goal-text alignment during drafting, and prompting evaluation of goal fulfillment. We discuss design implications for AI writing systems that prioritize reflective dialogue, flexible goal structures, and multi-perspective feedback to support intentional and agentic writing.

41. 【2604.15794】Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

Link: https://arxiv.org/abs/2604.15794

Authors: Chi Liu, Xin Chen, Xu Zhou, Fangbo Tu, Srinivasan Manoharan

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, remarkable success

Comments: 14 pages, 8 figures

Abstract:Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM's generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student's high-dimensional manifold with the optimal structure represented by the teacher. This study bridges the gap between practical recovery frameworks and geometric representation theory, offering new insights into the internal mechanisms of self-distillation.
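
The abstract relies on Centered Kernel Alignment and its invariance to orthogonal transformations and scaling. As a rough illustration only, here is a minimal pure-Python sketch of the standard linear CKA formula; the function names and toy matrices are assumptions for this example and are not taken from the paper, which may use a different CKA variant.

```python
import math


def center_columns(matrix):
    """Subtract the column mean from every entry (rows = samples)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(
        cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc)
    )
    return numerator / denominator


# Toy activations (4 samples, 2 features); values are made up.
teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Student = teacher rotated by 90 degrees and scaled by 3.
student = [[-3.0 * b, 3.0 * a] for a, b in teacher]
score = linear_cka(teacher, student)
```

Because the student matrix is only a rotated and rescaled copy of the teacher matrix, the score should come out at 1 up to floating-point error, which is the invariance property the abstract mentions.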

42. 【2604.15789】A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

Link: https://arxiv.org/abs/2604.15789

Authors: Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, receive increasing attention, drawn significant attention, producing unsupported claims, including generating harmful

Abstract:As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs without the need for additional training.

43. 【2604.15780】Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

Link: https://arxiv.org/abs/2604.15780

Authors: Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Machine learning models, Machine learning, Mistral and LLaVA, real-world applications, inherited from pre-training

Abstract:Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.

44. 【2604.15776】PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Link: https://arxiv.org/abs/2604.15776

Authors: Pritesh Jha

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Personally Identifiable Information, Identifiable Information, Personally Identifiable, natural language text, PII entity types

Abstract:We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at this https URL.
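
The normalization step described above (mapping source-specific label variants onto one BIO scheme) can be pictured with a small sketch. Everything named here is hypothetical: the `LABEL_MAP` entries, the canonical type names, and the choice to send unmapped labels to `O` are assumptions for illustration, not the PIIBench pipeline.

```python
# Hypothetical mapping from source-specific labels to canonical PII types.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "MAIL": "EMAIL",
    "EMAIL_ADDRESS": "EMAIL",
}


def normalize_tag(tag, label_map=LABEL_MAP):
    """Map one BIO tag to the canonical scheme; unmapped labels become 'O'."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    if prefix not in ("B", "I") or not label:
        return "O"
    canonical = label_map.get(label.upper())
    return f"{prefix}-{canonical}" if canonical else "O"


def normalize_sequence(tags, label_map=LABEL_MAP):
    """Normalize every tag in a token-level tag sequence."""
    return [normalize_tag(tag, label_map) for tag in tags]


normalized = normalize_sequence(["B-PER", "I-PER", "O", "B-MAIL", "B-ZIP"])
```

In this sketch a label with no canonical counterpart (`ZIP` here) is simply suppressed, loosely mirroring the suppression of near-absent entity types mentioned in the abstract; a real pipeline would base that decision on label frequencies.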

45. 【2604.15774】MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Link: https://arxiv.org/abs/2604.15774

Authors: Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren

Categories: Computation and Language (cs.CL)

Keywords: Equipping Large Language, Large Language Models, Equipping Large, Large Language, persistent memory enhances

Abstract:Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

46. 【2604.15771】Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

Link: https://arxiv.org/abs/2604.15771

Authors: Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han, Jingcheng Niu, Fan Yang

Categories: Computation and Language (cs.CL)

Keywords: grounding large language, large language models, external knowledge, foundational paradigm, paradigm for grounding

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.

47. 【2604.15756】TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

Link: https://arxiv.org/abs/2604.15756

Authors: Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu, Jiaxin Zhuang, Zhiyong Gan, Ruixuan Wang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: CLIP exhibit strong, Vision-language models, CLIP exhibit, OOD, external OOD labels

Comments: Accepted to CVPR 2026

Abstract:Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation while suppressing noise. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at this https URL.

48. 【2604.15744】Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

Link: https://arxiv.org/abs/2604.15744

Authors: Sidney Wong

Categories: Computation and Language (cs.CL)

Keywords: Zealand-related Reddit communities, Zealand-related Reddit, thesis investigates geographic, investigates geographic dialect, Reddit communities

Comments: PhD thesis

Abstract:This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

49. 【2604.15741】Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

Link: https://arxiv.org/abs/2604.15741

Authors: Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Anh Tuan Luu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: promising approach, approach to detect, Uncertainty estimation, large language models, SIVR

Comments: Accepted at ACL 2026 (Main Conference)

Abstract:Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at this https URL.
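
To make the idea of "dispersion of internal representations across layers" concrete, here is a deliberately simplified sketch. It assumes hidden states are available as a nested list indexed `[layer][token][dim]` and computes, per token, the population variance across layers averaged over dimensions. The function name and the aggregation are assumptions for illustration; the actual SIVR features and the supervised detector trained on them are not reproduced here.

```python
from statistics import fmean, pvariance


def per_token_layer_variance(hidden):
    """hidden[layer][token][dim] -> one dispersion value per token.

    For each token and dimension, take the population variance of the
    activation across layers, then average over dimensions.
    """
    n_tokens = len(hidden[0])
    n_dims = len(hidden[0][0])
    features = []
    for t in range(n_tokens):
        per_dim = [
            pvariance([layer[t][d] for layer in hidden])
            for d in range(n_dims)
        ]
        features.append(fmean(per_dim))
    return features


# Toy hidden states: 2 layers, 2 tokens, 2 dimensions (made-up values).
toy_hidden = [
    [[0.0, 0.0], [1.0, 1.0]],  # layer 0
    [[2.0, 4.0], [1.0, 1.0]],  # layer 1
]
dispersion = per_token_layer_variance(toy_hidden)
```

The first toy token changes between layers and gets a positive value, while the second is identical in both layers and gets zero. The full per-token sequence, rather than a single last-token or mean-token summary, is what the abstract describes feeding to a learned detector.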

50. 【2604.15736】RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

Link: https://arxiv.org/abs/2604.15736

Authors: Yichen Xu, Yuanhang Liu, Chuhan Wang, Zihan Zhao, jinghan luo, Jianzhe Ma, Wenxuan Wang, Qin Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Large Language, remains insufficiently explored, Multimodal Large

Comments: Work in Progress

Abstract:While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.

51. 【2604.15715】GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Link: https://arxiv.org/abs/2604.15715

Authors: Jize Wang, Xuanxuan Liu, Yining Li, Songyang Zhang, Yijun Wang, Zifei Shan, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: executing simple instructions, general-purpose agents requires, completing complex, development of general-purpose, requires a shift

Abstract:The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at this https URL.

52. 【2604.15706】Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Link: https://arxiv.org/abs/2604.15706

Authors: Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou, Bingni Zhang, Weiguo Feng, Taifeng Wang, Cihang Xie, Fengze Liu

Categories: Computation and Language (cs.CL)

Keywords: Everyday tasks, Neuron-Activated Graph Ranking, Neuron-Activated Graph, Graph Ranking, NAG-based Ranking

Abstract:Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLMs. Concretely, we quantify neuron impact and select the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling, and also outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis on why and how our NAG works, e.g., deactivating NAG-selected neurons (only 0.12% of all) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse "functional backbone" for learning target features. We release the code at this https URL.

53. 【2604.15702】The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

Link: https://arxiv.org/abs/2604.15702

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Nelson and Narens, applying human psychometric, human psychometric methodology, cross-domain behavioural assay, introduce a cross-domain

Comments: 11 pages, 6 figures, 3 tables. Submitted to NeurIPS 2026 Evaluations and Datasets Track. Code, data, and Croissant metadata: [this https URL](https://github.com/synthiumjp/metacognitive-monitoring-battery)

Abstract:We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1-T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson-Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar-based evidence is the primary support). Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type-2 SDT approach, providing preliminary cross-method construct validity. All items, data, and code: this https URL.
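
The abstract defines the withdraw delta as the difference in withdrawal rate between incorrect and correct items. A minimal sketch of that arithmetic follows; the record format `(is_correct, withdrew)` and the function name are assumptions made for this example, not the battery's released code.

```python
def withdraw_delta(items):
    """Withdrawal rate on incorrect items minus withdrawal rate on correct items.

    `items` is a list of (is_correct, withdrew) boolean pairs, one per
    forced-choice response followed by a KEEP/WITHDRAW probe.
    """
    on_incorrect = [withdrew for is_correct, withdrew in items if not is_correct]
    on_correct = [withdrew for is_correct, withdrew in items if is_correct]
    if not on_incorrect or not on_correct:
        raise ValueError("need at least one correct and one incorrect item")
    rate_incorrect = sum(on_incorrect) / len(on_incorrect)
    rate_correct = sum(on_correct) / len(on_correct)
    return rate_incorrect - rate_correct


# Made-up responses: 4 correct (1 withdrawn), 4 incorrect (3 withdrawn).
responses = [
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, True), (False, False), (False, True),
]
delta = withdraw_delta(responses)  # 0.75 - 0.25 = 0.5
```

Under this reading, a model that withdraws everything or nothing scores a delta of zero (the "blanket" profiles in the abstract), while a positive delta indicates withdrawals concentrated on wrong answers.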

54. 【2604.15701】Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

Link: https://arxiv.org/abs/2604.15701

Authors: Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu

Categories: Computation and Language (cs.CL)

Keywords: significant computational demands, distilling reasoning abilities, significant computational, computational demands, demands of large

Comments: Accepted at EMNLP 2025

Abstract:The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.

55. 【2604.15687】Preference Estimation via Opponent Modeling in Multi-Agent Negotiation

Link: https://arxiv.org/abs/2604.15687

Authors: Yuta Konishi, Kento Yamamoto, Eisuke Sonomoto, Rikuho Takeda, Ryo Furukawa, Yusuke Muraki, Takafumi Shimizu, Kazuma Fukumura, Yuya Kanemoto, Takayuki Ito, Shiyao Ding

Categories: Computation and Language (cs.CL)

Keywords: multi-issue settings critically, settings critically depends, Automated negotiation, negotiation in complex, Large Language Models

Comments: This paper is accepted as a Findings of ACL 2026

Abstract:Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLMs) enable rich semantic understanding of utterances, it remains challenging to quantitatively incorporate such information into a consistent opponent modeling. To tackle this issue, we propose a novel preference estimation method integrating natural language information into a structured Bayesian opponent modeling framework. Our approach leverages LLMs to extract qualitative cues from utterances and converts them into probabilistic formats for dynamic belief tracking. Experimental results on a multi-party benchmark demonstrate that our framework improves the full agreement rate and preference estimation accuracy by integrating probabilistic reasoning with natural language understanding.

56. 【2604.15675】C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

Link: https://arxiv.org/abs/2604.15675

Authors: Pufan Zeng, Yilun Liu, Mingchen Dai, Mengyao Piao, Chunguang Zhao, Lingqi Miao, Shimin Tao, Weibin Meng, Minggui He, Chenxin Liu, Zhenzhen Qin, Li Zhang, Hongxia Ma, Boxing Chen, Daimeng Wei

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, alignment in Large, synthetic data generation

Abstract:Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.

57. 【2604.15672】Faster LLM Inference via Sequential Monte Carlo

Link: https://arxiv.org/abs/2604.15672

Authors: Yahya Emara, Mauricio Barba da Costa, Chi-Chih Chang, Cameron Freer, Tim Vieira, Ryan Cotterell, Mohamed S. Abdelfattah

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: accelerates language model, accelerates language, cheap proposal model, cheap proposal, SMC-SD

Abstract:Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free -- SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves 2.36x speed-up over speculative decoding and a 5.2x speed-up over autoregressive decoding, while remaining within 3% of the target model's accuracy on reasoning, instruction-following, and coding benchmarks.
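The reweighting idea in this abstract can be illustrated with a generic importance-resampling step. This is a minimal sketch, not the paper's SMC-SD algorithm: the function names, the toy particles, and the log-probabilities are all assumptions made for illustration, and the sketch resamples whole draft sequences once rather than stepping through tokens.

```python
import math
import random

def normalized_weights(target_logps, draft_logps):
    """Self-normalized importance weights, w_i proportional to p(x_i) / q(x_i)."""
    log_w = [t - d for t, d in zip(target_logps, draft_logps)]
    m = max(log_w)  # subtract the max for numerical stability
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def resample(particles, weights, rng):
    """Multinomial resampling: draw a same-size population in proportion to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))

# Hypothetical draft particles with made-up draft (q) and target (p) scores.
rng = random.Random(0)
particles = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("a", "bird")]
draft_logps = [math.log(0.25)] * 4
target_logps = [math.log(0.6), math.log(0.3), math.log(0.1), math.log(1e-6)]

weights = normalized_weights(target_logps, draft_logps)
population = resample(particles, weights, rng)
```

Unlike rejection sampling, nothing is truncated here: every particle survives with probability proportional to its weight, so the population size stays fixed and the step vectorizes naturally.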

58. 【2604.15648】HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

Link: https://arxiv.org/abs/2604.15648

Authors: Yanbin Wei, Chun Kang, Siwei Li, Haoxuan Che, Yang Chen, Hua Liu, Jian Liu, Zhuang Liu, Can Ouyang, Fei Xing, Lei Sha, Rui Liu, Yu Zhang, James Kwok

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, consistently require, require new arenas

Comments: Under review; open-sourced after acceptance

Abstract:Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there remains a lack of a benchmark to delineate the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, in this paper, we introduce $\texttt{HyperGVL}$, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. $\texttt{HyperGVL}$ provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router $\texttt{WiseHyGR}$ that improves LVLMs in hypergraph via learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.

59. 【2604.15647】CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics

Link: https://arxiv.org/abs/2604.15647

Authors: Ming-Bin Chen, Jey Han Lau, Lea Frermann

Categories: Computation and Language (cs.CL)

Keywords: public deliberation requires, deliberation requires evaluating, Conversational Information Gain, argument structure, public deliberation

Comments: 24 pages, 5 figures

Abstract:Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize CIG, we model an evolving semantic memory of the discussion: the system extracts atomic claims from utterances and incrementally consolidates them into a structured memory state. Using this memory, we score each utterance along three interpretable dimensions: Novelty, Relevance, and Implication Scope. We annotate 80 segments from two moderated deliberative settings (TV debates and community discussions) with these dimensions and show that memory-derived dynamics (e.g., the number of claim updates) correlate more strongly with human-perceived CIG than traditional heuristics such as utterance length or TF--IDF. We develop effective LLM-based CIG predictors paving the way for information-focused conversation quality analysis in dialogues and deliberative success.

60. 【2604.15646】FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use

Link: https://arxiv.org/abs/2604.15646

Authors: Suparno Roy Chowdhury, Tejas Anvekar, Manan Roy Choudhury, Muhammad Ali Khan, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta

Categories: Computation and Language (cs.CL)

Keywords: writing SQL requires, exploring oncology trial, oncology trial repositories, requires schema expertise, Clinicians exploring oncology

Abstract:Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
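The second update signal (apply one atomic mutation, keep the variant only if it returns rows) can be sketched with the standard `sqlite3` module. This is an illustration under assumptions: the `trials` table, its columns, the seed query, and both helper functions are hypothetical and are not taken from the paper.

```python
import sqlite3

# Hypothetical toy schema standing in for an oncology trial database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (id INTEGER, phase INTEGER, biomarker TEXT)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?)",
    [(1, 2, "HER2"), (2, 3, "EGFR"), (3, 3, "HER2")],
)

seed_sql = "SELECT id FROM trials WHERE phase = 3"

def operator_mutations(sql):
    """Yield variants that swap the first ' = ' for one other comparison operator."""
    for op in ("<", ">", ">=", "<=", "!="):
        yield sql.replace(" = ", f" {op} ", 1)

def keep_non_empty(conn, variants):
    """Retain only variants that execute and return at least one row."""
    kept = []
    for sql in variants:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # discard variants that are not valid SQL
        if rows:
            kept.append(sql)
    return kept

accepted = keep_non_empty(conn, operator_mutations(seed_sql))
```

On this toy data the `phase > 3` variant is dropped because no trial matches it, while the other four operator swaps are kept as candidate exemplars.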

61. 【2604.15628】SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding

Link: https://arxiv.org/abs/2604.15628

Authors: Keisuke Gomi, Keiji Yanai

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Cross-modal retrieval, Multimodal Large Language, dietary logging, nutritional management, Single Integrated Multimodal

Comments: 20 pages, 6 figures

Abstract: Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
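For readers unfamiliar with the R@1 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for paired retrieval, assuming query `i` has exactly one correct item at index `i`. The similarity values are made up, and the function is illustrative rather than the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Hypothetical image-to-recipe similarity matrix (rows: images, columns: recipes).
similarity = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(similarity, 1)
r2 = recall_at_k(similarity, 2)
```

Here the second image ranks its own recipe second, so it counts as a miss at K=1 and a hit at K=2.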

62. 【2604.15621】Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking

Link: https://arxiv.org/abs/2604.15621

Authors: Jun Feng, Jiahui Tang, Zhicheng He, Hang Lv, Hongchao Gu, Hao Wang, Xuezhi Yang, Shuai Fang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Generation aims, retrieving supplementary passages, Large Language Models, Adaptive Retrieval-Augmented Generation, aims to mitigate

Comments: 7 pages, 2 figures

Abstract:Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for weaker models to overcome their limitations, while serving as a cost-effective efficiency optimizer for stronger reasoning models.

63. 【2604.15607】Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies

Link: https://arxiv.org/abs/2604.15607

Authors: Myke C. Cohen, Mingqian Zheng, Neel Bhandari, Hsien-Te Kao, Xuhui Zhou, Daniel Nguyen, Laura Cassani, Maarten Sap, Svitlana Volkova

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: design characteristics, human, imperfectly cooperative scenarios, human personality traits, impact the quality

Comments: Will be presented at ACL 2026 and published in the Findings of the Association for Computational Linguistics: ACL 2026

Abstract:AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes -- particularly transparency -- were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.

64. 【2604.15602】GroupDPO: Memory efficient Group-wise Direct Preference Optimization

Link: https://arxiv.org/abs/2604.15602

Authors: Jixuan Leng, Si Si, Hsiang-Fu Yu, Vinod Raman, Inderjit S. Dhillon

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, align Large Language, Language Models, Large Language, align Large

Abstract:Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.

65. 【2604.15597】LLMs Corrupt Your Documents When You Delegate

Link: https://arxiv.org/abs/2604.15597

Authors: Philippe Laban, Tobias Schnabel, Jennifer Neville

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Large Language Models, Large Language, disrupt knowledge work, Language Models, knowledge work

Abstract:Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

66. 【2604.15593】DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation

Link: https://arxiv.org/abs/2604.15593

Authors: Chao Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, models compress heterogeneous, language models compress, Large language, compress heterogeneous knowledge

Abstract:Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it first resolves domain uncertainty, then relation uncertainty, and finally concept uncertainty, so each stage operates under explicit algebraic constraints. The framework requires only three ingredients: a lattice of domains with computable meet, join, and implication; a typing function over relations that controls inheritance across domains; and a fiber partition that localizes knowledge to domain-specific subsets. Given these ingredients, DALM yields a three-phase encoder-decoder architecture in which generation is confined to a domain fiber, cross-domain contamination is structurally prevented in closed-vocabulary mode and auditably bounded in open-vocabulary mode, and a single query can produce a domain-indexed multi-perspective answer space. We instantiate the framework with the CDC knowledge representation system and outline training and evaluation on validated domain-annotated crystal libraries. DALM reframes language generation as algebraically constrained structured denoising rather than unconstrained decoding over a flat token space.

67. 【2604.15589】LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

Link: https://arxiv.org/abs/2604.15589

Authors: Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Justin K.W. Yeoh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: automated code compliance, training decisions affect, Existing research, large language models, research on large

Comments: 8 pages, 9 figures. Accepted at ICCCBE 2026 (International Conference on Computing in Civil and Building Engineering)

Abstract:Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.

68. 【2604.15588】"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

Link: https://arxiv.org/abs/2604.15588

Authors: Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao, Xiaozhong Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Large Language Models, Language Models, Large Language, workflows presents exciting, presents exciting opportunities

Comments: ACL 2026 Main Conference

Abstract:The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.

69. 【2604.15574】Why Fine-Tuning Encourages Hallucinations and How to Fix It

Link: https://arxiv.org/abs/2604.15574

Authors: Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner, Yuval Reif, Swabha Swayamdipta, Derek Hoiem, Roy Schwartz

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: Large language models, factually incorrect statements, hallucinating factually incorrect, Large language, incorrect statements

Abstract:Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups, can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
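One common way to regularize output-distribution drift is to add a KL penalty between a frozen pre-fine-tuning model and the model being trained. The sketch below assumes that form; the abstract does not state the exact objective, so the KL direction, the weight `lam`, and all function names here are assumptions rather than the paper's method.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(student_logits, teacher_logits, target_index, lam):
    """Cross-entropy on the new label plus lam * KL(teacher || student).

    teacher_logits come from a frozen copy of the model before fine-tuning
    (an assumption of this sketch), so the KL term penalizes drift away
    from the original output distribution.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    cross_entropy = -math.log(student[target_index])
    drift = kl_divergence(teacher, student)
    return cross_entropy + lam * drift

no_drift = regularized_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 2, 0.5)
with_drift = regularized_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], 2, 0.5)
plain_ce = regularized_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], 2, 0.0)
```

When the student matches the teacher the penalty vanishes and the loss reduces to plain cross-entropy; any divergence from the teacher adds a positive cost.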

70. 【2604.15558】Preregistered Belief Revision Contracts

Link: https://arxiv.org/abs/2604.15558

Authors: Saad Alqithami

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)

Keywords: Deliberative multi-agent systems, Deliberative multi-agent, multi-agent systems, systems allow agents, agents to exchange

Abstract:Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non-fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. In this paper, (a) we prove that under evidential contracts with conservative fallback, social-only rounds cannot increase confidence and cannot generate purely conformity-driven wrong-but-sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token-invariant contracts, (d) we prove that enforced trajectories depend only on token-exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness-liveness trade-offs.
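The acceptance rule described in this abstract (a non-fallback step must cite a preregistered trigger and supply a non-empty witness set of externally validated evidence tokens) can be sketched as a simple router check. The class and field names below are hypothetical, and the sketch omits the revision operators, priority rule, and fallback policy that a full contract would also fix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    triggers: frozenset          # preregistered trigger names
    validated_tokens: frozenset  # evidence tokens that passed external validation

@dataclass(frozen=True)
class RevisionStep:
    trigger: str          # trigger the agent cites for this belief change
    witnesses: frozenset  # evidence tokens offered in support

def accept(contract, step):
    """Accept a non-fallback step only if it is backed by validated evidence."""
    return (
        step.trigger in contract.triggers
        and len(step.witnesses) > 0
        and step.witnesses <= contract.validated_tokens
    )

contract = Contract(
    triggers=frozenset({"new_measurement"}),
    validated_tokens=frozenset({"tok-1", "tok-2"}),
)
evidential = RevisionStep("new_measurement", frozenset({"tok-1"}))
social_only = RevisionStep("majority_agrees", frozenset())
unvalidated = RevisionStep("new_measurement", frozenset({"tok-9"}))
```

A step that cites only agreement or majority size carries no witness tokens and is rejected, which is the intuition behind the claim that social-only rounds cannot raise confidence.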

71. 【2604.15557】Predicting Where Steering Vectors Succeed

Link: https://arxiv.org/abs/2604.15557

Authors: Jayadev Billa

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Linear Accessibility Profile, running an intervention, Accessibility Profile, Linear Accessibility, Steering vectors work

Comments: 19 pages, incl. 10 appendix pages, 4 figures, 20 tables

Abstract:Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{\mathrm{lin}}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.
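The logit lens that this abstract builds on projects an intermediate hidden state through the model's unembedding matrix and reads off token probabilities. The sketch below shows only that projection on toy numbers. It does not reproduce the paper's $A_{\mathrm{lin}}$ measure, whose definition the abstract does not give, and the matrix, hidden states, and layer names are invented for illustration.

```python
import math

def logit_lens(hidden, unembed):
    """Project a hidden state (length d) through an unembedding matrix (d x V)
    and return softmax probabilities over the V vocabulary entries."""
    vocab = len(unembed[0])
    logits = [sum(h * row[v] for h, row in zip(hidden, unembed)) for v in range(vocab)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy unembedding matrix with d=2 hidden dimensions and V=3 tokens.
unembed = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
# Hypothetical hidden states at two layers for the same input position.
layer_states = {"early": [0.1, 0.1], "late": [2.0, 0.0]}

# Per-layer probability assigned to token 0, a stand-in for a concept token.
profile = {name: logit_lens(h, unembed)[0] for name, h in layer_states.items()}
```

Because the projection reuses existing weights, a per-layer profile like this needs no training, which matches the abstract's description of the diagnostic.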

72. 【2604.15547】Consistency Analysis of Sentiment Predictions using Syntactic Semantic Context Assessment Summarization (SSAS)

Link: https://arxiv.org/abs/2604.15547

Authors: Sharookh Daruwalla, Nitin Mayande, Shreeya Verma Kathuria, Nitin Joglekar, Charles Weber

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, LLMs' inherent stochasticity, Language Models, Large Language, enterprise-grade analytics

Comments: 27 pages, 2 figures. arXiv admin note: text overlap with [arXiv:2604.12049](https://arxiv.org/abs/2604.12049)

Abstract:The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs' inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern datasets, renders sentiment predictions too volatile for strategic business decisions. To resolve this, we present a Syntactic Semantic Context Assessment Summarization (SSAS) framework for establishing context. Context established by SSAS functions as a sophisticated data pre-processing framework that enforces a bounded attention mechanism on LLMs. It achieves this by applying a hierarchical classification structure (Themes, Stories, Clusters) and an iterative Summary-of-Summaries (SoS) based context computation architecture. This endows the raw text with high-signal, sentiment-dense prompts, that effectively mitigate both irrelevant data and analytical variance. We empirically evaluated the efficacy of SSAS, using Gemini 2.0 Flash Lite, against a direct-LLM approach across three industry-standard datasets - Amazon Product Reviews, Google Business Reviews, Goodreads Book Reviews - and multiple robustness scenarios. Our results show that our SSAS framework is capable of significantly improving data quality, up to 30%, through a combination of noise removal and improvement in the estimation of sentiment prediction. Ultimately, consistency in our context-estimation capabilities provides a stable and reliable evidence base for decision-making.

73. 【2604.15505】PolicyBank: Evolving Policy Understanding for LLM Agents

Link: https://arxiv.org/abs/2604.15505

Authors: Jihye Choi, Jinsung Yoon, Long T. Le, Somesh Jha, Tomas Pfister

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM agents operating, authorization constraints typically, LLM agents, natural language, operating under organizational

Abstract:LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre-deployment testing, can it autonomously refine its interpretation to close specification gaps? We propose PolicyBank, a memory mechanism that maintains structured, tool-level policy insights and iteratively refines them -- unlike existing memory mechanisms that treat the policy as immutable ground truth, reinforcing "compliant but wrong" behaviors. We also contribute a systematic testbed by extending a popular tool-calling benchmark with controlled policy gaps that isolate alignment failures from execution failures. While existing memory mechanisms achieve near-zero success on policy-gap scenarios, PolicyBank closes up to 82% of the gap toward a human oracle.

74. 【2604.15503】Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences

Link: https://arxiv.org/abs/2604.15503

Authors: Jingnong Qu, Ashvin Ranjan, Shane Steinert-Threlkeld

Categories: Computation and Language (cs.CL)

Keywords: Recent breakthroughs, raised the question, neural networks, networks have raised, called Brain Score

Abstract:Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data -- the human genome, Python, and pure hierarchical structure (nested parentheses) -- also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models' ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.

75. 【2604.15490】Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch

链接https://arxiv.org/abs/2604.15490

作者:Eleanor M. Lin,David Jurgens

类目:Computation and Language (cs.CL)

关键词:increasingly complex mathematical, solve increasingly complex, Recent developments, models, reasoning

备注

点击查看摘要

Abstract:Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-switching as an undesirable error, attempted to control code-switching through modifications to input prompts or the output decoding process, or focused on narrow subsets of languages, domains, tasks, and models. We address these gaps by introducing the first linguistically and behaviorally motivated fine-tuning framework for identifying beneficial code-switched reasoning behaviors in large language models and teaching these models to code-switch more effectively for reasoning. First, we create and systematically analyze a dataset of reasoning traces from diverse models, languages, tasks, and domains to understand the types of code-switching behaviors found in existing reasoning models. Then, we develop fine-tuning interventions that teach reasoning models to code-switch based on our observations of helpful behaviors in existing models. We find that our framework can significantly increase beneficial code-switched reasoning behaviors in a data-efficient manner. Interestingly, we also find that code-switching behaviors in reasoning models can be modified by fine-tuning for tasks that do not directly demonstrate code-switching in reasoning (e.g., machine translation). Our work suggests that data-efficient interventions can instill helpful forms of code-switching behavior in reasoning models.

76. 【2604.15488】FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

链接https://arxiv.org/abs/2604.15488

作者:Zixuan Weng,Jinghuai Zhang,Kunlin Cai,Ying Li,Peiran Wang,Yuan Tian

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, exhibit undesirable behaviors, violations and hallucinations, exhibit undesirable

备注: Accepted by ACL 2026 (Main)

点击查看摘要

Abstract:Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigid, one-size-fits-all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages: conditional steering and fine-grained vector synthesis, allowing fine-grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace-guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture-of-Steering-Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query-specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training-efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at this https URL
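The two-stage idea in this abstract (first decide whether to steer, then build a query-specific steering vector) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the energy-ratio gate, the `threshold` and `strength` parameters, and the softmax weighting over `experts` are hypothetical stand-ins, not the paper's SCS or MoSE mechanisms.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def steer(hidden, basis, experts, threshold=0.5, strength=1.0):
    """Return a (possibly) steered copy of one hidden state.

    hidden:  list[float], a single hidden-state vector
    basis:   orthonormal vectors spanning a hypothetical "target" subspace
    experts: candidate steering vectors (hypothetical)
    """
    # Stage 1 (conditional): fraction of the hidden state's energy in the subspace.
    energy = sum(dot(hidden, b) ** 2 for b in basis)
    norm_sq = dot(hidden, hidden)
    if norm_sq == 0.0 or energy / norm_sq < threshold:
        return list(hidden)  # leave out-of-subspace (general) inputs untouched
    # Stage 2 (synthesis): weight each expert by its similarity to the hidden state.
    weights = softmax([dot(hidden, e) for e in experts])
    vector = [sum(w * e[i] for w, e in zip(weights, experts))
              for i in range(len(hidden))]
    return [h + strength * v for h, v in zip(hidden, vector)]
```

With `basis=[[1.0, 0.0]]`, the state `[0.0, 1.0]` has no energy in the subspace and is returned unchanged, while `[1.0, 0.0]` passes the gate and receives the weighted expert vector.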

77. 【2604.15461】Evaluating LLM Simulators as Differentially Private Data Generators

链接https://arxiv.org/abs/2604.15461

作者:Nassima M. Bouzid,Dehao Yuan,Nam H. Nguyen,Mayana Pereira

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:traditional differentially private, high-dimensional user profiles, generating complex synthetic, complex synthetic data, differentially private

备注: Submitted to ICLR 2026. 6 pages + appendix

点击查看摘要

Abstract:LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases--learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.

78. 【2604.15400】Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

链接https://arxiv.org/abs/2604.15400

作者:G. Aytug Akarlar

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:autoregressive language models, present causal evidence, autoregressive language, language models, governed by asymmetric

备注: 21 pages, 12 figures, 8 tables. Code and data: [this https URL](https://github.com/akarlaraytu/trajectory-commitment)

点击查看摘要

Abstract:We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
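The divergence measurement mentioned in this abstract compares the next-token distributions of two trajectories at each step. A minimal sketch of that comparison, assuming each distribution is available as a plain list of probabilities over the same vocabulary; the toy distributions are made up for illustration and are not the paper's data.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

# Step 0: both samples share the prompt, so the distributions are identical.
step0_a = [0.7, 0.2, 0.1]
step0_b = [0.7, 0.2, 0.1]

# Step 1: hypothetical distributions after the two samples pick different first tokens.
step1_a = [0.9, 0.05, 0.05]
step1_b = [0.05, 0.9, 0.05]

print(kl_divergence(step0_a, step0_b))  # identical inputs give zero divergence
print(kl_divergence(step1_a, step1_b))  # the mass has moved to a different token
```

KL is asymmetric, so the direction of comparison matters; which direction the paper reports is not stated in the abstract.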

79. 【2604.15371】Applied Explainability for Large Language Models: A Comparative Study

链接https://arxiv.org/abs/2604.15371

作者:Venkata Abhinandan Kancharla

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:achieve strong performance, decision processes remain, processes remain difficult, Large language models, language processing tasks

备注: 14 pages, 3 figures, comparative study of explainability methods for transformer-based NLP models; also available on Zenodo

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review.
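Of the three techniques compared, Integrated Gradients has the most compact definition: each feature's attribution is its distance from a baseline times the gradient averaged along the straight path from the baseline to the input. The sketch below applies it to a tiny logistic model rather than DistilBERT; the weights, the zero baseline, and the midpoint-rule step count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy logistic classifier standing in for a real network."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate the path integral with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(grad(point, w, b)):
            totals[i] += g
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]

w, b = [2.0, -1.0, 0.0], 0.1
x, baseline = [1.0, 0.5, 3.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
```

A useful sanity check is the completeness property: the attributions should sum (up to integration error) to `model(x) - model(baseline)`, and a feature with zero weight should receive zero attribution.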

80. 【2604.15351】Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures

链接https://arxiv.org/abs/2604.15351

作者:Abdulmalek Saket

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Low-Rank Adaptation, standard practice applies, applies LoRA adapters, large language models, practice applies LoRA

备注: 11 pages, 5 figures, 2 frozen evidence campaigns, 81 experiment rows across 14 successful models and 8 architecture families, plus one documented failed Pythia/GPT-NeoX attempt

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become the dominant parameter-efficient fine-tuning method for large language models, yet standard practice applies LoRA adapters uniformly to all transformer layers regardless of their relevance to the downstream task. We introduce Aletheia, a gradient-guided layer selection method that identifies the most task-relevant layers via a lightweight gradient probe and applies LoRA adapters only to those layers with asymmetric rank allocation. Across 81 experiment rows covering 14 successful models from 8 architecture families (0.5B-72B parameters, including dense and Mixture-of-Experts architectures), with one additional documented failed Pythia/GPT-NeoX attempt in Campaign 2, Aletheia achieves a 15-28% training speedup (mean 23.1%, p < 0.001) with bounded extra forgetting and broadly matched downstream behavior on the evaluated MMLU, GSM8K, and HumanEval benchmark pack. Across the tested families and scales, Campaign 1 shows a 100% per-model speed win rate and Campaign 2 shows broadly preserved downstream behavior within a bounded-degradation framing. Together these results support a practical model-economics claim: intelligent layer selection can make LoRA fine-tuning materially more efficient without introducing major downstream damage on the evaluated set.
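The selection step described here can be pictured as ranking layers by a gradient-based score and giving adapters only to the top ones, with larger ranks for the highest-scoring layers. The sketch below assumes the probe has already produced one gradient norm per layer; `keep_fraction`, `high_rank`, `low_rank`, and the half-and-half split are hypothetical choices, not values from the paper.

```python
def select_layers(grad_norms, keep_fraction=0.5, high_rank=16, low_rank=4):
    """Map each selected layer index to a LoRA rank, given per-layer gradient norms."""
    n_keep = max(1, round(len(grad_norms) * keep_fraction))
    # Layers sorted from largest to smallest gradient norm.
    ranked = sorted(range(len(grad_norms)),
                    key=lambda i: grad_norms[i], reverse=True)
    selected = ranked[:n_keep]
    # Asymmetric allocation: the top half of the selected layers gets the larger rank.
    n_high = (n_keep + 1) // 2
    return {layer: (high_rank if pos < n_high else low_rank)
            for pos, layer in enumerate(selected)}

probe = [0.1, 0.9, 0.4, 0.7, 0.2, 0.05]  # made-up gradient norms for six layers
print(select_layers(probe))
```

Layers missing from the returned mapping would simply receive no adapter, which is where any training-time saving would come from.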

81. 【2604.15329】Evaluating LLMs as Human Surrogates in Controlled Experiments

链接https://arxiv.org/abs/2604.15329

作者:Adnan Hoq,Tim Weninger

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, LLM-generated data support, remains unclear, experimental inferences

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accuracy perception. Each human observation is converted into a structured prompt, and models generate a single 0--10 outcome variable without task-specific training; identical statistical analyses are applied to human and synthetic responses. We find that LLMs reproduce several directional effects observed in humans, but effect magnitudes and moderation patterns vary across models. Off-the-shelf LLMs therefore capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects, clarifying when LLM-generated data can function as behavioral surrogates.

82. 【2604.15322】Acoustic and Facial Markers of Perceived Conversational Success in Spontaneous Speech

链接https://arxiv.org/abs/2604.15322

作者:Thanushi Withanage,Elizabeth Redcay,Carol Espy-Wilson

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Individuals often align, engagement and rapport, align their speaking, speaking patterns, phenomenon linked

备注: Accepted for presentation at ICASSP 2026

点击查看摘要

Abstract:Individuals often align their speaking patterns with their interlocutors, a phenomenon linked to engagement and rapport. While well documented in task-oriented dialogues, less is known about entrainment in naturalistic, non-task and virtual settings. In this study, we analyze a large corpus of spontaneous dyadic Zoom conversations to examine how conversational dynamics relate to perceived interaction quality. We extract multimodal features encompassing turn-taking, pauses, facial movements, and acoustic measures such as pitch and intensity. Perceived conversational success was quantified via factor analysis of post-conversation ratings. Results demonstrate that entrainment is reliably detected in spontaneous speech and correlates with higher perceived success. These findings identify key interactional markers of conversational quality and highlight opportunities for targeted interventions to foster more effective and engaging communication.

83. 【2308.10562】Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

链接https://arxiv.org/abs/2308.10562

作者:Delfina Sol Martinez Pandiani,Valentina Presutti

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:tasks remains unclear, high-level visual sensemaking, high-level visual, Computer Vision, image classification

备注: Preprint

点击查看摘要

Abstract:The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic image classification. Our survey contributes in three main ways: Firstly, it clarifies the tacit understanding of high-level semantics in CV through a multidisciplinary analysis, and categorization into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics. Secondly, it identifies and categorizes computer vision tasks associated with high-level visual sensemaking, offering insights into the diverse research areas within this domain. Lastly, it examines how abstract concepts such as values and ideologies are handled in CV, revealing challenges and opportunities in AC-based image classification. Notably, our survey of AC image classification tasks highlights persistent challenges, such as the limited efficacy of massive datasets and the importance of integrating supplementary information and mid-level features. We emphasize the growing relevance of hybrid AI systems in addressing the multifaceted nature of AC image classification tasks. Overall, this survey enhances our understanding of high-level visual reasoning in CV and lays the groundwork for future research endeavors.

信息检索

1. 【2604.16121】Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation

链接https://arxiv.org/abs/2604.16121

作者:Xibo Li,Liang Zhang

类目:Information Retrieval (cs.IR)

关键词:costly model retraining, mitigating data sparsity, requiring costly model, improving inference accuracy, model retraining

备注: 10 pages. arXiv admin note: text overlap with [arXiv:2504.04843](https://arxiv.org/abs/2504.04843) by other authors

点击查看摘要

Abstract:Test-time augmentation (TTA) has become a promising approach for mitigating data sparsity in sequential recommendation by improving inference accuracy without requiring costly model retraining. However, existing TTA methods typically rely on uniform, user-agnostic augmentation strategies. We show that this "one-size-fits-all" design is inherently suboptimal, as it neglects substantial behavioral heterogeneity across users, and empirically demonstrate for the first time that the optimal augmentation operators vary significantly across user sequences with different characteristics. To address this limitation, we propose AdaTTA, a plug-and-play reinforcement learning-based adaptive inference framework that learns to select sequence-specific augmentation operators on a per-sequence basis. We formulate augmentation selection as a Markov Decision Process and introduce an Actor-Critic policy network with hybrid state representations and a joint macro-rank reward design to dynamically determine the optimal operator for each input user sequence. Extensive experiments on four real-world datasets and two recommendation backbones demonstrate that AdaTTA consistently outperforms the best fixed-strategy baselines, achieving up to 26.31% relative improvement on the Home dataset while incurring only moderate computational overhead.

2. 【2604.15882】JFinTEB: Japanese Financial Text Embedding Benchmark

链接https://arxiv.org/abs/2604.15882

作者:Masahiro Suzuki,Hiroki Sakaji

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Japanese financial text, evaluating Japanese financial, Japanese financial, financial text, comprehensive benchmark specifically

备注: 5 pages. Accepted at SIGIR 2026 Resource Track

点击查看摘要

Abstract:We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at this https URL to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and establishes a foundation for advancing domain-specific embedding research.

3. 【2604.15827】UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

链接https://arxiv.org/abs/2604.15827

作者:Tobias Schimanski,Stefanie Lewandowski,Christian Woerle,Nicola Reichenau,Yauheni Huryn,Markus Leippold

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:concerned with identifying, Conventional information retrieval, query, Conventional, information retrieval

备注

点击查看摘要

Abstract:Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.

4. 【2604.15788】Scattered Hypothesis Generation for Open-Ended Event Forecasting

链接https://arxiv.org/abs/2604.15788

作者:He Chang,Zhulin Tao,Lifang Yang,Xianglin Huang,Yunshan Ma

类目:Information Retrieval (cs.IR)

关键词:current LLM-based methods, LLM-based methods predominantly, methods predominantly target, open-ended event forecasting, risk management

备注

点击查看摘要

Abstract:Despite the importance of open-ended event forecasting for risk management, current LLM-based methods predominantly target only the most probable outcomes, neglecting the intrinsic uncertainty of real-world events. To bridge this gap, we advance open-ended event forecasting from pinpoint forecasting to scatter forecasting by introducing the proxy task of hypothesis generation. This paradigm aims to generate an inclusive and diverse set of hypotheses that broadly cover the space of plausible future events. To this end, we propose SCATTER, a reinforcement learning framework that jointly optimizes inclusiveness and diversity of the hypothesis. Specifically, we design a novel hybrid reward that consists of three components: 1) a validity reward that measures semantic alignment with observed events, 2) an intra-group diversity reward to encourage variation within sampled responses, and 3) an inter-group diversity reward to promote exploration across distinct modes. By integrating the validity-gated score into the overall objective, we confine the exploration of wildly diversified outcomes to contextually plausible futures, preventing the mode collapse issue. Experiments on two real-world benchmark datasets, i.e., OpenForecast and OpenEP, demonstrate that SCATTER significantly outperforms strong baselines. Our code is available at this https URL.

5. 【2604.15739】On the Equivalence Between Auto-Regressive Next Token Prediction and Full-Item-Vocabulary Maximum Likelihood Estimation in Generative Recommendation--A Short Note

链接https://arxiv.org/abs/2604.15739

作者:Yusheng Huang,Shuang Yang,Zhaojie Liu,Han Li

类目:Information Retrieval (cs.IR)

关键词:Generative recommendation, auto-regressive next-token prediction, next-token prediction, industrial sequential recommendation, Generative

备注: Work in progress

点击查看摘要

Abstract:Generative recommendation (GR) has emerged as a widely adopted paradigm in industrial sequential recommendation. Current GR systems follow a similar pipeline: tokenization for item indexing, next-token prediction as the training objective and auto-regressive decoding for next-item generation. However, existing GR research mainly focuses on architecture design and empirical performance optimization, with few rigorous theoretical explanations for the working mechanism of auto-regressive next-token prediction in recommendation scenarios. In this work, we formally prove that the k-token auto-regressive next-token prediction (AR-NTP) paradigm is strictly mathematically equivalent to full-item-vocabulary maximum likelihood estimation (FV-MLE), under the core premise of a bijective mapping between items and their corresponding k-token sequences. We further show that this equivalence holds for both cascaded and parallel tokenizations, the two most widely used schemes in industrial GR systems. Our result provides the first formal theoretical foundation for the dominant industrial GR paradigm, and offers principled guidance for future GR system optimization.
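The core of the claimed equivalence can be sketched with the chain rule of probability. The notation below is an assumption introduced for illustration ($\phi$ for the item-to-token mapping, $h$ for the user history, $c_j$ for the $j$-th token); the paper's own proof and symbols may differ.

```latex
% Assumption: \phi : \mathcal{I} \to \mathcal{V}^k is a bijection, so every item i
% has exactly one token sequence \phi(i) = (c_1, \dots, c_k) and vice versa.
\begin{align}
p_\theta(i \mid h)
  &= p_\theta(c_1, \dots, c_k \mid h)
   = \prod_{j=1}^{k} p_\theta(c_j \mid c_{<j}, h), \\
-\log p_\theta(i \mid h)
  &= -\sum_{j=1}^{k} \log p_\theta(c_j \mid c_{<j}, h).
\end{align}
```

The left-hand side of the second line is the per-example item-level negative log-likelihood and the right-hand side is the summed next-token loss over the item's $k$ tokens, so under the bijection assumption the two objectives coincide term by term.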

6. 【2604.15704】Intent Propagation Contrastive Collaborative Filtering

链接https://arxiv.org/abs/2604.15704

作者:Haojie Li,Junwei Du,Guanfeng Liu,Feng Jiang,Yan Wang,Xiaofang Zhou

类目:Information Retrieval (cs.IR)

关键词:collaborative filtering uncover, filtering uncover interaction, Disentanglement, collaborative filtering, uncover interaction intents

备注: 15 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Disentanglement techniques used in collaborative filtering uncover interaction intents between nodes, improving the interpretability of node representations and enhancing recommendation performance. However, existing disentanglement methods still face two problems. First, they focus on local structural features derived from direct node interactions and overlook the comprehensive graph structure, which limits disentanglement accuracy. Second, the disentanglement process depends on backpropagation signals derived from recommendation tasks and lacks direct supervision, which may lead to biases and overfitting. To address these issues, we propose the Intent Propagation Contrastive Collaborative Filtering (IPCCF) algorithm. Specifically, we design a double helix message propagation framework to more effectively extract the deep semantic information of nodes, thereby improving the model's understanding of interactions between nodes. We also develop an intent message propagation method that incorporates graph structure information into the disentanglement process, thereby expanding the consideration scope of disentanglement. In addition, contrastive learning techniques are employed to align node representations derived from structure and intents, providing direct supervision for the disentanglement process, mitigating biases, and enhancing the model's robustness to overfitting. Experiments on three real data graphs illustrate the superiority of the proposed approach.

7. 【2604.15650】Sample Is Feature: Beyond Item-Level, Toward Sample-Level Tokens for Unified Large Recommender Models

链接https://arxiv.org/abs/2604.15650

作者:Shuli Wang,Junwei Yin,Changhao Li,Senjie Kou,Chi Wang,Yinqiu Huang,Yinhua Zhu,Haitao Wang,Xingxing Wang

类目:Information Retrieval (cs.IR)

关键词:single Transformer backbone, Transformer backbone, single Transformer, sample information scaling, longer behavior sequences

备注

点击查看摘要

Abstract:Scaling industrial recommender models has followed two parallel paradigms: sample information scaling -- enriching the information content of each training sample through deeper and longer behavior sequences -- and model capacity scaling -- unifying sequence modeling and feature interaction within a single Transformer backbone. However, these two paradigms still face two structural limitations. Firstly, sample information scaling methods encode only a subset of each historical interaction into the sequence token, leaving the majority of the original sample context unexploited and precluding the modeling of sample-level, time-varying features. Secondly, model capacity scaling methods are inherently constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. To address these issues, we propose SIF (Sample Is Feature), which encodes each historical Raw Sample directly into the sequence token -- maximally preserving sample information while simultaneously resolving the heterogeneity between sequential and non-sequential features. SIF consists of two key components. The Sample Tokenizer quantizes each historical Raw Sample into a Token Sample via hierarchical group-adaptive quantization (HGAQ), enabling full sample-level context to be incorporated into the sequence efficiently. The SIF-Mixer then performs deep feature interaction over the homogeneous sample representations via token-level and sample-level mixing, fully unleashing the model's representational capacity. Extensive experiments on a large-scale industrial dataset validate SIF's effectiveness, and we have successfully deployed SIF on the Meituan food delivery platform.

8. 【2604.15628】SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding

Link: https://arxiv.org/abs/2604.15628

Authors: Keisuke Gomi, Keiji Yanai

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Cross-modal retrieval, Multimodal Large Language, dietary logging, nutritional management, Single Integrated Multimodal

Notes: 20 pages, 6 figures

Click to view abstract

Abstract:Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
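
For readers unfamiliar with the R@1 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for paired retrieval. The function name `recall_at_k` and the toy similarity matrix are assumptions made for illustration; any subset sampling or averaging used by the actual benchmark protocol is not reproduced, and this is not the authors' evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item is among the top-k results.

    similarity: square 2-D list; similarity[i][j] scores query i against
                candidate j, and candidate i is the ground-truth match.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for query i.
        top_k = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        hits += i in top_k
    return hits / len(similarity)


# Toy image-to-recipe similarity scores (made up for illustration).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(sim, 1)
r_at_2 = recall_at_k(sim, 2)
```

In this toy case the second query only finds its match at rank 2, so Recall@1 is lower than Recall@2.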

9. 【2604.15621】Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking

Link: https://arxiv.org/abs/2604.15621

Authors: Jun Feng, Jiahui Tang, Zhicheng He, Hang Lv, Hongchao Gu, Hao Wang, Xuezhi Yang, Shuai Fang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Generation aims, retrieving supplementary passages, Large Language Models, Adaptive Retrieval-Augmented Generation, aims to mitigate

Notes: 7 pages, 2 figures

Click to view abstract

Abstract:Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for weaker models to overcome their limitations, while serving as a cost-effective efficiency optimizer for stronger reasoning models.

10. 【2604.15591】BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

Link: https://arxiv.org/abs/2604.15591

Authors: Mengfei Lan, Lecheng Zheng, Halil Kilicoglu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Effective biomedical information, requires modeling domain, modeling domain semantics, information retrieval requires, retrieval requires modeling

Notes: Accepted by ACL 2026 Main Conference

Click to view abstract

Abstract:Effective biomedical information retrieval requires modeling domain semantics and hierarchical relationships among biomedical texts. Existing biomedical generative retrievers build on coarse binary relevance signals, limiting their ability to capture semantic overlap. We propose BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning), which leverages hierarchical MeSH annotations to provide structured supervision for multi-label contrastive learning. Our models, BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B), achieve promising performance on biomedical retrieval, sentence similarity, and question answering tasks, while remaining computationally efficient for deployment.

11. 【2604.15581】Learning Behaviorally Grounded Item Embeddings via Personalized Temporal Contexts

Link: https://arxiv.org/abs/2604.15581

Authors: Rafael T. Sereicikas, Pedro R. Pires, Gregorio F. Azevedo, Tiago A. Almeida

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Effective user modeling, long-term preference evolution, modeling requires distinguishing, user modeling requires, Effective user

Notes: Accepted to be published in UMAP'26, 9 pages, 7 figures

Click to view abstract

Abstract:Effective user modeling requires distinguishing between short-term and long-term preference evolution. While item embeddings have become a key component of recommender systems, standard approaches like Item2Vec treat user histories as unordered sets (bag-of-items), implicitly assuming that interactions separated by minutes are as semantically related as those separated by months. This simplification flattens the rich temporal structure of user behavior, obscuring the distinction between coherent consumption sessions and gradual interest drifts. In this work, we introduce TAI2Vec (Time-Aware Item-to-Vector), a family of lightweight embedding models that integrates temporal proximity directly into the representation learning process. Unlike approaches that apply global time constraints, TAI2Vec is user-adaptive, tailoring its temporal definitions to individual interaction paces. We propose two complementary strategies: TAI2Vec-Disc, which utilizes personalized anomaly detection to dynamically segment interactions into semantic sessions, and TAI2Vec-Cont, which employs continuous, user-specific decay functions to weigh item relationships based on their relative temporal distance. Experimental results across eight diverse datasets demonstrate that TAI2Vec consistently produces more accurate and behaviorally grounded representations than static baselines, achieving competitive or superior performance in over 80% of the datasets, with improvements of up to 135%. The source code is publicly available at this https URL.
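
To make the idea of a continuous, user-specific decay more concrete, here is a minimal sketch that weights item pairs by their temporal distance. The exponential form, the use of each user's median gap between interactions as the time scale, and the names `user_time_scale` and `pair_weights` are all assumptions made for illustration; the abstract does not specify the actual decay functions used by TAI2Vec-Cont.

```python
import math
from statistics import median


def user_time_scale(timestamps):
    """Median gap between consecutive interactions (hypothetical choice of scale)."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else 1.0


def pair_weights(history):
    """Weight every item pair of one user by exp(-|dt| / tau_user).

    history: list of (item_id, timestamp) tuples for a single user.
    Returns a dict {(item_a, item_b): weight} with weights in (0, 1].
    """
    # Fall back to 1.0 if all interactions share the same timestamp.
    tau = user_time_scale([t for _, t in history]) or 1.0
    weights = {}
    for i, (item_a, t_a) in enumerate(history):
        for item_b, t_b in history[i + 1:]:
            weights[(item_a, item_b)] = math.exp(-abs(t_b - t_a) / tau)
    return weights


# Toy history: two items minutes apart, a third one about a month later.
history = [("a", 0.0), ("b", 600.0), ("c", 2_592_000.0)]
w = pair_weights(history)
```

Under these assumptions the pair consumed within minutes keeps a weight close to 1, while pairs separated by a month are strongly down-weighted, which is the contrast with bag-of-items training that the abstract describes.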

12. 【2604.15573】Collaborative Filtering Through Weighted Similarities of User and Item Embeddings

Link: https://arxiv.org/abs/2604.15573

Authors: Pedro R. Pires, Rafael T. Sereicikas, Gregorio F. Azevedo, Tiago A. Almeida

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: dominated recommender systems, recent years, neural networks, recommender systems, dominated recommender

Notes: Published in SAC'25, 8 pages, 4 figures

Click to view abstract

Abstract:In recent years, neural networks and other complex models have dominated recommender systems, often setting new benchmarks for state-of-the-art performance. Yet, despite these advancements, award-winning research has demonstrated that traditional matrix factorization methods can remain competitive, offering simplicity and reduced computational overhead. Hybrid models, which combine matrix factorization with newer techniques, are increasingly employed to harness the strengths of multiple approaches. This paper proposes a novel ensemble method that unifies user-item and item-item recommendations through a weighted similarity framework to deliver top-N recommendations. Our approach is distinctive in its use of shared user and item embeddings for both recommendation strategies, simplifying the architecture and enhancing computational efficiency. Extensive experiments across multiple datasets show that our method achieves competitive performance and is robust in varying scenarios that favor either user-item or item-item recommendations. Additionally, by eliminating the need for embedding-specific fine-tuning, our model allows for the seamless reuse of hyperparameters from the base algorithm without sacrificing performance. This results in a method that is both efficient and easy to implement. Our open-source implementation is available at this https URL.
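
One way to picture a weighted similarity ensemble over shared embeddings is sketched below. The cosine similarity, the mean aggregation over a user's seen items, the single blending weight `alpha`, and the function names are assumptions chosen for illustration; the paper's actual weighting scheme may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n(user_vec, item_vecs, seen, alpha=0.5, n=2):
    """Rank unseen items by a weighted blend of two similarity signals.

    user_vec:  embedding of the target user.
    item_vecs: dict {item_id: embedding}, sharing one space with users.
    seen:      ids of items the user already interacted with.
    alpha:     weight on the user-item signal (1 - alpha goes to item-item).
    """
    scores = {}
    for item_id, vec in item_vecs.items():
        if item_id in seen:
            continue
        user_item = cosine(user_vec, vec)
        item_item = sum(cosine(item_vecs[s], vec) for s in seen) / max(len(seen), 1)
        scores[item_id] = alpha * user_item + (1 - alpha) * item_item
    # Highest blended score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


# Toy 2-D embeddings (made up for illustration).
items = {"i1": [1.0, 0.0], "i2": [0.9, 0.1], "i3": [0.0, 1.0], "i4": [0.2, 0.8]}
user = [0.0, 1.0]
by_user = top_n(user, items, seen={"i1"}, alpha=1.0)
by_items = top_n(user, items, seen={"i1"}, alpha=0.0)
```

Setting `alpha` to 1 or 0 recovers the pure user-item or pure item-item ranking, so the same embeddings serve both strategies and only the blend changes.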

13. 【2604.15484】vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

Link: https://arxiv.org/abs/2604.15484

Authors: Jayson Steffens

Categories: Information Retrieval (cs.IR)

Keywords: Reciprocal Rank Fusion, Rank Fusion, Reciprocal Rank, local-first document memory, document memory system

Notes

Click to view abstract

Abstract:We present **vstash**, a local-first document memory system that combines vector similarity search with full-text keyword matching via Reciprocal Rank Fusion (RRF) and adaptive per-query IDF weighting. All data resides in a single SQLite file using sqlite-vec for approximate nearest neighbor search and FTS5 for keyword matching. We make four primary contributions. **(1)** Self-supervised embedding refinement via hybrid retrieval disagreement: across 753 BEIR queries on SciFact, NFCorpus, and FiQA, 74.5% produce top-10 disagreement between vector-heavy (vec=0.95, fts=0.05) and FTS-heavy (vec=0.05, fts=0.95) search (per-dataset rates 63.4% / 73.4% / 86.7%, Section 5.2), providing a free training signal without human labels. Fine-tuning BGE-small (33M params) with MultipleNegativesRankingLoss on 76K disagreement triples improves NDCG@10 on all 5 BEIR datasets (up to +19.5% on NFCorpus vs. BGE-small base RRF, Table 6). On 3 of 5 datasets, under different preprocessing, the tuned 33M-parameter pipeline matches or exceeds published ColBERTv2 results (110M params) and an untrained BGE-base (110M); on FiQA and ArguAna it underperforms ColBERTv2 (Section 5.5). **(2)** Adaptive RRF with per-query IDF weighting improves NDCG@10 on all 5 BEIR datasets versus fixed weights (up to +21.4% on ArguAna), achieving 0.7263 on SciFact with BGE-small. **(3)** A negative result on post-RRF scoring: frequency+decay, history-augmented recall, and cross-encoder reranking all failed to improve NDCG. **(4)** A production-grade substrate with integrity checking, schema versioning, ranking diagnostics, and a distance-based relevance signal validated on 50,425 relevance-judged queries across the 5 BEIR datasets. Search latency remains 20.9 ms median at 50K chunks with stable NDCG. The fine-tuned model is published as `Stffens/bge-small-rrf-v2` on HuggingFace. All code, data, and experiments are open-source.
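
As a rough illustration of the fusion step described in the abstract, the sketch below implements weighted Reciprocal Rank Fusion in plain Python. The function name `weighted_rrf`, the smoothing constant `k=60` (a commonly used default for RRF), and the toy document IDs are assumptions for illustration. The vec/fts weights mirror the vector-heavy and FTS-heavy settings quoted in the abstract; the adaptive per-query IDF weighting is not reproduced, and this is not vstash's actual implementation.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    rankings: dict mapping a source name to a list of doc ids, best first.
    weights:  dict mapping the same source names to non-negative weights.
    k:        smoothing constant (60 is a common default; an assumption here).
    """
    scores = defaultdict(float)
    for source, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[source] / (k + rank)
    # Sort by fused score, highest first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy rankings from a vector index and a full-text index.
rankings = {
    "vec": ["d1", "d2", "d3"],
    "fts": ["d3", "d1", "d4"],
}
vector_heavy = weighted_rrf(rankings, {"vec": 0.95, "fts": 0.05})
fts_heavy = weighted_rrf(rankings, {"vec": 0.05, "fts": 0.95})
```

With these toy inputs the two weightings put different documents first, which is the kind of top-k disagreement the abstract reports using as a training signal.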

14. 【2604.15366】OverCite: Add citations in LaTeX without leaving the editor

Link: https://arxiv.org/abs/2604.15366

Authors: Cheyanne Shariat

Categories: Digital Libraries (cs.DL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Adding citations, paper in mind, copying its BibTeX, renaming the cite, cite key

Notes: 3 pages, 1 figure. OverCite is available at [this https URL](https://github.com/cheyanneshariat/OverCite)

Click to view abstract

Abstract:Adding citations while drafting in LaTeX often requires leaving the editor, searching for a paper in mind, copying its BibTeX entry into the project bibliography, renaming the cite key, and then returning to the sentence. OverCite is an open-source, lightweight tool that lets authors find, select, and insert citations without leaving the writing environment. In Overleaf, OverCite uses rough citation placeholders (e.g., `\citep{Perlmutter1999}`) and local sentence context to query ADS/SciX-indexed literature, rank likely matches, and insert the selected reference, without leaving the editor. A companion VS Code extension provides the same functionality for local LaTeX projects. The ADS/SciX database includes astronomy, physics, computer science, mathematics, biology, and all indexed arXiv e-prints, making OverCite useful across a broad range of scientific disciplines.

15. 【2604.15347】SocialWise: LLM-Agentic Conversation Therapy for Individuals with Autism Spectrum Disorder to Enhance Communication Skills

Link: https://arxiv.org/abs/2604.15347

Authors: Albert Tang

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: Autism Spectrum Disorder, million people worldwide, Autism Spectrum, Spectrum Disorder, million people

Notes

Click to view abstract

Abstract:Autism Spectrum Disorder (ASD) affects more than 75 million people worldwide. However, scalable support for practicing everyday conversation is scarce: Low-cost activities such as story reading yield limited improvement. At the same time, effective role-play therapy demands expensive, in-person sessions with specialists. SocialWise bridges this gap through a browser-based application that pairs LLM conversational agents with a therapeutic retrieval augmented generation (RAG) knowledge base. Users select a scenario (e.g., ordering food, joining a group), interact by text or voice, and receive instant, structured feedback on tone, engagement, and alternative phrasing. The SocialWise prototype, implemented with Streamlit, LangChain, and ChromaDB, runs on any computer with internet access, and demonstrates how recent advances in LLM can provide evidence-based, on-demand communication coaching for individuals with ASD.

16. 【2604.15344】To LLM, or Not to LLM: How Designers and Developers Navigate LLMs as Tools or Teammates

Link: https://arxiv.org/abs/2604.15344

Authors: Varad Vishwarupe, Ivan Flechais, Nigel Shadbolt, Marina Jirotka

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large language models, language models, purely technical, Large language, rarely binary

Notes: 6 pages, 2 figures, 1 table

Click to view abstract

Abstract:Large language models (LLMs) are increasingly integrated into design and development workflows, yet decisions about their use are rarely binary or purely technical. We report findings from a constructivist grounded theory study based on interviews with 33 designers and developers across three large technology organisations. Rather than evaluating LLMs solely by capability, participants reasoned about the role an LLM could occupy within a workflow and how that role would interact with existing structures of responsibility and organisational accountability. When LLMs were framed as tools under clear human control, their use was typically acceptable and could be integrated within existing governance structures. When framed as teammates with shared or ambiguous agency, practitioners expressed hesitation, particularly when responsibility for outcomes could not be clearly justified. At the same time, participants also described productive teammate configurations in which LLMs supported collaborative reasoning while remaining embedded within explicit oversight structures. We identify tool and teammate framings as recurring ways in which designers and developers position LLMs relative to human work and present an analytic rubric describing how role framing shapes decision authority, accountability ownership, oversight strategies, and organisational acceptability. By foregrounding design-time reasoning, this work reframes To LLM or Not to LLM as a sociotechnical positioning problem that emerges during system design rather than during post-deployment evaluation.

Computer Vision

1. 【2604.16299】Repurposing 3D Generative Model for Autoregressive Layout Generation

Link: https://arxiv.org/abs/2604.16299

Authors: Haoran Feng, Yifan Niu, Zehuan Huang, Yang-Tian Sun, Chunchao Guo, Yuxin Peng, Lu Sheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: framework that repurposes, layout generation, formulating layout generation, generative models, layout generation performance

Notes: [this https URL](https://fenghora.github.io/LaviGen-Page/)

Click to view abstract

Abstract:We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at this https URL.

2. 【2604.16298】FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

Link: https://arxiv.org/abs/2604.16298

Authors: Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu, Yule Wang, Jieqi Shi, Jing Huo

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: UAV vision-language navigation, UAV vision-language, ambiguous multi-step instructions, requires an agent, navigate complex

Notes: Accepted by CVPR 2026 Findings

Click to view abstract

Abstract:UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: this https URL.

3. 【2604.16284】Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan

Link: https://arxiv.org/abs/2604.16284

Authors: Shivarth Rai, Tejeswar Pokuri

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Atmospheric haze significantly, impeding computer vision, haze significantly degrades, computer vision applications, vision applications critical

Notes: Accepted at CV4Animals Workshop, CVPR 2025

Click to view abstract

Abstract:Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, and LPIPS: 0.1104), delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.
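
The abstract does not detail its physics-based pipeline, so the sketch below only illustrates the widely used atmospheric scattering model, I = J·t + A·(1 − t) with t = exp(−β·d), together with a plain PSNR computation. Assuming the dataset uses a model of this kind is a guess; the function names, the airlight value, the three β values, and the toy image are all hypothetical.

```python
import math


def add_haze(clear, depth, beta, airlight=0.9):
    """Apply I = J * t + A * (1 - t) with t = exp(-beta * depth), per pixel.

    clear:    2-D list of intensities in [0, 1] (the haze-free image J).
    depth:    2-D list of scene depths with the same shape.
    beta:     scattering coefficient; larger means denser haze.
    airlight: global atmospheric light A (value chosen arbitrarily here).
    """
    hazy = []
    for row_j, row_d in zip(clear, depth):
        hazy_row = []
        for j, d in zip(row_j, row_d):
            t = math.exp(-beta * d)  # transmission at this pixel
            hazy_row.append(j * t + airlight * (1.0 - t))
        hazy.append(hazy_row)
    return hazy


def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    diffs = [(r - t) ** 2 for rr, tr in zip(reference, test) for r, t in zip(rr, tr)]
    mse = sum(diffs) / len(diffs)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy 2x2 image and depth map; three hypothetical haze densities per image.
clear = [[0.2, 0.4], [0.6, 0.8]]
depth = [[1.0, 2.0], [3.0, 4.0]]
hazy_versions = [add_haze(clear, depth, beta) for beta in (0.2, 0.5, 1.0)]
scores = [psnr(clear, hazy) for hazy in hazy_versions]
```

In this toy setup, denser haze (larger β) moves every pixel further toward the airlight value, so PSNR against the clear image drops as β grows.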

4. 【2604.16272】VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Link: https://arxiv.org/abs/2604.16272

Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: meet professional requirements, AI-assisted video creation, editing, increasingly practical, professional requirements

Notes

Click to view abstract

Abstract:As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.

5. 【2604.16266】Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement

Link: https://arxiv.org/abs/2604.16266

Authors: Tejeswar Pokuri, Shivarth Rai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low contrast, blurred details, scattering in water, suffer from severe, absorption and scattering

Notes: Accepted at AI4ES Workshop AAAI 2026

Click to view abstract

Abstract:Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scattering in water. While learning-based methods like CNNs and Transformers have shown promise, they face critical limitations: CNNs struggle to model the long-range dependencies needed for non-uniform degradation, and Transformers incur quadratic computational complexity, making them inefficient for high-resolution images. To address these challenges, we propose Hero-Mamba, a novel Mamba-based network that achieves efficient dual-domain learning for underwater image enhancement. Our approach uniquely processes information from both the spatial domain (RGB image) and the spectral domain (FFT components) in parallel. This dual-domain input allows the network to decouple degradation factors, separating color/brightness information from texture/noise. The core of our network utilizes Mamba-based SS2D blocks to capture global receptive fields and long-range dependencies with linear complexity, overcoming the limitations of both CNNs and Transformers. Furthermore, we introduce a ColorFusion block, guided by a background light prior, to restore color information with high fidelity. Extensive experiments on the LSUI and UIEB benchmark datasets demonstrate that Hero-Mamba outperforms state-of-the-art methods. Notably, our model achieves a PSNR of 25.802 and an SSIM of 0.913 on LSUI, validating its superior performance and generalization capabilities.

6. 【2604.16264】Information Router for Mitigating Modality Dominance in Vision-Language Models

Link: https://arxiv.org/abs/2604.16264

Authors: Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: predictions rely disproportionately, Vision Language models, Vision Language, demonstrated strong performance

Notes

Click to view abstract

Abstract:Vision Language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and their signal-to-noise ratios. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently demonstrates more balanced modality contribution, and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.

7. 【2604.16256】Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Link: https://arxiv.org/abs/2604.16256

Authors: Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: recently attracted significant, attracted significant attention, significant attention due, diverse downstream tasks, vision-language models

Notes

Click to view abstract

Abstract:Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at this https URL.

8. 【2604.16248】Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization

Link: https://arxiv.org/abs/2604.16248

Authors: Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual localization pipelines, retrieval-based place recognition, geometry-based visual localization, localization pipelines, traditionally been addressed

Notes: Accepted to the CVPR EarthVision 2026 Workshop

Click to view abstract

Abstract:Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.

9. 【2604.16243】Find, Fix, Reason: Context Repair for Video Reasoning

Link: https://arxiv.org/abs/2604.16243

Authors: Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demands careful regularization, model knowledge boundary, Reinforcement learning, advanced video reasoning, large multi-modal models

Notes: 22 pages, 7 figures, 17 tables. Ongoing work

Click to view abstract

Abstract:Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at this https URL.

10. 【2604.16240】CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

链接https://arxiv.org/abs/2604.16240

作者:Nishq Poorav Desai,Ali Etemad,Michael Greenspan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:global patterns encapsulated, requiring precise temporal, precise temporal prediction, effective TTC forecasting, collision prevention

备注: Accepted to ICPR 2026

点击查看摘要

Abstract:Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and comprehending both local and global patterns encapsulated in a video, both spatially and temporally. To address the multi-scale nature of video, we introduce a novel spatiotemporal hierarchical transformer-based architecture called CollideNet, specifically catered for effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method achieves state-of-the-art performance in comparison to prior works on three commonly used public datasets, setting a new state-of-the-art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentanglement of the trend and seasonality components of the video data. We release our code at this https URL.

11. 【2604.16234】A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection

链接https://arxiv.org/abs/2604.16234

作者:Van-Truong Le,Le-Khanh Nguyen,Trong-Doanh Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Academic integrity continues, Academic integrity, integrity continues, continues to face, face the persistent

备注: 7 pages, 5 figures. Accepted at the FISU Joint Conference on Artificial Intelligence (FJCAI 2026), Vietnam

点击查看摘要

Abstract:Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.
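
As a quick sanity check on the figures quoted in the abstract, the short sketch below recomputes F1 from the reported precision and recall and compares the absolute and relative gain over the 0.82 baseline. Only the four numbers come from the abstract; the variable and function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract.
precision, recall = 0.96, 0.94
accuracy, baseline_accuracy = 0.95, 0.82

f1 = f1_score(precision, recall)
absolute_gain = accuracy - baseline_accuracy        # in accuracy points
relative_gain = absolute_gain / baseline_accuracy   # fraction of the baseline

print(f"F1 = {f1:.3f}")
print(f"absolute gain = {absolute_gain:.2f}")
print(f"relative gain = {relative_gain:.1%}")
```

The reported F1 of 0.95 is consistent with the stated precision and recall. The "13% increase" matches an absolute gain of 13 accuracy points; measured relative to the baseline, the gain is closer to 16%.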

12. 【2604.16231】Dental Panoramic Radiograph Analysis Using YOLO26 From Tooth Detection to Disease Diagnosis

链接https://arxiv.org/abs/2604.16231

作者:Khawaja Azfar Asif,Rafaqat Alam Khan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:minimal radiation exposure, fundamental diagnostic tool, tool in dentistry, offering a comprehensive, radiation exposure

备注

点击查看摘要

Abstract:Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposure. However, manual interpretation is time-consuming and prone to errors, especially in high-volume clinical settings. This creates a pressing need for efficient automated solutions. This study presents the first application of YOLOv26 for automated tooth detection, FDI-based numbering, and dental disease segmentation in panoramic radiographs. The DENTEX dataset was preprocessed using Roboflow for format conversion and augmentation, yielding 1,082 images for tooth enumeration and 1,040 images for disease segmentation across four pathology classes. Five YOLOv26-seg variants were trained on Google Colab using transfer learning at a resolution of 800x800. Results demonstrate that the YOLOv26m-seg model achieved the best performance for tooth enumeration, with a precision of 0.976, recall of 0.970, and box mAP50 of 0.976. It outperformed the YOLOv8x baseline by 4.9% in precision and 3.3% in mAP50, while also enabling high-quality mask-level segmentation (mask mAP50 = 0.970). For disease segmentation, the YOLOv26l-seg model attained a box mAP50 of 0.591 and a mask mAP50 of 0.547. Impacted teeth showed the highest per-class average precision (0.943), indicating that visual distinctiveness influences detection performance more than annotation quantity. Overall, these findings demonstrate that YOLOv26-based models offer a robust and accurate framework for automated dental image analysis, with strong potential to enhance diagnostic efficiency and consistency in clinical practice.

13. 【2604.16214】GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos

链接https://arxiv.org/abs/2604.16214

作者:Deepak Kumar,Abhishek Pratap Singh,Puneet Kumar,Xiaobai Li,Balasubramanian Raman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Understanding affective dynamics, Understanding affective, Group affect, Group Affect Recognition, real-world social systems

备注

点击查看摘要

Abstract:Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. However, computational modeling of group affect in in-the-wild scenarios remains challenging due to limited large-scale annotated datasets and the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. The lack of comprehensive datasets annotated with multimodal and contextual information further limits advances in the field. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present Context-Aware Group Affect Recognition Network (CAGNet) for multimodal context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, comparable to state-of-the-art performance. The dataset and code are available at this http URL.

14. 【2604.16207】AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection

链接https://arxiv.org/abs/2604.16207

作者:Hao Wang,Beichen Zhang,Yanpei Gong,Shaoyi Fang,Zhaobo Qi,Yuanrong Xu,Xinyan Liu,Weigang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Face Forgery Detection, Incremental Face Forgery, forgery types continue, Forgery Detection, Face Forgery

备注

点击查看摘要

Abstract:As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.

15. 【2604.16201】DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

链接https://arxiv.org/abs/2604.16201

作者:Nikhil Behari,Diego Rivero,Luke Apostolides,Suman Ghosh,Paul Pu Liang,Ramesh Raskar

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:robots typically output, mobile devices, devices and robots, robots typically, typically output

备注

点击查看摘要

Abstract:Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.

16. 【2604.16200】Saturation-Aware Space-Variant Blind Image Deblurring

链接https://arxiv.org/abs/2604.16200

作者:Muhammad Z. Alam,Larry Stetsiuk,Arooba Zeshan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:low light conditions, aware space variant, address challenges posed, high dynamic range, saturation aware space

备注: 12 pages, 12 figures

点击查看摘要

Abstract:This paper presents a novel saturation-aware, space-variant blind image deblurring framework designed to address challenges posed by saturated pixels in deblurring under high dynamic range and low-light conditions. The proposed approach effectively segments the image based on blur intensity and proximity to saturation, leveraging a pre-estimated Light Spread Function to mitigate stray light effects. By accurately estimating the true radiance of saturated regions using the dark channel prior, our method enhances the deblurring process without introducing artifacts like ringing. Experimental evaluations on both synthetic and real-world datasets demonstrate that the framework improves deblurring outcomes across various scenarios, showcasing superior performance compared to state-of-the-art saturation-aware and general-purpose methods. This adaptability highlights the framework's potential for integration with existing and emerging blind image deblurring techniques.

17. 【2604.16177】Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement

链接https://arxiv.org/abs/2604.16177

作者:Lorenzo Beltrame,Jules Salzinger,Filip Svoboda,Jasmin Lampert,Phillipp Fanta-Jende,Radu Timofte,Marco Koerner

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:three-stage progressive shadow-removal, progressive shadow-removal pipeline, present a three-stage, three-stage progressive, progressive shadow-removal

备注: 10 pages, 4 figures, 5 tables, accepted at the CVPR 2026 Workshops (NTIRE 2026 Image Shadow Removal Challenge). Code and materials are available at [this https URL](https://github.com/AIT-Assistive-Autonomous-Systems/SGCR-SR)

点击查看摘要

Abstract:We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.

18. 【2604.16175】MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

链接https://arxiv.org/abs/2604.16175

作者:Yi Lin,Yihao Ding,Yonghui Wu,Yifan Peng

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:radiology report generation, iterative verification found, human practice, report generation, generation often suffers

备注: Accepted by ACL 2026 main conference

点击查看摘要

Abstract:Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

19. 【2604.16170】neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

链接https://arxiv.org/abs/2604.16170

作者:Toby Perrett,Matthew Bouchard,William McCarthy

类目:Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)

关键词:expert CAD engineers, CAD models collected, CAD, CAD engineers, CAD editing

备注: Project page: [this https URL](https://autodeskailab.github.io/neuralCAD-Edit)

点击查看摘要

Abstract:We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers, interacting directly with CAD models in CAD software, while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: this https URL

20. 【2604.16147】SWNet: A Cross-Spectral Network for Camouflaged Weed Detection

链接https://arxiv.org/abs/2604.16147

作者:Henry O. Velesaca,Luigi Miranda,Angel D. Sappa

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:dense agricultural environments, network specifically engineered, paper presents SWNet, cross-spectral network specifically, Bimodal Gated Fusion

备注

点击查看摘要

Abstract:This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex crop canopies. The code is available on GitHub: this https URL

21. 【2604.16135】Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions

链接https://arxiv.org/abs/2604.16135

作者:Yue Jiang,Mingyu Yang,Liuyuxin Yang,Yang Xu,Bingxin Yun,Yuhe Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, diverse input modalities, generative motion synthesis, realistic human motions, input modalities

备注: 12 pages, 12 figures, Under review for publication in IEEE Transactions on Visualization and Computer Graphics

点击查看摘要

Abstract:Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.

22. 【2604.16115】From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts

链接https://arxiv.org/abs/2604.16115

作者:Michał Romaszewski,Dominik Kopeć,Michał Cholewa,Katarzyna Kołodziej,Przemysław Głomb,Jan Niedzielko,Jakub Charyton,Justyna Wylazłowska,Anna Jarocińska

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:imbalanced class labels, overlapping light signatures, class labels, overlapping light, Hyperspectral tree species

备注

点击查看摘要

Abstract:Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert, ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy with differences no larger than 15%.

23. 【2604.16114】Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset

链接https://arxiv.org/abs/2604.16114

作者:Yuhai Deng,Huimin She,Wei Shen,Meng Li,Ruoxi Wu,Lunxi Yuan,Xiang Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:photo retouching aims, photo retouching, retouching aims, aims to adapt, Tone style

备注: 33 pages, 14 figures

点击查看摘要

Abstract:Tone style transfer for photo retouching aims to adapt the stylistic tone of the reference image to a given content image. However, the lack of high-quality large-scale triplet datasets with stylized ground truth forces existing methods to rely on self-supervised or proxy objectives, which limits model capability. To mitigate this gap, we design a data construction pipeline to build TST100K, a large-scale dataset of 100,000 content-reference-stylized triplets. At the core of this pipeline, we train a tone style scorer to ensure strict stylistic consistency for each triplet. In addition, existing methods typically extract content and reference features independently and then fuse them in a decoder, which may cause semantic loss and lead to inappropriate color transfer and degraded visual aesthetics. Instead, we propose ICTone, a diffusion-based framework that performs tone transfer in an in-context manner by jointly conditioning on both images, leveraging the semantic priors of generative models for semantic-aware transfer. Reward feedback learning using the tone style scorer is further incorporated to improve stylistic fidelity and visual quality. Experiments demonstrate the effectiveness of TST100K, and ICTone achieves state-of-the-art performance on both quantitative metrics and human evaluations.

24. 【2604.16108】Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation

链接https://arxiv.org/abs/2604.16108

作者:Federico Nocentini,Kwanggyoon Seo,Qingju Liu,Claudio Ferrari,Stefano Berretti,David Ferman,Hyeongwoo Kim,Pablo Garrido,Akin Caliskan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gained significant attention, significant attention due, video games, Speech-Driven Facial Animation, applications in movies

备注: The project website is available at [this https URL](https://fedenoce.github.io/polyglot/)

点击查看摘要

Abstract:Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.

25. 【2604.16099】DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

链接https://arxiv.org/abs/2604.16099

作者:Laziz Hamdi,Amine Tamasna,Thierry Paquet

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:common capture artifacts, condense key transactional, practical extraction requires, Tables condense key, merged cells

备注

点击查看摘要

Abstract:Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at this https URL.
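
The idea of routing arithmetic questions to a deterministic executor can be sketched as follows. This is only an illustration of exact computation over a parsed table, not the Table Router Pipeline itself: the row schema (`role`, `label`, `amount`), the helper names, and the toy amounts are all hypothetical, and in the described pipeline the structured table would come from a VLM rather than being written by hand.

```python
from decimal import Decimal

# Hypothetical parsed table; the schema and values are made up for illustration.
table = [
    {"role": "line_item", "label": "Crown", "amount": Decimal("450.00")},
    {"role": "line_item", "label": "Scaling", "amount": Decimal("80.50")},
    {"role": "total", "label": "Total", "amount": Decimal("530.50")},
]


def sum_by_role(rows, role):
    """Exact sum of the amounts of all rows that carry the given role."""
    return sum((row["amount"] for row in rows if row["role"] == role), Decimal("0"))


def total_is_consistent(rows):
    """Check that the single stated total equals the sum of the line items."""
    stated = [row["amount"] for row in rows if row["role"] == "total"]
    return len(stated) == 1 and stated[0] == sum_by_role(rows, "line_item")


print(sum_by_role(table, "line_item"))
print(total_is_consistent(table))
```

Using `Decimal` instead of floating point keeps currency arithmetic exact, which is the point of handing aggregation and consistency checks to a rule-based step rather than asking a model to add numbers.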

26. 【2604.16086】Stylistic-STORM (ST-STORM): Perceiving the Semantic Nature of Appearance

链接https://arxiv.org/abs/2604.16086

作者:Hamed Ouattara,Pierre Duthon,Pascal Houssam Salmane,Frédéric Bernardin,Omar Ait Aider

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:MoCo or DINO, produce robust representations, self-supervised learning, illustrated by MoCo, dominant paradigms

备注: 20 pages, 16 figures, ICPR 2026 (28th International Conference on Pattern Recognition)

点击查看摘要

Abstract:One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination, or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data), without degrading the semantic performance (F1=80% on ImageNet-1K) of the Content branch, and improves the preservation of critical appearance

27. 【2604.16083】DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics

链接https://arxiv.org/abs/2604.16083

作者:Jieming Yu,Qiuxiao Feng,Zhuohan Wang,Xiaochen Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:realistic fake images, deep generative models, localization methods rely, existing localization methods, realistic fake

备注: Technical report

点击查看摘要

Abstract:With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at this https URL.

28. 【2604.16082】Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model

Link: https://arxiv.org/abs/2604.16082

Authors: Enas E. Ahmed, Salah A. Aly, Mayar Moner

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Acute Myeloid Leukemia, Acute Myeloid, Myeloid Leukemia, challenging task due, AML cells Utilizing

Comments: 6 pages, 10 figures, 2 tables

Abstract: Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses the multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using Hue channel and Otsu thresholding techniques to preprocess the images prior to classification. Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest validation and test accuracy, both reaching 99.3%.
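
The abstract names Otsu thresholding as a preprocessing step but does not describe the authors' pipeline. As a rough illustration only, the textbook form of Otsu's method picks the gray level that maximizes the between-class variance of the two resulting pixel groups. The function name and the toy pixel values below are assumptions made for this sketch, not details from the paper.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with value <= t form the low class; the rest form the high class.
    `pixels` is a flat sequence of integer intensities in [0, levels - 1].
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the low class
    sum0 = 0  # intensity sum of the low class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # low class still empty
        w1 = total - w0
        if w1 == 0:
            break     # high class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy example: two well-separated intensity groups.
toy_pixels = [10] * 50 + [200] * 50
toy_threshold = otsu_threshold(toy_pixels)
toy_mask = [p > toy_threshold for p in toy_pixels]
```

On real images, a library routine such as the Otsu mode of an image-processing package would normally be used instead of a hand-written loop.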

29. 【2604.16079】The Amazing Stability of Flow Matching

Link: https://arxiv.org/abs/2604.16079

Authors: Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: success of deep, generating high-quality, high-quality and diverse, deep generative models, large training datasets

Comments: EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)

Abstract: The success of deep generative models in generating high-quality and diverse samples is often attributed to particular architectures and large training datasets. In this paper, we investigate the impact of these factors on the quality and diversity of samples generated by flow-matching models. Surprisingly, in our experiments on the CelebA-HQ dataset, flow matching remains stable even when pruning 50% of the dataset. That is, the quality and diversity of generated samples are preserved. Moreover, pruning impacts the latent representation only slightly, that is, samples generated by models trained on the full and pruned dataset map to visually similar outputs for a given seed. We observe similar stability when changing the architecture or training configuration, such that the latent representation is maintained under these changes as well. Our results quantify just how strong this stability can be in practice, and help explain the reliability of flow-matching models under various perturbations.

30. 【2604.16070】TableSeq: Unified Generation of Structure, Content, and Layout

Link: https://arxiv.org/abs/2604.16070

Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: framework for joint, joint table structure, cell localization, aligning logical structure, table structure recognition

Comments: none

Abstract: We present TableSeq, an image-only, end-to-end framework for joint table structure recognition, content recognition, and cell localization. The model formulates these tasks as a single sequence-generation problem: one decoder produces an interleaved stream of HTML tags, cell text, and discretized coordinate tokens, thereby aligning logical structure, textual content, and cell geometry within a unified autoregressive sequence. This design avoids external OCR, auxiliary decoders, and complex multi-stage post-processing. TableSeq combines a lightweight high-resolution FCN-H16 encoder with a minimal structure-prior head and a single-layer transformer encoder, yielding a compact architecture that remains effective on challenging layouts. Across standard benchmarks, TableSeq achieves competitive or state-of-the-art results while preserving architectural simplicity. It reaches 95.23 TEDS / 96.83 S-TEDS on PubTabNet, 97.45 TEDS / 98.69 S-TEDS on FinTabNet, and 99.79 / 99.54 / 99.66 precision / recall / F1 on SciTSR under the CAR protocol, while remaining competitive on PubTables-1M under GriTS. Beyond TSR/TCR, the same sequence interface generalizes to index-based table querying without task-specific heads, achieving the best IRDR score and competitive ICDR/ICR performance. We also study multi-token prediction for faster blockwise decoding and show that it reduces inference latency with only limited accuracy degradation. Overall, TableSeq provides a practical and reproducible single-stream baseline for unified table recognition, and the source code will be made publicly available at this https URL.
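
To make the idea of an interleaved stream of tags, text, and discretized coordinate tokens more concrete, here is a minimal sketch of how one table cell might be serialized. The token spelling `<loc_k>`, the bin count, the ordering of fields, and both function names are hypothetical choices for illustration; the abstract does not specify TableSeq's actual vocabulary or layout.

```python
def quantize_box(box, width, height, bins=1000):
    """Map a pixel box (x0, y0, x1, y1) to integer bin indices in [0, bins - 1]."""
    x0, y0, x1, y1 = box

    def to_bin(value, size):
        index = int(value / size * bins)
        return min(max(index, 0), bins - 1)  # clamp values on the image border

    return [to_bin(x0, width), to_bin(y0, height),
            to_bin(x1, width), to_bin(y1, height)]


def serialize_cell(text, box, width, height, bins=1000):
    """Emit one cell as: opening tag, four location tokens, text, closing tag.

    The cell text is kept as a single token here for brevity; a real
    tokenizer would split it further.
    """
    coords = quantize_box(box, width, height, bins)
    return ["<td>"] + [f"<loc_{c}>" for c in coords] + [text, "</td>"]


# Toy example on a 100 x 200 pixel image with only 10 coordinate bins.
cell_tokens = serialize_cell("A", (0, 0, 50, 100), width=100, height=200, bins=10)
```

Because structure, text, and geometry share one token sequence, a single autoregressive decoder can in principle produce all three, which is the property the abstract emphasizes.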

31. 【2604.16067】AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

Link: https://arxiv.org/abs/2604.16067

Authors: Guransh Singh

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adapting pre-trained vision-language, robotic control requires, control requires injecting, requires injecting high-magnitude, flow-matching action expert

Comments: none

Abstract: Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry, the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training, causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
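
The single Gram-Schmidt step mentioned above can be pictured with a small sketch: the component of a task gradient that lies along a reference direction is subtracted, leaving an orthogonal remainder. This is generic linear algebra, not the AEGIS implementation. The abstract does not say whether the projection is applied conditionally, how the per-layer gradients are flattened, or how the reference direction is signed, so the functions and toy vectors below are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_away(task_grad, anchor_grad, eps=1e-12):
    """Remove from task_grad its component along anchor_grad.

    Assumption: the projection is applied unconditionally; a real system
    might apply it only when the two gradients conflict.
    """
    denom = dot(anchor_grad, anchor_grad)
    if denom < eps:
        return list(task_grad)  # no meaningful reference direction
    coeff = dot(task_grad, anchor_grad) / denom
    return [g - coeff * a for g, a in zip(task_grad, anchor_grad)]


def shed_energy_fraction(task_grad, projected_grad):
    """Fraction of squared gradient norm removed by the projection."""
    before = dot(task_grad, task_grad)
    after = dot(projected_grad, projected_grad)
    return 0.0 if before == 0 else 1.0 - after / before


# Toy example with hypothetical 2-D "layer gradients".
toy_task = [3.0, 4.0]
toy_anchor = [1.0, 0.0]
toy_projected = project_away(toy_task, toy_anchor)
toy_shed = shed_energy_fraction(toy_task, toy_projected)
```

In this toy case the projection removes a large share of the gradient because the vectors are low-dimensional and strongly aligned; the abstract's figure of under 1% refers to the authors' own setting.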

32. 【2604.16060】Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Link: https://arxiv.org/abs/2604.16060

Authors: Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Reasoning Models, Multimodal Reasoning, based thinking, logical problem-solving, thinking have revolutionized

Comments: none

Abstract:Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.

33. 【2604.16054】Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Link: https://arxiv.org/abs/2604.16054

Authors: Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian, Tanuja Ganu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, large language models, achieved impressive progress, vision language benchmarks, Multimodal large

Comments: none

Abstract:Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

34. 【2604.16044】Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Link: https://arxiv.org/abs/2604.16044

Authors: Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable performance, Diffusion Probabilistic Models, Probabilistic Models, generative tasks, Diffusion Probabilistic

Comments: Accepted to CVPR 2026, 19 pages, with appendix

Abstract:Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at this https URL.
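
To illustrate what "the SNR of a sample is strictly coupled with its timestep" means during training, the sketch below computes the SNR under the standard DDPM forward process, where SNR(t) = alpha_bar_t / (1 - alpha_bar_t) and alpha_bar_t is the running product of (1 - beta). The linear schedule and its endpoint values follow common DDPM defaults and are assumptions here; the paper's exact formulation and its correction method are not given in the abstract.

```python
def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (common DDPM defaults, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def snr_per_timestep(betas):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for each timestep t."""
    snrs = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # cumulative product of (1 - beta)
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


snr_table = snr_per_timestep(linear_betas())
```

During training, a sample noised to timestep t has exactly the SNR in this table. During sampling, the intermediate sample is produced by an imperfect network, so its actual SNR can drift from the table value for the timestep the network is told; that mismatch is what the abstract calls the SNR-t bias.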

35. 【2604.16034】Ranking XAI Methods for Head and Neck Cancer Outcome Prediction

Link: https://arxiv.org/abs/2604.16034

Authors: Baoqiang Ma, Djennifer K. Madzia-Madzou, Rosa C.J. Kraaijveld, Jin Ouyang

Categories: Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)

Keywords: treatment strategy selection, support personalized treatment, personalized treatment strategy, prognostic outcome prediction, neck cancer

Comments: 4-page conference paper, accepted at IEEE ISBI 2026 (International Symposium on Biomedical Imaging)

Abstract: For head and neck cancer (HNC) patients, prognostic outcome prediction can support personalized treatment strategy selection. Improving prediction performance of HNC outcomes has been extensively explored by using advanced artificial intelligence (AI) techniques on PET/CT data. However, the interpretability of AI remains a critical obstacle for its clinical adoption. Unlike previous HNC studies that empirically selected explainable AI (XAI) techniques, we are the first to comprehensively evaluate and rank 13 XAI methods across 24 metrics, covering faithfulness, robustness, complexity and plausibility. Experimental results on the multi-center HECKTOR challenge dataset show large variations across evaluation aspects among different XAI methods, with Integrated Gradients (IG) and DeepLIFT (DL) consistently obtaining high rankings for faithfulness, complexity and plausibility. This work highlights the importance of comprehensive XAI method evaluation and can be extended to other medical imaging tasks.

36. 【2604.16024】AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis

Link: https://arxiv.org/abs/2604.16024

Authors: Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu, Shuo Ren, Yiru Wang, Rongliang Fu, Tinghuan Chen, Tsung-Yi Ho

Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong problem-solving capabilities, shown strong problem-solving, Vision Language Models, Vision Language, problem-solving capabilities

Comments: none

Abstract:Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a quite complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Due to the complexity of the astronomical imaging process, both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort. This is because the processes in astronomical imaging have complex underlying correlations that significantly influence one another, making the quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experiment results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.

37. 【2604.16011】Breakout-picker: Reducing false positives in deep learning-based borehole breakout characterization from acoustic image logs

Link: https://arxiv.org/abs/2604.16011

Authors: Guangyu Wang, Xiaodong Ma, Xinming Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Geophysics (physics.geo-ph)

Keywords: Breakout-picker, stress-induced spalling, paired zones, false positives, false

Comments: none

Abstract: Borehole breakouts are stress-induced spalling on the borehole wall, which are identifiable in acoustic image logs as paired zones with near-symmetry azimuths, low acoustic amplitudes, and increased borehole radius. Accurate breakout characterization is crucial for in-situ stress analysis. In recent years, deep learning has been introduced to automate the time-consuming and labor-intensive breakout picking process. However, existing approaches often suffer from misclassification of non-breakout features, leading to high false positive rates. To address this limitation, this study develops a deep learning framework, termed Breakout-picker, with a specific focus on reducing false positives in automatic breakout characterization. Breakout-picker reduces false positives through two strategies. First, the training of Breakout-picker incorporates negative samples of non-breakout features, including natural fractures, keyseats, and logging artifacts. They share similar characteristics with breakouts, such as low acoustic amplitude or locally enlarged borehole radius. These negative training samples enable Breakout-picker to better discriminate between true breakouts and similar non-breakout features. Second, candidate breakouts identified by Breakout-picker are further validated by azimuthal symmetry criteria, whereby detections that do not exhibit the near-symmetry characteristics of breakout azimuth are excluded. The performance of Breakout-picker is evaluated using three acoustic image log datasets from different regions. The results demonstrate that Breakout-picker outperforms other automatic methods with higher accuracy and substantially lower false positive rates. By reducing false positives, Breakout-picker enhances the reliability of automatic breakout characterization from acoustic image logs, which in turn benefits in-situ stress analysis based on borehole breakouts.
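
The second strategy, validating candidates with an azimuthal symmetry criterion, can be sketched as a simple angular check: a pair of detected zones is kept only if their center azimuths are roughly 180 degrees apart. The tolerance value and function names below are hypothetical; the abstract does not state the authors' actual criterion or threshold.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in [0, 180] degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_symmetric_pair(az1_deg, az2_deg, tolerance_deg=20.0):
    """True if the two azimuths are about 180 degrees apart.

    tolerance_deg is an illustrative value, not one taken from the paper.
    """
    return abs(angular_difference(az1_deg, az2_deg) - 180.0) <= tolerance_deg


# Toy candidates: (azimuth of zone 1, azimuth of zone 2) in degrees.
toy_candidates = [(10.0, 195.0), (10.0, 100.0), (350.0, 170.0)]
toy_kept = [pair for pair in toy_candidates if is_symmetric_pair(*pair)]
```

A check of this kind would discard features such as a single fracture trace or a one-sided keyseat that resemble a breakout in amplitude or radius but lack a diametrically opposite partner.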

38. 【2604.16010】IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE

Link: https://arxiv.org/abs/2604.16010

Authors: Rikuto Otsuka, Yuho Shoji, Yuka Ogino, Takahiro Toizumi, Atsushi Ito

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper proposes image-adaptive, proposes image-adaptive contrast, image-adaptive contrast limited, contrast limited adaptive, paper proposes

Comments: Accepted to NTIRE 2026 Workshop at CVPR 2026

Abstract:This paper proposes image-adaptive contrast limited adaptive histogram equalization (IA-CLAHE). Conventional CLAHE is widely used to boost the performance of various computer vision tasks and to improve visual quality for human perception in practical industrial applications. CLAHE applies contrast limited histogram equalization to each local region to enhance local contrast. However, CLAHE often leads to over-enhancement, because the contrast-limiting parameter clip limit is fixed regardless of the histogram distribution of each local region. Our IA-CLAHE addresses this limitation by adaptively estimating tile-wise clip limits from the input image. To achieve this, we train a lightweight clip limits estimator with a differentiable extension of CLAHE, enabling end-to-end optimization. Unlike prior learning-based CLAHE methods, IA-CLAHE does not require pre-searched ground-truth clip limits or task-specific datasets, because it learns to map input image histograms toward a domain-invariant uniform distribution, enabling zero-shot generalization across diverse conditions. Experimental results show that IA-CLAHE consistently improves recognition performance, while simultaneously enhancing visual quality for human perception, without requiring any task-specific training data.
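
For readers unfamiliar with the clip limit, the sketch below shows the conventional per-tile operation that CLAHE performs: histogram bins above the limit are clipped, the excess is spread over all bins, and the resulting cumulative distribution becomes the intensity mapping. This is a simplified one-pass version written for illustration; it is not the differentiable extension or the learned estimator described in the abstract, and the function names are assumptions.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins.

    One-pass redistribution: bins may end slightly above the limit again.
    """
    excess = sum(max(h - clip_limit, 0) for h in hist)
    clipped = [min(h, clip_limit) for h in hist]
    bonus = excess / len(hist)
    return [c + bonus for c in clipped]


def equalization_map(hist, levels=256):
    """Build the intensity lookup table from the cumulative histogram."""
    total = sum(hist)
    mapping = []
    running = 0.0
    for h in hist:
        running += h
        mapping.append(round((levels - 1) * running / total))
    return mapping


# Toy tile with 4 gray levels, dominated by the darkest level.
toy_hist = [10, 0, 0, 2]
toy_clipped = clip_histogram(toy_hist, clip_limit=4)
toy_map = equalization_map(toy_clipped, levels=4)
```

A lower clip limit flattens the histogram more before equalization and so produces a gentler mapping. With a fixed limit, every tile is treated the same; estimating the limit per tile, as IA-CLAHE proposes, lets the amount of enhancement follow each tile's histogram.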

39. 【2604.15990】From Vulnerable Data Subjects to Vulnerabilizing Data Practices: Navigating the Protection Paradox in AI-Based Analyses of Platformized Lives

Link: https://arxiv.org/abs/2604.15990

Authors: Delfina S. Martinez Pandiani, Ella Streefkerk, Laurens Naudts, Paula Helm

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: essentialized property, paper traces, traces a conceptual, conceptual shift, shift from understanding

Comments: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada. ACM, New York, NY, USA, 23 pages

Abstract: This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life, a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher's choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform "vulnerable" individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist's request to use computer vision to quantify child presence in monetized YouTube 'family vlogs' for regulatory advocacy. This case reveals a "protection paradox": how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of the AI pipeline to show how granular technical decisions are ethically constitutive. We contribute a reflexive ethics protocol that translates these insights into a reflexive roadmap for research ethics surrounding platformized data subjects. Organized around four critical junctures (dataset design, operationalization, inference, and dissemination), the protocol identifies technical questions and ethical tensions where well-intentioned work can slide into renewed extraction or exposure. For every decision point, the protocol offers specific prompts to navigate four cross-cutting vulnerabilizing factors: exposure, monetization, narrative fixing, and algorithmic optimization. Rather than uncritically...

40. 【2604.15979】MMGait: Towards Multi-Modal Gait Recognition

Link: https://arxiv.org/abs/2604.15979

Authors: Chenye Wang, Qingyuan Cai, Saihui Hou, Aoqi Li, Yongzhen Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring user cooperation, powerful biometric technique, user cooperation, Gait recognition, biometric technique

Comments: CVPR 2026

Abstract:Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at this https URL.

41. 【2604.15967】TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Link: https://arxiv.org/abs/2604.15967

Authors: Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin, Zhengyu Zhao, Le Yang, Chong Zhang, Yang Zhang, Chao Shen

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remarkable synthesis capabilities, content violations remains, persistent challenge, remarkable synthesis, synthesis capabilities

Comments: none

Abstract:Despite the remarkable synthesis capabilities of text-to-image (T2I) models, safeguarding them against content violations remains a persistent challenge. Existing safety alignments primarily focus on explicit malicious concepts, often overlooking the subtle yet critical risks of compositional semantics. To address this oversight, we identify and formalize a novel vulnerability: Multi-Concept Compositional Unsafety (MCCU), where unsafe semantics stem from the implicit associations of individually benign concepts. Based on this formulation, we introduce TwoHamsters, a comprehensive benchmark comprising 17.5k prompts curated to probe MCCU vulnerabilities. Through a rigorous evaluation of 10 state-of-the-art models and 16 defense mechanisms, our analysis yields 8 pivotal insights. In particular, we demonstrate that current T2I models and defense mechanisms face severe MCCU risks: on TwoHamsters, FLUX achieves an MCCU generation success rate of 99.52%, while LLaVA-Guard only attains a recall of 41.06%, highlighting a critical limitation of the current paradigm for managing hazardous compositional generation.

42. 【2604.15948】From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

Link: https://arxiv.org/abs/2604.15948

Authors: Jinhao Shen, Haoqian Du, Xulu Zhang, Xiao-Yong Wei, Qing Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text-guided image editing, Text-guided image, multimedia content creation, modern multimedia content, content creation

Comments: none

Abstract: Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. The adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present an Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at this https URL.

43. 【2604.15946】SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

Link: https://arxiv.org/abs/2604.15946

Authors: Thomas Campagnolo (ACENTAURI), Ezio Malis (ACENTAURI), Philippe Martinet (ACENTAURI), Gaétan Bahl

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: fixed class sets, Open-vocabulary semantic segmentation, segmentation enables models, semantic segmentation enables, class sets

Comments: none

Abstract:Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.

44. 【2604.15941】Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction

Link: https://arxiv.org/abs/2604.15941

Authors: Haato Watanabe, Nobuyuki Umetani

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Recent years, view synthesis, years have witnessed, witnessed the rapid, rapid emergence

Comments: Accepted to CVPR 2026

Abstract: Recent years have witnessed the rapid emergence of 3D Gaussian splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition. To overcome this limitation, we propose neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy that selects mismatched primitives for pruning and cloning based on frequency energy. Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NeRF360, and High-Frequency datasets (e.g., checkered patterns), supported by comprehensive ablation studies.

45. 【2604.15923】Hierarchical Codec Diffusion for Video-to-Speech Generation

Link: https://arxiv.org/abs/2604.15923

Authors: Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen, Zhaoyang Li, Boyuan Cao, Hongming Shan

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generation aims, auditory signals, aims to synthesize, silent video, video without auditory

Comments: CVPR 2026

Abstract:Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at this https URL.

46. 【2604.15917】Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions

Link: https://arxiv.org/abs/2604.15917

Authors: Bo Zhao, Kairui Guo, Runnan Du, Haiyang Sun, Pengshan Wang, Huan Yang, Kun Gai, Yixin Cao, Wei Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Instruction guided image, seemingly simple cases, recent generative models, produce reliable results, advanced substantially

Comments: 9 pages

Abstract:Instruction-guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stems not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by better matching editing tasks to the effective operating regime of existing models.

47. 【2604.15911】Efficient Video Diffusion Models: Advancements and Challenges

Link: https://arxiv.org/abs/2604.15911

Authors: Shitong Shao, Lichen Bai, Pengfei Wan, James Kwok, Zeke Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe inference costs, practical deployment remains, deployment remains constrained, high-fidelity generative video, Video diffusion models

Comments:

Abstract:Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we analyze the algorithmic trends of each paradigm and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.

48. 【2604.15903】AeroDeshadow: Physics-Guided Shadow Synthesis and Penumbra-Aware Deshadowing for Aerospace Imagery

Link: https://arxiv.org/abs/2604.15903

Authors: Wei Lu, Zi-Yang Bo, Fei-Fei Sang, Yi Liu, Xue Yang, Si-Bao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-resolution aerospace imagery, prevalent in high-resolution, aerospace imagery, ASI, Physics-aware Degradation Shadow

Comments: 13 pages, 12 figures

Abstract:Shadows are prevalent in high-resolution aerospace imagery (ASI). They often cause spectral distortion and information loss, which degrade downstream interpretation tasks. While deep learning methods have advanced natural-image shadow removal, their direct application to ASI faces two primary challenges. First, strictly paired training data are severely lacking. Second, homogeneous shadow assumptions fail to handle the broad penumbra transition zones inherent in aerospace scenes. To address these issues, we propose AeroDeshadow, a unified two-stage framework integrating physics-guided shadow synthesis and penumbra-aware restoration. In the first stage, a Physics-aware Degradation Shadow Synthesis Network (PDSS-Net) explicitly models illumination decay and spatial attenuation. This process constructs AeroDS-Syn, a large-scale paired dataset featuring soft boundary transitions. Constrained by this physical formulation, a Penumbra-aware Cascaded DeShadowing Network (PCDS-Net) then decouples the input into umbra and penumbra components. By restoring these regions progressively, PCDS-Net alleviates boundary artifacts and over-correction. Trained solely on the synthetic AeroDS-Syn, the network generalizes to real-world ASI without requiring paired real annotations. Experimental results indicate that AeroDeshadow achieves state-of-the-art quantitative accuracy and visual fidelity across synthetic and real-world datasets. The datasets and code will be made publicly available at: this https URL.

49. 【2604.15893】PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking

Link: https://arxiv.org/abs/2604.15893

Authors: Meng Lv, Yapeng Li, Hang Su, Juhua Liu, Bo Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Intelligent fetal ultrasound, highly promising paradigm, high annotation costs, operator-induced variance make, variance make unsupervised

Comments: 10 pages, 6 figures, 3 tables

Abstract:Intelligent fetal ultrasound (US) interpretation is crucial for prenatal diagnosis, but high annotation costs and operator-induced variance make unsupervised pre-training a highly promising paradigm. However, existing pre-training methods largely ignore US-specific characteristics -- severe data redundancy, fan-shaped locality, and polar coordinate beamforming -- limiting their effectiveness in downstream tasks. To address this, we propose PolarMAE, a novel and efficient pre-training framework tailored for US images. Specifically, to mitigate continuous scanning redundancy, we introduce a Progressive Visual-Semantic Screening (PVSS) that adaptively extracts high-value samples, significantly boosting pre-training efficiency. Furthermore, we design an Acoustic-Bounded Region Constraint (ABRC) to accommodate US locality, forcing the model to focus strictly on valid acoustic regions rather than invalid dark backgrounds. Finally, leveraging the beamforming prior and local details, we propose a Polar-Texture Collaborative Masking (PTCM), enabling the model to capture underlying radial imaging patterns and critical tissue structures. Extensive experiments across diverse datasets and downstream interpretation tasks demonstrate that our method achieves state-of-the-art performance with strong pre-training scalability and efficiency.

50. 【2604.15875】CLOTH-HUGS: Cloth Aware Human Gaussian Splatting

Link: https://arxiv.org/abs/2604.15875

Authors: Sadia Mubashshira, Nazanin Amini, Kevin Desai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Splatting based neural, Gaussian Splatting based, photorealistic clothed human, clothed human reconstruction, explicitly disentangles body

Comments:

Abstract:We present Cloth-HUGS, a Gaussian Splatting based neural rendering framework for photorealistic clothed human reconstruction that explicitly disentangles body and clothing. Unlike prior methods that absorb clothing into a single body representation and struggle with loose garments and complex deformations, Cloth-HUGS represents the performer using separate Gaussian layers for body and cloth within a shared canonical space. The canonical volume jointly encodes body, cloth, and scene primitives and is deformed through SMPL-driven articulation with learned linear blend skinning weights. To improve cloth realism, we initialize cloth Gaussians from mesh topology and apply physics-inspired constraints, including simulation-consistency, ARAP regularization, and mask supervision. We further introduce a depth-aware multi-pass rendering strategy for robust body-cloth-scene compositing, enabling real-time rendering at over 60 FPS. Experiments on multiple benchmarks show that Cloth-HUGS improves perceptual quality and geometric fidelity over state-of-the-art baselines, reducing LPIPS by up to 28% while producing temporally coherent cloth dynamics.

51. 【2604.15871】UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

Link: https://arxiv.org/abs/2604.15871

Authors: Lifan Jiang, Tianrun Wu, Yuhang Pei, Chenyang Wang, Boxi Wu, Deng Cai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: models remains fragmented, remains fragmented, editing models remains, video editing, editing

Comments:

Abstract:The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at this https URL.

52. 【2604.15862】Splats in Splats++: Robust and Generalizable 3D Gaussian Splatting Steganography

Link: https://arxiv.org/abs/2604.15862

Authors: Yijia Guo, Wenkai Huang, Tong Hu, Gaolei Li, Yang Li, Yuxin Hong, Liwen Hu, Xitong Ling, Jianhua Li, Shengbo Chen, Tiejun Huang, Lei Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, striking an unprecedented, computational efficiency, recently redefined, redefined the paradigm

Comments:

Abstract:3D Gaussian Splatting (3DGS) has recently redefined the paradigm of 3D reconstruction, striking an unprecedented balance between visual fidelity and computational efficiency. As its adoption proliferates, safeguarding the copyright of explicit 3DGS assets has become paramount. However, existing invisible message embedding frameworks struggle to reconcile secure and high-capacity data embedding with intrinsic asset utility, often disrupting the native rendering pipeline or exhibiting vulnerability to structural perturbations. In this work, we present Splats in Splats++, a unified and pipeline-agnostic steganography framework that seamlessly embeds high-capacity 3D/4D content directly within the native 3DGS representation. Grounded in a principled analysis of the frequency distribution of Spherical Harmonics (SH), we propose an importance-graded SH coefficient encryption scheme that achieves imperceptible embedding without compromising the original expressive power. To fundamentally resolve the geometric ambiguities that lead to message leakage, we introduce a Hash-Grid Guided Opacity Mapping mechanism. Coupled with a novel Gradient-Gated Opacity Consistency Loss, our formulation enforces a stringent spatial-attribute coupling between the original and hidden scenes, effectively projecting the discrete attribute mapping into a continuous, attack-resilient latent manifold. Extensive experiments demonstrate that our method substantially outperforms existing approaches, achieving up to 6.28 dB higher message fidelity, 3$\times$ faster rendering, and exceptional robustness against aggressive 3D-targeted structural attacks (e.g., GSPure). Furthermore, our framework exhibits remarkable versatility, generalizing seamlessly to 2D image embedding, 4D dynamic scene steganography, and diverse downstream tasks.

53. 【2604.15857】AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

Link: https://arxiv.org/abs/2604.15857

Authors: Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon, Gihwi Kim, Hoon Jin Jung, Jaegul Choo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent digital media, portrait manipulation techniques, digital media advancements, created increasing demands, sophisticated portrait manipulation

Comments: CVPR 2026. Project page: [this https URL](https://keh0t0.github.io/AHS/)

Abstract:Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and in faithfully preserving accessories under significant head pose variations.

54. 【2604.15856】Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

Link: https://arxiv.org/abs/2604.15856

Authors: Irem Ulku, Erdem Akagündüz, Ömer Özgür Tanrıöver

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: challenging atmospheric conditions, data provide complementary, sensing data provide, acquisition issues, real-world deployments

Comments: 15 pages, 7 figures, 9 tables

Abstract:Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at this https URL.

55. 【2604.15853】Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment

Link: https://arxiv.org/abs/2604.15853

Authors: Liwen Yu, Chi Liu, Xiaotong Han, Congcong Zhu, Minghao Wang, Sheng Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Aesthetic Quality Assessment, Automated Aesthetic Quality, Quality Assessment, treats images primarily, static pixel vectors

Comments: Accepted for Poster Presentation at CogSci 2026

Abstract:Automated Aesthetic Quality Assessment (AQA) treats images primarily as static pixel vectors, aligning predictions with human-rating scores largely through semantic perception. However, this paradigm diverges from human aesthetic cognition, which arises from dynamic visual exploration shaped by scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention. We introduce AestheticNet, a novel cognitive-inspired AQA paradigm that integrates human-like visual cognition and semantic perception with a two-pathway architecture. The visual attention pathway, implemented as a gaze-aligned visual encoder (GAVE) pre-trained offline on eye-tracking data using resource-efficient contrast gaze alignment, models attention in the human visual system. This pathway augments the semantic pathway, which uses a fixed semantic encoder such as CLIP, through cross-attention fusion. Visual attention provides a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all of which are determinants of aesthetic perception beyond semantics. Experiments validated by hypothesis testing show a consistent improvement over the semantic-only baselines, and demonstrate that the gaze module is a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA. Our code is available at this https URL.

56. 【2604.15829】Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration

Link: https://arxiv.org/abs/2604.15829

Authors: Jun Li, Lizhi Xiong, Ziqiang Li, Weiwei Jiang, Zhangjie Fu, Yong Li, Guo-Sen Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: large-scale training datasets, inadvertently produce unsafe, implicit biases embedded, undesirable content due, achieved impressive fidelity

Comments: 25 pages, accepted by CVPR 2026

Abstract:Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a text-image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation. Our code is available at this https URL

57. 【2604.15828】SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification

Link: https://arxiv.org/abs/2604.15828

Authors: Alexander Musiat, Nikolas Ebert, Oliver Wasenmüller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong domain shifts, Hyperspectral imaging enables, rich spectral signatures, capturing rich spectral, imaging enables fine-grained

Comments: This paper has been accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026

Abstract:Hyperspectral imaging enables fine-grained recognition of materials by capturing rich spectral signatures, but learning robust classifiers is challenging due to high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts. Beyond earth observation, labeled HSI data is often scarce and imbalanced, motivating compact models for generic hyperspectral classification across diverse acquisition regimes. We propose the lightweight Spectral-Spatial Fusion Transformer (SSFT), which factorizes representation learning into spectral and spatial pathways and integrates them via cross-attention to capture complementary wavelength-dependent and structural information. We evaluate our SSFT on the challenging HSI-Benchmark, a heterogeneous multi-dataset benchmark covering earth observation, fruit condition assessment, and fine-grained material recognition. SSFT achieves state-of-the-art overall performance, ranking first while using less than 2% of the parameters of the previous leading method. We further evaluate transfer to the substantially larger SpectralEarth benchmark under the official protocol, where SSFT remains competitive despite its compact size. Ablation studies show that both spectral and spatial pathways are crucial, with spatial modeling contributing most, and that SSFT remains robust without data augmentation.

58. 【2604.15823】Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

Link: https://arxiv.org/abs/2604.15823

Authors: Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi, Kaiwei Wang, Lin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied robotic agents, Embodied robotic, introducing domain shifts, native cinematic footage, scale variation

Comments: 15 pages

Abstract:Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
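
To put the reported domain gap in perspective, the short calculation below converts the two Macro-F1 figures quoted in the abstract into an absolute and a relative drop. Only the two scores come from the source; the variable names are illustrative.

```python
# Macro-F1 scores quoted in the abstract (cinematic vs. egocentric screen-view evaluation).
cinematic_f1 = 27.99
egocentric_f1 = 16.69

absolute_drop = cinematic_f1 - egocentric_f1   # in Macro-F1 points
relative_drop = absolute_drop / cinematic_f1   # as a fraction of the cinematic score

print(f"absolute drop: {absolute_drop:.2f} points")   # absolute drop: 11.30 points
print(f"relative drop: {relative_drop:.1%}")          # relative drop: 40.4%
```

In other words, the model loses roughly two fifths of its cinematic-footage Macro-F1 when evaluated on egocentric screen-view observations.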

59. 【2604.15814】Continual Hand-Eye Calibration for Open-world Robotic Manipulation

Link: https://arxiv.org/abs/2604.15814

Authors: Fazeng Li, Gan Sun, Chenxi Liu, Yao He, Wei Cong, Yang Cong

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Hand-eye calibration, continual hand-eye calibration, hand-eye calibration framework, open-world environments, replay strategy

Comments:

Abstract:Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data amid open-world scene changes, and simple rehearsal-based continual learning strategies cannot adequately mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework that enables robots to adapt to sequentially encountered open-world manipulation scenes through a spatial-aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) decomposes localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show strong resistance to scene forgetting: the framework maintains accuracy on past scenes while preserving adaptation to new scenes, confirming its effectiveness.
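
The abstract does not say how SARS selects its replay samples, so the following is only a stand-in that illustrates what a "geometrically uniform" buffer could look like: classic farthest point sampling over camera positions. The function name `farthest_point_sampling`, the use of positions alone (orientation is ignored), and the toy trajectory are assumptions, not the paper's method.

```python
import numpy as np

def farthest_point_sampling(positions, k, start=0):
    """Greedily pick k indices so that each new pick is as far as possible from those already chosen."""
    positions = np.asarray(positions, dtype=float)
    chosen = [start]
    # Distance from every frame to its nearest already-chosen frame.
    min_dist = np.linalg.norm(positions - positions[start], axis=1)
    while len(chosen) < min(k, len(positions)):
        nxt = int(np.argmax(min_dist))  # the frame least covered by the buffer so far
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(positions - positions[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen

# Hypothetical trajectory: ten camera positions spaced one unit apart on a line.
camera_positions = np.array([[float(i), 0.0, 0.0] for i in range(10)])
buffer_indices = farthest_point_sampling(camera_positions, k=3)
print(buffer_indices)  # [0, 9, 4]
```

Adjacent frames, which are nearly redundant, are skipped in favor of the two ends and the middle of the trajectory. That is the kind of coverage the abstract describes, though the actual SARS criterion may differ.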

60. 【2604.15809】Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

Link: https://arxiv.org/abs/2604.15809

Authors: Chengxin Liu, Wonseok Choi, Chenshuang Zhang, Tae-Hyun Oh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-Language Models, demonstrated strong capability, document parsing, demonstrated strong, wide range

Comments: CVPR 2026. Project page: [this https URL](https://cxliu0.github.io/AIF/)

Abstract:Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: this https URL.

61. 【2604.15808】Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

Link: https://arxiv.org/abs/2604.15808

Authors: Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao, Yongxin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: medical VLMs produce, VLMs produce predictions, Visual Question Answering, MRI Visual Question, core capabilities

Comments:

Abstract:Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.

62. 【2604.15795】Fed3D: Federated 3D Object Detection

Link: https://arxiv.org/abs/2604.15795

Authors: Suyan Dai, Chenxi Liu, Fazeng Li, Peican Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: augmented reality scenarios, object detection, object detection scenes, robotics manipulation, autonomous driving

Comments:

Abstract:3D object detection models trained on a single server play an important role in autonomous driving, robotic manipulation, and augmented reality scenarios. However, most existing methods face severe privacy concerns when deployed on a multi-robot perception network to explore large-scale 3D scenes. Meanwhile, it is highly challenging to apply conventional federated learning methods to 3D object detection, due to 3D data heterogeneity and limited communication bandwidth. In this paper, we make the first attempt to propose a novel Federated 3D object detection framework (i.e., Fed3D) that enables distributed learning for 3D object detection with privacy preservation. Specifically, the irregular 3D input objects on each local robot and the varying category distributions across robots cause local heterogeneity and global heterogeneity, respectively. To address this 3D data heterogeneity, we propose a local-global class-aware loss that balances the gradient back-propagation rate of different 3D categories from both local and global aspects. To reduce the communication cost in each round, we develop a federated 3D prompt module that learns and communicates only prompts with few learnable parameters. Finally, extensive experiments on federated 3D object detection show that our Fed3D model significantly outperforms state-of-the-art algorithms with lower communication cost when only limited local training data is available.

63. 【2604.15777】SegMix: Shuffle-based Feedback Learning for Semantic Segmentation of Pathology Images

Link: https://arxiv.org/abs/2604.15777

Authors: Zhiling Yan,Sicheng Chen,Tianyi Zhang,Nan Ying,Yanli Lei,Guanglei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: identifies areas affected, diagnosis and treatment, critical task, task in computational, affected by disease

Abstract:Segmentation is a critical task in computational pathology, as it identifies areas affected by disease or abnormal growth and is essential for diagnosis and treatment. However, acquiring high-quality pixel-level supervised segmentation data demands significant effort from experienced pathologists, limiting the application of deep learning. To overcome this challenge, relaxing the label requirement to image-level classification labels allows more data to be used and more scenarios to be covered. One approach is to leverage Class Activation Maps (CAMs) to generate pseudo pixel-level annotations for semantic segmentation from image-level labels alone. However, this method fails to thoroughly explore the essential characteristics of pathology images, and thus identifies only small areas that are insufficient for pseudo masking. In this paper, we propose a novel shuffle-based feedback learning method, inspired by curriculum learning, to generate higher-quality pseudo semantic segmentation masks. Specifically, we perform patch-level shuffling of pathology images, with the model adaptively adjusting the shuffle strategy based on feedback from previous learning. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods on three different datasets.

64. 【2604.15770】PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

Link: https://arxiv.org/abs/2604.15770

Authors: Junjie Wen,Junlin He,Fei Ma,Jinqiang Cui

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: pixel level, scene understanding requires, spatially precise, remaining scalable, scalable when lifted

Comments: Accepted by ICCA 2026

Abstract:Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at this https URL.

65. 【2604.15756】L: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

Link: https://arxiv.org/abs/2604.15756

Authors: Jinlun Ye,Jiang Liao,Runhe Lai,Xinhua Lu,Jiaxin Zhuang,Zhiyong Gan,Ruixuan Wang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: CLIP exhibit strong, Vision-language models, CLIP exhibit, OOD, external OOD labels

Comments: Accepted to CVPR 2026

Abstract:Vision-language models (VLMs) such as CLIP exhibit strong out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress the noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects only reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at this https URL.

66. 【2604.15748】Concept-wise Attention for Fine-grained Concept Bottleneck Models

Link: https://arxiv.org/abs/2604.15748

Authors: Minghong Zhong,Guoshuai Zou,Kanghao Chen,Dexia Chen,Ruixuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recently impressive performance, large pre-trained vision-language, Recently impressive, pre-trained vision-language model, Concept Bottleneck Models

Comments: 10 pages, 7 figures, Accepted by CVPR 2026 Findings

Abstract:Recently, impressive performance has been achieved in Concept Bottleneck Models (CBMs) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e., CLIP). However, two key limitations remain in concept modeling. First, existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Second, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts and leads to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. Then, a novel concept contrastive optimization guides the model to handle the relative importance of the concept scores, enabling concept predictions to faithfully reflect the image content and improving alignment. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The code will be available upon acceptance.

67. 【2604.15736】RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

Link: https://arxiv.org/abs/2604.15736

Authors: Yichen Xu,Yuanhang Liu,Chuhan Wang,Zihan Zhao,jinghan luo,Jianzhe Ma,Wenxuan Wang,Qin Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Large Language, remains insufficiently explored, Multimodal Large

Comments: Work in Progress

Abstract:While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.

68. 【2604.15735】Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval

Link: https://arxiv.org/abs/2604.15735

Authors: Siyuan Wang,Hanchen Gao,Guangming Zhu,Jiang Lu,Yiyue Ma,Tianci Wu,Jincai Huang,Liang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: inherent modality gaps, textual descriptions remains, critical challenge due, Based Image Retrieval, Text Based Image

Comments: Image Retrieval, Hand-drawn Sketch, Multi-stage Cross-modal Feature Alignment

Abstract:Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. Hand-drawn sketches capture complex structural contours but lack color and texture, whereas text provides color and texture but omits spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By combining the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum-learning-driven robustness enhancement module is proposed to improve the model's robustness when handling queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module, which significantly boosts the model's representational power. Finally, we design a multi-stage cross-modal feature alignment mechanism to mitigate the challenges of cross-modal feature alignment. Furthermore, we curate the fine-grained STBIR benchmark dataset to rigorously validate the efficacy of our proposed framework and to provide data support for subsequent related research. Extensive experiments demonstrate that the proposed STBIR framework significantly outperforms state-of-the-art methods.

69. 【2604.15729】MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

Link: https://arxiv.org/abs/2604.15729

Authors: Sicheng Chen,Chad Wong,Tianyi Zhang,Enhui Chai,Zeyu Liu,Fei Xia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enabling cancer diagnosis, Multiple Instance Learning, Slide Image, WSI analysis, computational pathology

Abstract:Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba's SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial locality of tiles within 1D sequences, enhancing the model's spatial perception. Second, we design a hierarchical structure comprising a 1D Gated CNN block based on MambaOut to capture local cellular features, and a BiMamba2 block to aggregate global context, jointly enhancing multi-scale representation. Finally, we implement an asymmetric chunking design, allowing parallel processing during training and chunking-streaming accumulation during inference, minimizing peak memory usage for deployment. Experimental results on five datasets demonstrate that MambaBack outperforms seven state-of-the-art methods. Source code and datasets are publicly available.
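The abstract does not spell out the Hilbert sampling strategy, so the following is only a minimal sketch of the general idea: ordering 2D tile coordinates along a Hilbert curve so that neighbors in the 1D sequence are also neighbors on the slide grid. The function names `hilbert_index` and `hilbert_order`, and the assumption of a square power-of-two grid, are illustrative and not taken from the paper.

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) to its position along a Hilbert curve.

    n is the grid side length and is assumed to be a power of two.
    This is the classic iterative coordinate-to-distance conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(coords, n):
    """Sort (x, y) tile coordinates by their Hilbert-curve position."""
    return sorted(coords, key=lambda c: hilbert_index(n, c[0], c[1]))


if __name__ == "__main__":
    grid = [(x, y) for x in range(4) for y in range(4)]
    print(hilbert_order(grid, 4))
```

Compared with row-major flattening, where the last tile of one row and the first tile of the next are adjacent in the sequence but far apart on the slide, consecutive tiles in this ordering always share an edge on the grid.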

70. 【2604.15723】Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images

Link: https://arxiv.org/abs/2604.15723

Authors: Mathumetha Palani,Kavya Puthumana,Ayantika Das,Ganapathy Krishnamurthi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: made ophthalmologic diagnosis, fundus imaging devices, screening more accessible, handheld fundus imaging, imaging devices

Comments: 5 pages, 2 figures, 1 Table - ISBI IEEE 2025 CONFERENCE

Abstract:The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to the unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and, at inference, restores artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations, and show that diagnostic accuracy increases to 81.17% on an unseen dataset and under multiple artifact conditions.

71. 【2604.15718】NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition

Link: https://arxiv.org/abs/2604.15718

Authors: Junguang Yao,Wenye Liu,Stjepan Picek,Yue Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)

Keywords: Visual speaker recognition, behavior-driven biometric solution, Visual speaker, speaker recognition based, lip motion offers

Abstract:Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features 1) a Temporal-aware Voxel Encoding module with adaptive event weighting, 2) a Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) a Polarity Consistency Regularization mechanism that retains motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at this https URL.

72. 【2604.15711】SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification

Link: https://arxiv.org/abs/2604.15711

Authors: Enhui Chai,Sicheng Chen,Tianyi Zhang,Xingyu Li,Tianxiang Cui

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Regions of Interest, level tasks primarily, capture aggregated patterns, tasks primarily capture, primarily capture aggregated

Abstract:Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.

73. 【2604.15708】APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition

Link: https://arxiv.org/abs/2604.15708

Authors: Geunyoung Jung,Soohong Kim,Inseok Kong,Jiyoung Jung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, advent of deep, deep neural, neural networks, networks has led

Comments: Accepted by CVPR 2026 Findings

Abstract:The advent of deep neural networks has led to remarkable progress in 3D point cloud recognition, but these networks remain vulnerable to adversarial attacks. Although various defense methods have been studied, they suffer from a trade-off between robustness and transferability. We propose Adversarial Point Counterattack (APC) to achieve both simultaneously. APC is a lightweight input-level purification module that generates instance-specific counter-perturbations for each point, effectively neutralizing attacks. Leveraging clean-adversarial pairs, APC enforces geometric consistency in data space and semantic consistency in feature space. To improve generalizability across diverse attacks, we adopt a hybrid training strategy using adversarial point clouds from multiple attack types. Since APC operates purely on input point clouds, it directly transfers to unseen models and defends against attacks targeting them without retraining. At inference, a single APC forward pass provides purified point clouds with negligible time and parameter overhead. Extensive experiments on two 3D recognition benchmarks demonstrate that APC achieves state-of-the-art defense performance. Furthermore, cross-model evaluations validate its superior transferability. The code is available at this https URL.

74. 【2604.15707】LP$^{2}$DH: A Locality-Preserving Pixel-Difference Hashing Framework for Dynamic Texture Recognition

Link: https://arxiv.org/abs/2604.15707

Authors: Ruxin Ding,Jianfeng Ren,Heng Yu,Jiawei Li,Xudong Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extremely high dimensionality, Local Binary Pattern, Binary Pattern, high dimensionality, suffers from extremely

Abstract:Spatiotemporal Local Binary Pattern (STLBP) is a widely used dynamic texture descriptor, but it suffers from extremely high dimensionality. To tackle this, STLBP features are often extracted on three orthogonal planes, which sacrifices inter-plane correlation. In this work, we propose a Locality-Preserving Pixel-Difference Hashing (LP$^{2}$DH) framework that jointly encodes pixel differences in the full spatiotemporal neighbourhood. LP$^{2}$DH transforms Pixel-Difference Vectors (PDVs) into compact binary codes with maximal discriminative power. Furthermore, we incorporate a locality-preserving embedding to maintain the PDVs' local structure before and after hashing. Then, a curvilinear search strategy is utilized to jointly optimize the hashing matrix and binary codes via gradient descent on the Stiefel manifold. After hashing, dictionary learning is applied to encode the binary vectors into codewords, and the resulting histogram is utilized as the final feature representation. The proposed LP$^{2}$DH achieves state-of-the-art performance on three major dynamic texture recognition benchmarks: 99.80% against DT-GoogleNet's 98.93% on UCLA, 98.52% against HoGF$^{3D}$'s 97.63% on DynTex++, and 96.19% against STS's 95.00% on YUPENN. The source code is available at: this https URL.

75. 【2604.15703】P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

Link: https://arxiv.org/abs/2604.15703

Authors: Geunyoung Jung,Soohong Kim,Kyungwoo Song,Jiyoung Jung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world applications, increasingly important, wide range, range of real-world, downstream tasks

Comments: Accepted by ICRA 2026

Abstract:With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P$^3$T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P$^3$T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text in place of hand-crafted ones. Since both prompters operate directly on input data, P$^3$T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at this https URL.

76. 【2604.15681】Self-Supervised Angular Deblurring in Photoacoustic Reconstruction via Noisier2Inverse

Link: https://arxiv.org/abs/2604.15681

Authors: Markus Haltmeier,Nadja Gruber,Gyeongha Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: emerging imaging modality, Photoacoustic tomography, ultrasonic resolution, emerging imaging, imaging modality

Abstract:Photoacoustic tomography (PAT) is an emerging imaging modality that combines the complementary strengths of optical contrast and ultrasonic resolution. A central task is image reconstruction, where measured acoustic signals are used to recover the initial pressure distribution. For ideal point-like or line-like detectors, several efficient and fast reconstruction algorithms exist, including Fourier methods, filtered backprojection, and time reversal. However, when applied to data acquired with finite-size detectors, these methods yield systematically blurred images. Although sharper images can be obtained by compensating for finite-detector effects, supervised learning approaches typically require ground-truth images that may not be available in practice. We propose a self-supervised reconstruction method based on Noisier2Inverse that addresses finite-size detector effects without requiring ground-truth data. Our approach operates directly on noisy measurements and learns to recover high-quality PAT images in a ground-truth-free manner. Its key components are: (i) PAT-specific modeling that recasts the problem as angular deblurring; (ii) a Noisier2Inverse formulation in the polar domain that leverages the known angular point-spread function; and (iii) a novel, statistically grounded early-stopping rule. In experiments, the proposed method consistently outperforms alternative approaches that do not use supervised data and achieves performance close to supervised benchmarks, while remaining practical for real acquisitions with finite-size detectors.

77. 【2604.15679】Hierarchical Active Inference using Successor Representations

Link: https://arxiv.org/abs/2604.15679

Authors: Prashant Rangarajan,Rajesh P. N. Rao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: free energy principle, Active inference, energy principle, understanding perception, inferring actions based

Comments: Accepted for publication in Neural Computation (MIT Press). 82 pages, 29 figures

Abstract:Active inference, a neurally inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a variant of the well-known four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. Our results represent, to our knowledge, the first application of learned hierarchical state and action abstractions to active inference in FEP-based theories of brain function.
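For readers unfamiliar with successor representations, here is a minimal tabular sketch. It shows only the standard closed form M = (I - γP)^(-1) for a fixed-policy transition matrix P and the resulting value computation V = M r. It is not the authors' hierarchical active-inference model, and the function names, the toy chain, and the discount factor are assumptions chosen for illustration.

```python
import numpy as np


def successor_representation(P, gamma=0.95):
    """Closed-form tabular SR: M = (I - gamma * P)^(-1).

    P is a row-stochastic state-transition matrix under a fixed policy.
    M[s, s2] is the expected discounted number of visits to s2 from s.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)


def state_values(M, r):
    """Values follow from the SR and a reward vector: V = M @ r."""
    return M @ np.asarray(r, dtype=float)


if __name__ == "__main__":
    # Hypothetical 3-state chain: 0 -> 1 -> 2, with state 2 absorbing.
    P = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 1]]
    M = successor_representation(P, gamma=0.9)
    print(state_values(M, [0, 0, 1]))
```

Because the reward vector enters only in the final product, the same M can be reused when the goal changes, which is what makes the representation attractive for planning.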

78. 【2604.15678】HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning

Link: https://arxiv.org/abs/2604.15678

Authors: Eunju Lee,MiHyeon Kim,JuneHyoung Kwon,Yoonji Lee,JiHyun Kim,Soojin Jang,YoungBin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pretrained Vision-Language Models, varying visual complexity, balanced data distributions, Pretrained Vision-Language, Vision-Language Models

Comments: Accepted to CVPR 2026. Eunju Lee and MiHyeon Kim contributed equally as co-first authors

Abstract:Pretrained Vision-Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties (directional alignment and covariance-aware magnitude), yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention-adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.
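The abstract does not say how the two measures are combined, so the sketch below is only one plausible illustration of scoring an embedding against class prototypes with both cosine similarity and Mahalanobis distance. The linear mixing weight `alpha`, the single shared covariance matrix, and the function name `hybrid_scores` are all assumptions, not details from the paper.

```python
import numpy as np


def hybrid_scores(x, prototypes, cov, alpha=0.5, eps=1e-6):
    """Score one embedding against class prototypes (higher is closer).

    Mixes cosine similarity (direction) with negative Mahalanobis
    distance (covariance-aware magnitude). The linear mix via `alpha`
    and the shared covariance `cov` are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    P = np.asarray(prototypes, dtype=float)
    cov = np.asarray(cov, dtype=float)

    # Directional alignment between x and every prototype.
    cos = (P @ x) / (np.linalg.norm(P, axis=1) * np.linalg.norm(x) + eps)

    # Mahalanobis distance under a lightly regularized covariance.
    precision = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    diff = P - x
    sq = np.einsum("kd,de,ke->k", diff, precision, diff)
    maha = np.sqrt(np.maximum(sq, 0.0))

    return alpha * cos - (1.0 - alpha) * maha


if __name__ == "__main__":
    protos = [[1.0, 0.0], [0.0, 1.0]]
    scores = hybrid_scores([0.9, 0.1], protos, np.eye(2))
    print(int(np.argmax(scores)))
```

No training is involved here: the prototypes and covariance would be estimated once from frozen embeddings, which matches the training-free setting the abstract describes.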

79. 【2604.15670】PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

Link: https://arxiv.org/abs/2604.15670

Authors: Shuyan Ke,Yifan Mei,Changli Wu,Yonghan Zheng,Jiayi Ji,Liujuan Cao,Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including oblique viewpoints, UAV Reasoning Segmentation, extreme scale variations, data poses distinct, UAV data poses

Comments: Accepted to CVPR 2026 (highlight)

Abstract:Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.

80. 【2604.15665】CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment

Link: https://arxiv.org/abs/2604.15665

Authors: Yan Zhang,Xiong Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)

Keywords: video enables accessible, monocular video enables, enables accessible biomechanical, sports settings, video enables

Abstract:Markerless 3D movement analysis from monocular video enables accessible biomechanical assessment in clinical and sports settings. However, most research-grade pipelines rely on GPU acceleration, limiting deployment on consumer-grade hardware and in low-resource environments. In this work, we optimize a monocular 3D biomechanics pipeline derived from the MonocularBiomechanics framework for efficient CPU-only execution. We apply profiling-driven system optimizations, including model initialization restructuring, elimination of disk I/O serialization, and improved CPU parallelization. Experiments on a consumer workstation (AMD Ryzen 7 9700X CPU) show a 2.47x increase in processing throughput and a 59.6% reduction in total runtime, with initialization latency reduced by 4.6x. Despite these changes, biomechanical outputs remain highly consistent with the baseline implementation (mean joint-angle deviation 0.35$^\circ$, $r=0.998$). These results demonstrate that research-grade vision-based biomechanics pipelines can be deployed on commodity CPU hardware for scalable movement assessment.
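As a quick sanity check on how the two headline numbers relate, a throughput speedup of s on a fixed workload implies a runtime reduction of 1 - 1/s. The sketch below assumes both reported figures refer to the same workload, which the abstract does not state explicitly; the function name is hypothetical.

```python
def runtime_reduction_from_speedup(speedup):
    """Fractional runtime reduction implied by a throughput speedup on a fixed workload."""
    return 1.0 - 1.0 / speedup

# 2.47x is the throughput gain quoted in the abstract.
reduction = runtime_reduction_from_speedup(2.47)
print(f"{reduction:.1%}")  # prints 59.5%
```

The implied 59.5% is within rounding of the reported 59.6% reduction, so the two figures are mutually consistent under that assumption.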

81. 【2604.15654】From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark

Link: https://arxiv.org/abs/2604.15654

Authors: Chen Zhao, Yunzhe Xu, Zhizhou Chen, Enxuan Gu, Kai Zhang, Xiaoming Liu, Jian Yang, Ying Tai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high spatial resolution, poses unique challenges, unique challenges due, fine-grained structures present, restoration poses unique

Notes: TPAMI

Abstract:Ultra-high-definition (UHD) image restoration poses unique challenges due to the high spatial resolution, diverse content, and fine-grained structures present in UHD images. To address these issues, we introduce a progressive spectral decomposition for the restoration process, decomposing it into three stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Based on this formulation, we propose a novel framework, ERR, which integrates three cooperative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). The ZFE incorporates global priors to learn holistic mappings, the LFR reconstructs the main content by focusing on coarse-scale information, and the HFR adopts our proposed frequency-windowed Kolmogorov-Arnold Network (FW-KAN) to recover fine textures and intricate details for high-fidelity restoration. To further advance research in UHD image restoration, we also construct a large-scale, high-quality benchmark dataset, LSUHDIR, comprising 82,126 UHD images with diverse scenes and rich content. Our proposed methods demonstrate superior performance across a range of UHD image restoration tasks, and extensive ablation studies confirm the contribution and necessity of each module. Project page: this https URL.
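To make the zero-, low-, and high-frequency split concrete, here is a minimal 1D sketch using a plain discrete Fourier transform. It only illustrates the band decomposition, not the ERR networks: the paper operates on images with learned sub-networks, and the `cutoff` parameter and function names here are assumptions for illustration.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT; returns the real part, valid when the spectrum is conjugate-symmetric."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def split_bands(signal, cutoff):
    """Split a real signal into zero-frequency, low-frequency, and high-frequency parts."""
    n = len(signal)
    spectrum = dft(signal)
    zero, low, high = [0j] * n, [0j] * n, [0j] * n
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency for real input
        if freq == 0:
            zero[k] = coeff   # DC term: the global mean
        elif freq <= cutoff:
            low[k] = coeff    # coarse structure
        else:
            high[k] = coeff   # fine detail
    return idft(zero), idft(low), idft(high)

signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 0.0]
zero_part, low_part, high_part = split_bands(signal, cutoff=2)
reconstructed = [z + l + h for z, l, h in zip(zero_part, low_part, high_part)]
```

Because the three masks partition the spectrum, the parts sum back to the input, and the zero-frequency part is a constant equal to the signal mean, which is why it captures only global information such as overall brightness.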

82. 【2604.15652】Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

Link: https://arxiv.org/abs/2604.15652

Authors: Bingyu Li, Tao Huo, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains underexplored due, remains underexplored, underexplored due, due to fragmented, reflect realistic geospatial

Abstract:Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at this https URL (LiBingyu01/RSKT-Seg/tree/Pi-Seg).

83. 【2604.15651】SPLIT: Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography

Link: https://arxiv.org/abs/2604.15651

Authors: Markus Haltmeier, Lukas Neumann, Nadja Gruber, Gyeongha Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved impressive performance, requires paired measurements, Machine learning, training requires paired, learning has achieved

Abstract:Machine learning has achieved impressive performance in tomographic reconstruction, but supervised training requires paired measurements and ground-truth images that are often unavailable. This has motivated self-supervised approaches, which have primarily addressed denoising and, more recently, linear inverse problems. We address nonlinear inverse problems and introduce SPLIT (Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography), a self-supervised machine-learning framework for reconstructing images from nonlinear, incomplete, and noisy projection data without any samples of ground-truth images. SPLIT enforces cross-partition consistency and measurement-domain fidelity while exploiting complementary information across multiple partitions. Our main theoretical result shows that, under mild conditions, the proposed self-supervised objective is equivalent to its supervised counterpart in expectation. We regularize training with an automatic stopping rule that halts optimization when a no-reference image-quality surrogate saturates. As a concrete application, we derive SPLIT variants for multispectral computed tomography. Experiments on sparse-view acquisitions demonstrate high reconstruction quality and robustness to noise, surpassing classical iterative reconstruction and recent self-supervised baselines.

84. 【2604.15648】HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

Link: https://arxiv.org/abs/2604.15648

Authors: Yanbin Wei, Chun Kang, Siwei Li, Haoxuan Che, Yang Chen, Hua Liu, Jian Liu, Zhuang Liu, Can Ouyang, Fei Xing, Lei Sha, Rui Liu, Yu Zhang, James Kwok

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, consistently require, require new arenas

Notes: Under review; to be open-sourced after acceptance

Abstract:Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet no benchmark exists to delineate the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, we introduce $\texttt{HyperGVL}$, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. $\texttt{HyperGVL}$ provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router, $\texttt{WiseHyGR}$, that improves LVLMs on hypergraph tasks by learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.

85. 【2604.15631】Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification

Link: https://arxiv.org/abs/2604.15631

Authors: Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Yu Yuan, Xinbo Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: all-day surveillance, static images, critical technique, technique for all-day, information provides additional

Notes: Submitted to IEEE TIFS

Abstract:Video-based visible-infrared person re-identification (VVI-ReID) is a critical technique for all-day surveillance, where temporal information provides additional cues beyond static images. However, existing approaches rely heavily on fully supervised learning with expensive cross-modality annotations, limiting scalability. To address this issue, we investigate Unsupervised Learning for VVI-ReID (USL-VVI-ReID), which learns identity-discriminative representations directly from unlabeled video tracklets. Directly extending image-based USL-VI-ReID methods to this setting with generic pretrained encoders leads to suboptimal performance. Such encoders suffer from weak identity discrimination and strong modality bias, resulting in severe intra-modality identity confusion and pronounced clustering granularity imbalance between visible and infrared modalities. These issues jointly degrade pseudo-label reliability and hinder effective cross-modality alignment. To address these challenges, we propose a Causal Bootstrapped Alignment (CBA) framework that explicitly exploits inherent video priors. First, we introduce Causal Intervention Warm-up (CIW), which performs sequence-level causal interventions by leveraging temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics, yielding cleaner representations for unsupervised clustering. Second, we propose Prototype-Guided Uncertainty Refinement (PGUR), which employs a coarse-to-fine alignment strategy to resolve cross-modality granularity mismatch, reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks demonstrate that CBA significantly outperforms existing USL-VI-ReID methods when extended to the USL-VVI-ReID setting.

86. 【2604.15628】SIMMER: Cross-Modal Food Image-Recipe Retrieval via MLLM-Based Embedding

Link: https://arxiv.org/abs/2604.15628

Authors: Keisuke Gomi, Keiji Yanai

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Cross-modal retrieval, Multimodal Large Language, dietary logging, nutritional management, Single Integrated Multimodal

Notes: 20 pages, 6 figures

Abstract:Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
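The R@1 figures quoted above are recall-at-K scores. A minimal sketch of how such a score is typically computed from an image-to-recipe similarity matrix follows; it assumes the matching recipe for image i sits at index i, and the function name and toy matrix are illustrative rather than taken from the paper.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item appears in the top-k results.

    similarity[i][j] is the score between image i and recipe j;
    the ground-truth pair for image i is assumed to be recipe i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix: image 1 is mistakenly closest to recipe 2.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(toy_similarity, 1))
print(recall_at_k(toy_similarity, 2))
```

In the 1k and 10k settings, the same computation is run over candidate pools of 1,000 and 10,000 recipes respectively, so the larger pool is the harder test.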

87. 【2604.15622】AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

Link: https://arxiv.org/abs/2604.15622

Authors: Yiwei Zhao, Yi Zheng, Huapeng Su, Jieyu Lin, Stefano Ambrogio, Cijo Jose, Michaël Ramamonjisoa, Patrick Labatut, Barbara De Salvo, Chiao Liu, Phillip B. Gibbons, Ziyun Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: versatile visual understanding, Language-aligned vision foundation, enable versatile visual, power constraints, versatile visual

Abstract:Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to $7.9\%$ in acc@1 on IN1K and $5.2\%$ mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to $77.9\%$.

88. 【2604.15612】GaussianFlow SLAM: Monocular Gaussian Splatting SLAM Guided by GaussianFlow

Link: https://arxiv.org/abs/2604.15612

Authors: Dong-Uk Seo, Jinwoo Jeon, Eungchang Mason Lee, Hyun Myung

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently gained traction, photo-realistic scene modeling, compelling map representation, enabling dense, SLAM systems

Notes: 8 pages, 5 figures, 7 tables, accepted to IEEE RA-L

Abstract:Gaussian splatting has recently gained traction as a compelling map representation for SLAM systems, enabling dense and photo-realistic scene modeling. However, its application to monocular SLAM remains challenging due to the lack of reliable geometric cues from monocular input. Without geometric supervision, mapping or tracking can fall into local minima, resulting in structural degeneracies and inaccuracies. To address this challenge, we propose GaussianFlow SLAM, a monocular 3DGS-SLAM that leverages optical flow as a geometry-aware cue to guide the optimization of both the scene structure and camera poses. By encouraging the projected motion of Gaussians, termed GaussianFlow, to align with the optical flow, our method introduces consistent structural cues to regularize both map reconstruction and pose estimation. Furthermore, we introduce normalized error-based densification and pruning modules to refine inactive and unstable Gaussians, thereby contributing to improved map quality and pose accuracy. Experiments conducted on public datasets demonstrate that our method achieves superior rendering quality and tracking accuracy compared with state-of-the-art algorithms. The source code is available at: this https URL.

89. 【2604.15611】CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder

Link: https://arxiv.org/abs/2604.15611

Authors: Duy-Phuong Dao, Muhammad Taqiyuddin, Jahae Kim, Sang-Heon Lee, Hye-Won Jung, Jaehoo Choi, Hyung-Jeong Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: magnetic resonance imaging, resonance imaging scans, high quality brain, quality brain magnetic, brain magnetic resonance

Notes: 18 pages, 5 figures, 5 tables

Abstract:Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB (Controllable Longitudinal brain Image generation via state space based latent diffusion model), an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages a state space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.

90. 【2604.15609】Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models

Link: https://arxiv.org/abs/2604.15609

Authors: Yunbei Zhang, Shuaicheng Niu, Chengyi Cai, Feng Liu, Jihun Hamm

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: largely unexplored challenge, black-box models accessible, Efficient Test-time Adaptation, Test-Time Adaptation, remains a largely

Notes: Third Workshop on Test-Time Updates (Oral)

Abstract:Test-Time Adaptation (TTA) for black-box models accessible only via APIs remains a largely unexplored challenge. Existing approaches such as post-hoc output refinement offer limited adaptive capacity, while Zeroth-Order Optimization (ZOO) enables input-space adaptation but faces high query costs and optimization challenges in the unsupervised TTA setting. We introduce BETA (Black-box Efficient Test-time Adaptation), a framework that addresses these limitations by employing a lightweight, local white-box steering model to create a tractable gradient pathway. Through a prediction harmonization technique combined with consistency regularization and prompt learning-oriented filtering, BETA enables stable adaptation with no additional API calls and negligible latency beyond standard inference. On ImageNet-C, BETA achieves a +7.1% accuracy gain on ViT-B/16 and +3.4% on CLIP, surpassing strong white-box and gray-box methods including TENT and TPT. On a commercial API, BETA achieves comparable performance to ZOO at 250x lower cost while maintaining real-time inference speed, establishing it as a practical and efficient solution for real-world black-box TTA.

91. 【2604.15556】Learning Affine-Equivariant Proximal Operators

Link: https://arxiv.org/abs/2604.15556

Authors: Oriel Savir, Zhenghan Fang, Jeremias Sulam

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including solving ill-posed, ill-posed inverse problems, solving ill-posed inverse, Learned Proximal Networks, Learned Proximal

Notes: 9 pages, 4 figures, Accepted at ICASSP 2026

Abstract:Proximal operators are fundamental across many applications in signal processing and machine learning, including solving ill-posed inverse problems. Recent work has introduced Learned Proximal Networks (LPNs), providing parametric functions that compute exact proximals for data-driven and potentially non-convex regularizers. However, in many settings it is important to impose additional structure on these regularizers (and their corresponding proximals), such as shift and scale equivariance. In this work, we show how to obtain learned functions parametrized by neural networks that provably compute exact proximal operators while being equivariant to shifts and scaling, which we dub Affine-Equivariant Learned Proximal Networks (AE-LPNs). We demonstrate our results on synthetic, constructive examples, and then on real data via denoising in out-of-distribution settings. Our equivariant learned proximals enhance robustness to noise distributions and affine shifts far beyond training distributions, improving the practical utility of learned proximal operators.

92. 【2604.15555】CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Link: https://arxiv.org/abs/2604.15555

Authors: Hexin Dong, Yi Lin, Pengyu Zhou, Fengnian Zhao, Alan Clint Legasto, Juno Cho, Dohui Kim, Justin Namuk Kim, Mingeon Kim, Sunwoo Kwak, Gabriel Moyà-Alcover, Ky Trung Nguyen, Thanh-Huy Nguyen, Ha-Hieu Pham, Huy-Hieu Pham, Huy Le Pham, Nikhileswara Rao Sulake, Aina Tur-Serrano, Ruichi Zhang, Ang Zu, Adam E. Flanders, Zhiyong Lu, Ronald M. Summers, Mingquan Lin, Hao Chen, Yuzhe Yang, George Shih, Yifan Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: NIH Chest X-ray, Chest X-ray, interpretation is hindered, distribution of pathologies, CXR-LT

Notes: 25 pages, 6 figures

Abstract:Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper summarizes the overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.

93. 【2604.15542】UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation

Link: https://arxiv.org/abs/2604.15542

Authors: Kyle Lucke, Zuzanna Krajewska-Travar, Shoukun Sun, Lu Cai, John D. Stempien, Min Xian

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Tristructural isotropic, fuels undergo dimensional, high-temperature neutron irradiation, coated particle fuels, undergo dimensional

Abstract:Tristructural isotropic (TRISO)-coated particle fuels undergo dimensional changes and chemical reactions during high-temperature neutron irradiation. Post-irradiation materialography helps understand processes that impact fuel performance, such as coating integrity and fission product retention. Conventionally, experts manually evaluate features in thousands of cross sections of sub-mm-sized samples, which is tedious and subjective. In this work, we propose UA-Net, a deep learning framework that segments five characteristic regions of TRISO fuel micrographs and generates an uncertainty map for predictions. The model uses a multi-stage pretraining strategy, starting with general image representations learned from ImageNet, followed by fine-tuning on TRISO micrographs from various irradiation experiments and AGR-5/6/7 particle cross sections. A meta-model for uncertainty prediction is integrated to identify small defects in TRISO images. UA-Net was evaluated on a test set of 102 images, achieving mean Intersection over Union (mIoU) and mean Precision (mP) of 95.5% and 97.3%, respectively. The meta-model achieved a specificity of 91.8% and sensitivity of 93.5%, demonstrating strong performance in detecting misclassifications. The model was also applied to new TRISO images for qualitative evaluation, showing high accuracy in extracting layer regions.

94. 【2604.15521】Frequency-Aware Flow Matching for High-Quality Image Generation

Link: https://arxiv.org/abs/2604.15521

Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: progressively adds Gaussian, adds Gaussian noise, Flow matching, adds Gaussian, Flow matching models

Notes: Accepted by CVPR 2026

Abstract:Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at this https URL.

95. 【2604.15495】GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

Link: https://arxiv.org/abs/2604.15495

Authors: Shivendra Agrawal, Bradley Hayes

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)

Keywords: densely packed environments, Navigating complex, significant spatial grounding, densely packed, retail stores

Abstract:Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.

96. 【2604.15494】ProtoTTA: Prototype-Guided Test-Time Adaptation

Link: https://arxiv.org/abs/2604.15494

Authors: Mohammad Mahdi Abootorabi, Parvin Mousavi, Purang Abolmaesumi, Evan Shelhamer

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: input-have gained significant, gained significant attention, balancing high accuracy, model input-have gained, rely on prototypes-interpretable

Notes: ICLR 2026 Test-Time Updates (TTU) Workshop

Abstract:Deep networks that rely on prototypes (interpretable representations that can be related to the model input) have gained significant attention for balancing high accuracy with inherent interpretability, which makes them suitable for critical domains such as healthcare. However, these models are limited by their reliance on training data, which hampers their robustness to distribution shifts. While test-time adaptation (TTA) improves the robustness of deep networks by updating parameters and statistics, the prototypes of interpretable models have not been explored for this purpose. We introduce ProtoTTA, a general framework for prototypical models that leverages intermediate prototype signals rather than relying solely on model outputs. ProtoTTA minimizes the entropy of the prototype-similarity distribution to encourage more confident and prototype-specific activations on shifted data. To maintain stability, we employ geometric filtering to restrict updates to samples with reliable prototype activations, regularized by prototype-importance weights and model-confidence scores. Experiments across four prototypical backbones on four diverse benchmarks spanning fine-grained vision, histopathology, and NLP demonstrate that ProtoTTA improves robustness over standard output entropy minimization while restoring correct semantic focus in prototype activations. We also introduce novel interpretability metrics and a vision-language model (VLM) evaluation framework to explain TTA dynamics, confirming ProtoTTA restores human-aligned semantic focus and correlates reliably with VLM-rated reasoning quality. Code is available at: this https URL.
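The quantity being minimized, the entropy of a prototype-similarity distribution, can be sketched in a few lines. This is a minimal illustration rather than the ProtoTTA objective: turning similarities into a distribution with a softmax, the `temperature` parameter, and the function names are assumptions, and the filtering and weighting terms mentioned in the abstract are omitted.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_entropy(similarities, temperature=1.0):
    """Shannon entropy (in nats) of the softmax-normalized prototype similarities."""
    probs = softmax(similarities, temperature)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A sample that activates one prototype strongly has low entropy;
# a sample that activates all prototypes about equally has high entropy.
confident = prototype_entropy([5.0, 0.0, 0.0, 0.0])
diffuse = prototype_entropy([1.0, 0.9, 1.1, 1.0])
print(confident < diffuse)
```

A test-time adaptation loop would use a value like this as a loss and take gradient steps on selected parameters, which pushes shifted inputs toward sharper, more prototype-specific activations.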

97. 【2604.15453】(1D) Ordered Tokens Enable Efficient Test-Time Search

Link: https://arxiv.org/abs/2604.15453

Authors: Zhitong Gao, Parham Rezaei, Ali Cy, Mingqiao Ye, Nataša Jovanović, Jesse Allardice, Afshin Dehghan, Amir Zamir, Roman Bachmann, Oğuzhan Fatih Kar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: converting raw data, component of autoregressive, converting raw, units for modeling, key component

Comments: Project page: [this https URL](https://soto.epfl.ch/)

Click to view abstract

Abstract:Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
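
Of the search algorithms named in this abstract, best-of-N is the simplest to sketch: draw N candidates and keep the one the verifier scores highest. The toy generator and verifier below are stand-ins invented for illustration; they are not the paper's AR model or image-text verifier.

```python
import random

def best_of_n(generate, verify, n, seed=0):
    """Sample n candidates and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins (assumptions): a "generation" is a list of 4 token ids,
# and the verifier rewards sequences whose tokens sum close to a target.
def toy_generate(rng):
    return [rng.randrange(10) for _ in range(4)]

def toy_verify(tokens, target=30):
    return -abs(sum(tokens) - target)

best_1 = best_of_n(toy_generate, toy_verify, n=1)
best_64 = best_of_n(toy_generate, toy_verify, n=64)

# With a shared seed, the n=64 pool contains the n=1 candidate,
# so the verifier score can only stay the same or improve.
print(toy_verify(best_1), toy_verify(best_64))
```

Beam search and lookahead search differ in that they score partial sequences, which is where the abstract argues coarse-to-fine token order helps, since intermediate states are then meaningful to the verifier.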

98. 【2604.15451】Weak-to-Strong Knowledge Distillation Accelerates Visual Learning

Link: https://arxiv.org/abs/2604.15451

Authors: Baiang Li, Wenhao Chai, Felix Heide

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale visual learning, Large-scale visual, increasingly limited, Large-scale, training cost

Comments: 18 pages, 7 figures

Click to view abstract

Abstract:Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead investigate distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns it off once the student reaches and surpasses teacher-level performance. For ImageNet and CIFAR classification, this strategy reaches target thresholds much earlier, with up to 4.8 times speedup measured by epochs. We confirm that the method generalizes to other tasks and report 1.7 times epoch speedup for object detection on the COCO dataset, and 2.5 times earlier target-FID crossing for diffusion generation on the CIFAR-10 dataset, measured in steps. These findings validate our method as a universal speedup mechanism for visual learning.
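
The recipe in this abstract (distill early, then switch distillation off once the student passes the frozen teacher) can be sketched as a small gate around the loss. This is a minimal sketch under stated assumptions: the class name, the use of validation accuracy as the comparison metric, the additive loss, and the one-way latch are all illustrative choices, not the authors' implementation.

```python
class DistillationGate:
    """Keeps distillation on until the student first beats the teacher."""

    def __init__(self, teacher_accuracy, kd_weight=1.0):
        self.teacher_accuracy = teacher_accuracy  # fixed: the teacher is frozen
        self.kd_weight = kd_weight
        self.active = True

    def update(self, student_accuracy):
        # Latch off: once the student surpasses the teacher, never re-enable.
        if student_accuracy > self.teacher_accuracy:
            self.active = False

    def total_loss(self, task_loss, kd_loss):
        weight = self.kd_weight if self.active else 0.0
        return task_loss + weight * kd_loss

# Hypothetical per-epoch student accuracies against a 0.70 teacher.
gate = DistillationGate(teacher_accuracy=0.70)
history = []
for student_accuracy in [0.40, 0.65, 0.72, 0.69]:
    gate.update(student_accuracy)
    history.append(gate.active)

print(history)  # [True, True, False, False]
```

The latch matters in the last epoch: the student dips back to 0.69, but distillation stays off, so the weaker teacher no longer constrains the student.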

99. 【2604.15377】M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention

Link: https://arxiv.org/abs/2604.15377

Authors: Sanjeev Panta, Rhett M Morvant, Xu Yuan, Li Chen, Nian-Feng Tzeng

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: water resource management, Accurate and timely, resource management, timely rainfall nowcasting, crucial for disaster

Comments: Accepted at IEEE International Conference on Multimedia and Expo (ICME) 2026

Click to view abstract

Abstract:Accurate and timely rainfall nowcasting is crucial for disaster mitigation and water resource management. Despite recent advances in deep learning, precipitation prediction remains challenging due to limitations in effectively leveraging diverse multimedia data sources. We introduce M3R, a Meteorology-informed MultiModal attention-based architecture for direct Rainfall prediction that synergistically combines visual NEXRAD radar imagery with numerical Personal Weather Station (PWS) measurements, using a comprehensive pipeline for temporal alignment of heterogeneous meteorological data. With specialized multimodal attention mechanisms, M3R novelly leverages weather station time series as queries to selectively attend to spatial radar features, enabling focused extraction of precipitation signatures. Experimental results for three spatial areas of 100 km × 100 km centered at NEXRAD radar stations demonstrate that M3R outperforms existing approaches, achieving substantial improvements in accuracy, efficiency, and precipitation detection capabilities. Our work establishes new benchmarks for multimedia-based precipitation nowcasting and provides practical tools for operational weather prediction systems. The source code is available at this https URL

100. 【2604.15376】Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

Link: https://arxiv.org/abs/2604.15376

Authors: Keon Kim, Krish Chelikavada

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multi-step zoom-in pipelines, GUI grounding, Multi-step zoom-in, zoom-in pipelines, pipelines are widely

Comments:

Click to view abstract

Abstract:Multi-step zoom-in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model's step-2 prediction and the crop center. Unlike log-probabilities or token-level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step-1 spatial error under idealized conditions (perfect step-2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = -0.14, p < 10^{-6} for KV-Ground-8B; rho = -0.11, p = 0.0003 for Qwen3.5-27B). The correlation is small but consistent across models, application categories, and operating systems. As a proof-of-concept, we use zoom consistency to route between a specialist and generalist model, capturing 16.5% of the oracle headroom between them (+0.8%, McNemar p = 0.19). Code is available at this https URL.
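
Zoom consistency as defined in this abstract is a plain geometric distance, so it is easy to sketch. The function name, the `(left, top, right, bottom)` box layout, and the choice of full-image pixel coordinates are assumptions made for illustration; the paper's own code may use a different convention.

```python
import math

def zoom_consistency(step2_xy, crop_box):
    """Euclidean distance from the step-2 prediction to the crop center.

    step2_xy: (x, y) step-2 prediction, remapped to full-image coordinates.
    crop_box: (left, top, right, bottom) of the crop in the same coordinates.
    """
    left, top, right, bottom = crop_box
    center_x = (left + right) / 2.0
    center_y = (top + bottom) / 2.0
    return math.hypot(step2_xy[0] - center_x, step2_xy[1] - center_y)

crop = (100, 200, 300, 400)  # hypothetical crop, centered at (200, 300)
print(zoom_consistency((200, 300), crop))  # 0.0
print(zoom_consistency((203, 304), crop))  # 5.0
```

If the crop is centered on the step-1 prediction (an assumption here) and step 2 lands exactly on the target, this distance equals the step-1 error, which matches the idealized conditions the abstract states. A smaller distance then signals a more trustworthy prediction.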

101. 【2604.15332】Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts

Link: https://arxiv.org/abs/2604.15332

Authors: Xiao Lu, Hao Zhen, Jidong J. Yang

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Keywords: manual preparation remains, preparation remains time-consuming, transportation safety analysis, human variability, essential tools

Comments: 16 pages, 5 figures, 3 tables

Click to view abstract

Abstract:Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o's superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.

102. 【2308.10562】Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

Link: https://arxiv.org/abs/2308.10562

Authors: Delfina Sol Martinez Pandiani, Valentina Presutti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: tasks remains unclear, high-level visual sensemaking, high-level visual, Computer Vision, image classification

Comments: Preprint

Click to view abstract

Abstract:The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic image classification. Our survey contributes in three main ways: Firstly, it clarifies the tacit understanding of high-level semantics in CV through a multidisciplinary analysis, and categorization into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics. Secondly, it identifies and categorizes computer vision tasks associated with high-level visual sensemaking, offering insights into the diverse research areas within this domain. Lastly, it examines how abstract concepts such as values and ideologies are handled in CV, revealing challenges and opportunities in AC-based image classification. Notably, our survey of AC image classification tasks highlights persistent challenges, such as the limited efficacy of massive datasets and the importance of integrating supplementary information and mid-level features. We emphasize the growing relevance of hybrid AI systems in addressing the multifaceted nature of AC image classification tasks. Overall, this survey enhances our understanding of high-level visual reasoning in CV and lays the groundwork for future research endeavors.

103. 【2604.16104】Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration

Link: https://arxiv.org/abs/2604.16104

Authors: Baramee Sukumal, Aueaphum Aueawatthanaphisut

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: cancer-related mortality worldwide, Lung cancer remains, mortality worldwide, cancer-related mortality, Lung cancer

Comments: 16 pages, 6 figures, 3 tables, 8 equations

Click to view abstract

Abstract:Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.
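
A weighted decision-level fusion of two classifiers can be sketched as a convex combination of their per-class probabilities. The abstract does not give the actual weighting scheme, so the single `ct_weight` parameter, the function name, and the example probabilities below are all hypothetical.

```python
def fuse_predictions(ct_probs, histo_probs, ct_weight=0.5):
    """Weighted average of two per-class probability vectors."""
    if len(ct_probs) != len(histo_probs):
        raise ValueError("both modalities must score the same classes")
    histo_weight = 1.0 - ct_weight
    return [ct_weight * c + histo_weight * h
            for c, h in zip(ct_probs, histo_probs)]

# The five classes listed in the abstract.
CLASSES = ["adenocarcinoma", "squamous cell carcinoma", "large cell carcinoma",
           "small cell lung cancer", "normal"]

# Hypothetical outputs: the CT model is unsure, histopathology is confident.
ct = [0.40, 0.35, 0.10, 0.10, 0.05]
histo = [0.10, 0.70, 0.10, 0.05, 0.05]

fused = fuse_predictions(ct, histo, ct_weight=0.4)
predicted = CLASSES[max(range(len(fused)), key=fused.__getitem__)]
print(predicted)  # squamous cell carcinoma
```

In this example the CT model alone would pick adenocarcinoma, while the fused score follows the more confident histopathology prediction.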

104. 【2604.15964】Topology-Driven Fusion of nnU-Net and MedNeXt for Accurate Brain Tumor Segmentation on Sub-Saharan Africa Dataset

Link: https://arxiv.org/abs/2604.15964

Authors: Prabin Bohara, Pralhad Kumar Shrestha, Arpan Rai, Usha Poudel Lamgade, Confidence Raymond, Dong Zhang, Aondona Lorumbu, Craig Jones, Mahesh Shakya, Bishesh Khanal, Pratibha Kulung

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Magnetic Resonance Imaging, low-field Magnetic Resonance, Accurate automatic brain, national imaging protocols, defined national imaging

Comments:

Click to view abstract

Abstract:Accurate automatic brain tumor segmentation in Low- and Middle-Income Countries (LMICs) is challenging due to the lack of defined national imaging protocols, diverse imaging data, extensive use of low-field Magnetic Resonance Imaging (MRI) scanners, and limited health-care resources. As part of the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge, we applied topology refinement to state-of-the-art segmentation models such as nnU-Net, MedNeXt, and a combination of both. Since the BraTS-Africa dataset has low MRI image quality, we incorporated the BraTS 2025 challenge data of pre-treatment adult glioma (Task 1) to pre-train the segmentation model, which we then fine-tuned on the BraTS-Africa dataset. We added an extra topology refinement module to address the deformation in predictions caused by topological errors. With the introduction of this module, we achieved a better Normalized Surface Distance (NSD) of 0.810, 0.829, and 0.895 on Surrounding Non-Enhancing FLAIR Hyperintensity (SNFH), Non-Enhancing Tumor Core (NETC), and Enhancing Tumor (ET), respectively.

105. 【2604.15561】CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark

Link: https://arxiv.org/abs/2604.15561

Authors: Anton Ivchenko

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: partitions mix slices, test partitions mix, Reported chest, segmentation performance, strongly inflated

Comments:

Click to view abstract

Abstract:Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.
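
The leakage described in this abstract comes from splitting at the slice level, so slices of one case land in both train and test. A case-disjoint split assigns whole cases to one side instead. The sketch below is illustrative only: the `(case_id, slice_id)` record format, the function name, and the 80/20 ratio are assumptions, not the CTSCAN split manifests.

```python
import random
from collections import defaultdict

def case_disjoint_split(slices, test_fraction=0.2, seed=0):
    """Split (case_id, slice_id) pairs so no case appears on both sides."""
    by_case = defaultdict(list)
    for case_id, slice_id in slices:
        by_case[case_id].append((case_id, slice_id))

    # Shuffle case ids, not slices, so every case stays in one partition.
    case_ids = sorted(by_case)
    random.Random(seed).shuffle(case_ids)
    n_test = max(1, round(len(case_ids) * test_fraction))
    test_cases = set(case_ids[:n_test])

    train = [s for c in case_ids if c not in test_cases for s in by_case[c]]
    test = [s for c in case_ids if c in test_cases for s in by_case[c]]
    return train, test

# Hypothetical data: 10 cases with 5 slices each.
slices = [(f"case{c:02d}", k) for c in range(10) for k in range(5)]
train, test = case_disjoint_split(slices)

overlap = {c for c, _ in train} & {c for c, _ in test}
print(len(train), len(test), overlap)  # 40 10 set()
```

Sorting the case ids before the seeded shuffle keeps the split deterministic across runs, in the spirit of the deterministic split manifests the abstract mentions.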

106. 【2604.15459】RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference

Link: https://arxiv.org/abs/2604.15459

Authors: Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu, Rongjun Ge, Yang Chen, Yuting He

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: lacks absolutely clean, absolutely clean images, limits denoising performance, fundamentally limits denoising, noisy reference problem

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.