本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新688篇论文,其中:

  • 自然语言处理110
  • 信息检索23
  • 计算机视觉156

自然语言处理

1. 【2604.08541】Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

链接https://arxiv.org/abs/2604.08541

作者:Haolei Xu,Haiwen Hong,Hongxing Li,Rui Zhou,Yang Zhang,Longtao Huang,Hui Xue,Yongliang Shen,Weiming Lu,Yueting Zhuang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:achieved remarkable performance, achieved remarkable, remarkable performance, performance on vision-language, domain

备注

点击查看摘要

Abstract:Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

2. 【2604.08540】AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

链接https://arxiv.org/abs/2604.08540

作者:Ziwei Zhou,Zeyuan Lai,Rui Wang,Yifan Yang,Zhen Xing,Yuqing Yang,Qi Dai,Lili Qiu,Chong Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:evaluation remains fragmented, media creation, remains fragmented, core interface, interface for media

备注

点击查看摘要

Abstract:Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at this http URL.

3. 【2604.08539】OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

链接https://arxiv.org/abs/2604.08539

作者:Wenbo Hu,Xin Chen,Yan Gao-Tian,Yihe Deng,Nanyun Peng,Kai-Wei Chang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Group Relative Policy, Relative Policy Optimization, facto Reinforcement Learning, Multimodal Large Language, Large Language Models

备注: code at: [this https URL](https://github.com/uclanlp/openvlthinker)

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric update for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforce direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.

4. 【2604.08527】Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

链接https://arxiv.org/abs/2604.08527

作者:Feng Luo,Yu-Neng Chuang,Guanchu Wang,Zicheng Xu,Xiaotian Han,Tianyi Zhang,Vladimir Braverman

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:trains student models, trains student, stronger teachers, student models, induced distribution

备注

点击查看摘要

Abstract:On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.

5. 【2604.08525】Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

链接https://arxiv.org/abs/2604.08525

作者:Addison J. Wu,Ryan Liu,Shuyue Stella Li,Yulia Tsvetkov,Thomas L. Griffiths

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Today large language, large language models, Today large, reinforcement learning, large language

备注

点击查看摘要

Abstract:Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company's incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users' inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.

6. 【2604.08524】What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

链接https://arxiv.org/abs/2604.08524

作者:Stephen Cheng,Sarah Wiegreffe,Dinesh Manocha

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Applying steering vectors, model alignment technique, effective model alignment, large language models, Applying steering

备注: 9 pages + appendix, 7 figures

点击查看摘要

Abstract:Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works-- specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit-- freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.

7. 【2604.08523】ClawBench: Can AI Agents Complete Everyday Online Tasks?

链接https://arxiv.org/abs/2604.08523

作者:Yuxuan Zhang,Yubo Wang,Yipeng Zhu,Penghui Du,Junwen Miao,Xuan Lu,Wendong Xu,Yunzhuo Hao,Songcheng Cai,Xiaochen Wang,Huaisong Zhang,Xian Wu,Yi Lu,Minyi Lei,Kai Zou,Huifeng Yin,Ping Nie,Liang Chen,Dongfu Jiang,Wenhu Chen,Kelsey R. Allen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:routine aspects, automate your inbox, automate other routine, automate, agents

备注: Project page: [this https URL](https://claw-bench.com)

点击查看摘要

Abstract:AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

8. 【2604.08519】Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

链接https://arxiv.org/abs/2604.08519

作者:Jiayuan Ye,Vitaly Feldman,Kunal Talwar

类目:Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:Large language models, memorize factual knowledge, Large language, knowledge-intensive tasks, factual knowledge

备注

点击查看摘要

Abstract:Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

Subjects:

Computation and Language (cs.CL); Machine Learning (stat.ML)

Cite as:
arXiv:2604.08519 [cs.CL]

(or
arXiv:2604.08519v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.08519

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
9. 【2604.08510】What do Language Models Learn and When? The Implicit Curriculum Hypothesis

链接https://arxiv.org/abs/2604.08510

作者:Emmy Liu,Kaiser Sun,Millicent Li,Isabelle Lee,Lindia Tjuatja,Jen-tse Huang,Graham Neubig

类目:Computation and Language (cs.CL)

关键词:Large language models, remain poorly understood, perform remarkably complex, Large language, pretraining remain poorly

备注

点击查看摘要

Abstract:Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M-13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$-$.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.

10. 【2604.08501】sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

链接https://arxiv.org/abs/2604.08501

作者:Sergey V Samsonau

类目:Digital Libraries (cs.DL); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:quality assurance, Science, integrity, misses fabricated citations, cs.DL

备注: Code: [this https URL](https://github.com/authentic-research-partners/sciwrite-lint)

点击查看摘要

Abstract:Science currently offers two options for quality assurance, both inadequate. Journal gatekeeping claims to verify both integrity and contribution, but actually measures prestige: peer review is slow, biased, and misses fabricated citations even at top venues. Open science provides no quality assurance at all: the only filter between AI-generated text and the public record is the author's integrity. AI-assisted writing makes both worse by producing more papers faster than either system can absorb. We propose a third option: measure the paper itself. sciwrite-lint (pip install sciwrite-lint) is an open-source linter for scientific manuscripts that runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models) with no manuscripts sent to external services. The pipeline verifies that references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers, verifies that they support the claims made about them, and follows one level further to check cited papers' own bibliographies. Each reference receives a per-reference reliability score aggregating all verification signals. We evaluate the pipeline on 30 unseen papers from arXiv and bioRxiv with error injection and LLM-adjudicated false positive analysis. As an experimental extension, we propose SciLint Score, combining integrity verification with a contribution component that operationalizes five frameworks from philosophy of science (Popper, Lakatos, Kitcher, Laudan, Mayo) into computable structural properties of scientific arguments. The integrity component is the core of the tool and is evaluated in this paper; the contribution component is released as experimental code for community development.

Comments:
Code: this https URL

Subjects:

Digital Libraries (cs.DL); Computation and Language (cs.CL); Software Engineering (cs.SE)

Cite as:
arXiv:2604.08501 [cs.DL]

(or
arXiv:2604.08501v1 [cs.DL] for this version)

https://doi.org/10.48550/arXiv.2604.08501

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
11. 【2604.08499】PIArena: A Platform for Prompt Injection Evaluation

链接https://arxiv.org/abs/2604.08499

作者:Runpeng Geng,Chenlong Yin,Yanting Wang,Ying Chen,Jinyuan Jia

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Prompt injection, prompt injection evaluation, real-world applications, injection attacks pose, pose serious security

备注: To appear in ACL 2026. The code is available at [this https URL](https://github.com/sleeepeer/PIArena)

点击查看摘要

Abstract:Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably compare defenses, understand their true robustness under diverse attacks, or assess how well they generalize across tasks and benchmarks. For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state-of-the-art attacks and defenses and evaluate them across a variety of existing and new benchmarks. We also design a dynamic strategy-based attack that adaptively optimizes injected prompts based on defense feedback. Through comprehensive evaluation using PIArena, we uncover critical limitations of state-of-the-art defenses: limited generalizability across tasks, vulnerability to adaptive attacks, and fundamental challenges when an injected task aligns with the target task. The code and datasets are available at this https URL.

12. 【2604.08494】What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

链接https://arxiv.org/abs/2604.08494

作者:Mohamed Amine Kerkouri,Marouane Tliba,Bin Wang,Aladine Chetouani,Ulas Bagci,Alessandro Bruno

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:attended image regions, existing methods predominantly, methods predominantly evaluate, neglecting semantic equivalence, predominantly evaluate spatial

备注: Accepted at ETRA 2026 GenAI workshop

点击查看摘要

Abstract:Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.

13. 【2604.08485】Formalizing building-up constructions of self-dual codes through isotropic lines in Lean

链接https://arxiv.org/abs/2604.08485

作者:Jae-Hyun Baek,Jon-Lark Kim

类目:Information Theory (cs.IT); Computation and Language (cs.CL)

关键词:self-dual codes, ary self-dual codes, binary self-dual codes, paper is two-fold, Kim building-up construction

备注: 27 pages

点击查看摘要

Abstract:The purpose of this paper is two-fold. First we show that Kim's building-up construction of binary self-dual codes is equivalent to Chinburg-Zhang's Hilbert symbol construction. Second we introduce a $q$-ary version of Chinburg-Zhang's construction in order to construct $q$-ary self-dual codes efficiently. For the latter, we study self-dual codes over split finite fields \(\F_q\) with \(q \equiv 1 \pmod{4}\) through three complementary viewpoints: the building-up construction, the binary arithmetic reduction of Chinburg--Zhang, and the hyperbolic geometry of the Euclidean plane. The condition that \(-1\) be a square is the common algebraic input linking these viewpoints: in the binary case it underlies the Lagrangian reduction picture, while in the split \(q\)-ary case it produces the isotropic line governing the correction terms in the extension formulas. As an application of our efficient form of generator matrices, we construct optimal self-dual codes from the split boxed construction, including self-dual \([6,3,4]\) and \([8,4,4]\) codes over \(\GF{5}\), MDS self-dual \([8,4,5]\) and \([10,5,6]\) codes over \(\GF{13}\), and a self-dual \([12,6,6]\) code over \(\GF{13}\). These structural statements are accompanied by a Lean~4 formalization of the algebraic core.

14. 【2604.08479】AI generates well-liked but templatic empathic responses

链接https://arxiv.org/abs/2604.08479

作者:Emma Gueorguieva,Hongli Zhan,Jina Suh,Javier Hernandez,Tatiana Lau,Junyi Jessy Li,Desmond C. Ong

类目:Computation and Language (cs.CL)

关键词:Recent research shows, Large Language Models, Recent research, turning to Large, Large Language

备注

点击查看摘要

Abstract:Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template -- a structured sequence of tactics -- that matches between 83--90% of LLM responses (and 60--83\% in a held out sample), and when those are matched, covers 81--92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.

15. 【2604.08456】Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

链接https://arxiv.org/abs/2604.08456

作者:Marcel Gröpl,Jaewoo Jung,Seungryong Kim,Marc Pollefeys,Sunghwan Hong

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:combining clues spread, pretrained vision-language models, tiny visual details, rapid progress, pretrained vision-language

备注: Project Page : [this https URL](https://entropy-gradient-grounding.github.io/)

点击查看摘要

Abstract:Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.

16. 【2604.08448】AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

链接https://arxiv.org/abs/2604.08448

作者:Lilian Wanzare,Cynthia Amol,zekiel Maina,Nelson Odhiambo,Hope Kerubo,Leila Misula,Vivian Oloo,Rennish Mboya,Edwin Onkoba,Edward Ombui,Joseph Muguro,Ciira wa Maina,Andrew Kipkebut,Alfred Omondi Otom,Ian Ndung'u Kang'ethe,Angela Wambui Kanyi,Brian Gichana Omwenga

类目:Computation and Language (cs.CL)

关键词:dataset comprising approximately, large-scale multilingual speech, multilingual speech dataset, speech dataset comprising, hours of audio

备注: 10 pages, 5 figures, 3 tables

点击查看摘要

Abstract:AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.

17. 【2604.08426】KV Cache Offloading for Context-Intensive Tasks

链接https://arxiv.org/abs/2604.08426

作者:Andrey Bocharnikov,Ivan Ermakov,Denis Kuznedelev,Vyacheslav Zhdanovskiy,Yegor Yershov

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:range of applications, growing demand, wide range, critical bottleneck, memory usage

备注: Preprint, Work in progress

点击查看摘要

Abstract:With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

18. 【2604.08425】Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

链接https://arxiv.org/abs/2604.08425

作者:Samay U. Shetty,Tharindu Cyril Weerasooriya,Deepak Pandita,Christopher M. Homan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:label subjective content, subjective content, humans label subjective, label subjective, disagreement

备注

点击查看摘要

Abstract:When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are not just what they label is essential for NLP systems that aim to faithfully represent human interpretive diversity.

19. 【2604.08423】Synthetic Data for any Differentiable Target

链接https://arxiv.org/abs/2604.08423

作者:Tristan Thrush,Sung Min Park,Herman Brunborg,Luke Bailey,Marcel Roed,Neil Band,Christopher Potts,Tatsunori Hashimoto

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:Dataset Policy Gradient, limits of controlling, controlling language models, Policy Gradient, target model

备注

点击查看摘要

Abstract:What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

20. 【2604.08401】Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

链接https://arxiv.org/abs/2604.08401

作者:Wenhao Yuan,Chenchen Lin,Jian Chen,Jinfeng Xu,Xuehe Wang,Edith Cheuk Han Ngai

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language model, language model, updating memory, large language, treated as reliable

备注: Accepted by ACL2026 Main Conference

点击查看摘要

Abstract:In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs repeatedly stored and propagated across decision steps, leading to systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on the consensus mechanism, conflating agreement with faithfulness. In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose \textbf{S}elf-\textbf{A}udited \textbf{Ve}rified \textbf{R}easoning (\textsc{SAVeR}), a novel framework that enforces verification over internal belief states within the agent before action commitment, achieving faithful reasoning. Concretely, we structurally generate persona-based diverse candidate beliefs for selection under a faithfulness-relevant structure space. To achieve reasoning faithfulness, we perform adversarial auditing to localize violations and repair through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end-task performance.

21. 【2604.08381】A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

链接https://arxiv.org/abs/2604.08381

作者:Wenxian Wang,Xiaohu Luo,Junfeng Hao,Xiaoming Gu,Xingshu Chen,Zhu Wang,Haizhou Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Chinese sarcasm detection, Chinese sarcasm, Generative Adversarial Network, sarcasm detection, enhanced Chinese sarcasm

备注

点击查看摘要

Abstract:Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users' linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5 based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, which outperforms all existing state-of-the-art (SOTA) approaches. This study presents a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.

22. 【2604.08377】SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

链接https://arxiv.org/abs/2604.08377

作者:Ziyu Ma,Shidong Yang,Yuxiang Ji,Xucong Wang,Yong Wang,Yiming Hu,Tongwen Huang,Xiangxiang Chu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language model, perform complex tasks, remain largely static, Large language, skills remain largely

备注: Work in progress

点击查看摘要

Abstract:Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.

23. 【2604.08369】Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

链接https://arxiv.org/abs/2604.08369

作者:Khushal Sethi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词:Inference-time compute scaling, apply compute uniformly, Trajectorical Adaptive Compute, methods apply compute, existing methods apply

备注

点击查看摘要

Abstract:Inference-time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.

24. 【2604.08368】SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization

链接https://arxiv.org/abs/2604.08368

作者:Seyed Mahmoud Sajjadi Mohammadabadi,Xiaolong Ma,Lei Yang,Feng Yan,Junshan Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:enable scalable adaptation, Parameter-efficient fine-tuning, injecting low-rank adapters, enable scalable, scalable adaptation

备注

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses each PEFT update as a linear combination of basis vectors formed from the foundation model's singular vectors with controlled random perturbations. By exploiting the subspace similarity (the alignment of principal directions) between the foundation model and task-specific fine-tuned updates, SOLAR decouples the adapter size from PEFT structure and ensures compact yet expressive representations. It is model-agnostic and compatible with existing PEFT methods, including LoRA, AdaLoRA, and other adapter modules. We theoretically establish a bound on the reconstruction error. Experiments on language and vision tasks using LLaMA, GPT, and ViT models demonstrate that SOLAR preserves task performance while significantly reducing model representation sizes, offering an effective and communication-efficient solution for deployment in distributed systems and edge devices.

25. 【2604.08362】owards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

链接https://arxiv.org/abs/2604.08362

作者:Jiawei Chen,Ruoxi Xu,Boxi Cao,Ruotong Pan,Yunfei Zhang,Yifei Hu,Yong Du,Tingting Gao,Yaojie Lu,Yingfei Sun,Xianpei Han,Le Sun,Xiangyu Wu,Hongyu Lin

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, general-purpose user simulator, emergence of Large, Language Models

备注

点击查看摘要

Abstract:The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

26. 【2604.08299】SeLaR: Selective Latent Reasoning in Large Language Models

链接https://arxiv.org/abs/2604.08299

作者:Renyu Fu,Guibo Luo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, discrete token sampling, language models, large language, effectiveness is constrained

备注: Camera-ready for ACL 2026 (main conference)

点击查看摘要

Abstract:Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.

27. 【2604.08294】Can Vision Language Models Judge Action Quality? An Empirical Evaluation

链接https://arxiv.org/abs/2604.08294

作者:Miguel Monte e Freitas,Rui Henriques,Ricardo Rei,Pedro Henrique Martins

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Action Quality Assessment, Vision Language Models, sports coaching, Action Quality, physical therapy

备注

点击查看摘要

Abstract:Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

28. 【2604.08284】Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models

链接https://arxiv.org/abs/2604.08284

作者:Yating Wang,Wenting Zhao,Yaqi Zhao,Yongshun Gong,Yilong Yin,Haoliang Sun

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, natural language explanations, Large language, language models store, natural language

备注: 17 pages,3 figures. Under review

点击查看摘要

Abstract:Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B. The code is available at this https URL.

29. 【2604.08281】When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

链接https://arxiv.org/abs/2604.08281

作者:Ruotao Xu,Yixin Ji,Yu Luo,Jinpeng Li,Dong Li,Peifeng Li,Juntao Li,Min Zhang

类目:Computation and Language (cs.CL)

关键词:Large reasoning models, test time computation, require precise computation, extensive knowledge reserves, scaling test time

备注

点击查看摘要

Abstract:Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. And there are cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as "Tool Ignored''. This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, We introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.

30. 【2604.08275】Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions

链接https://arxiv.org/abs/2604.08275

作者:Prisca Piccirilli,Alexander Fraser,Sabine Schulte im Walde

类目:Computation and Language (cs.CL)

关键词:express abstract concepts, Metaphor pervades everyday, pervades everyday language, allowing speakers, concrete domains

备注: 17 pages, 4 figures, 3 tables. Accepted at CMCL@LREC2026

点击查看摘要

Abstract:Metaphor pervades everyday language, allowing speakers to express abstract concepts via concrete domains. While prior work has studied metaphors cognitively and psycholinguistically, large-scale comparisons with literal language remain limited, especially for near-synonymous expressions. We analyze 297 English verb-object pairs (e.g., float idea vs. suggest idea) in ~2M corpus sentences, examining their contextual usage. Using five NLP tools, we extract 2,293 cognitive and linguistic features capturing affective, lexical, syntactic, and discourse-level properties. We address: (i) whether features differ between metaphorical and literal contexts (cross-pair analysis), and (ii) whether individual VO pairs diverge internally (within-pair analysis). Cross-pair results show literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analyses reveal substantial heterogeneity, with most pairs showing non-uniform effects. These results suggest no single, consistent distributional pattern that distinguishes metaphorical from literal usage. Instead, differences are largely construction-specific. Overall, large-scale data combined with diverse features provides a fine-grained understanding of metaphor-literal contrasts in VO usage.

31. 【2604.08260】Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

链接https://arxiv.org/abs/2604.08260

作者:Jun Seo,Sangwon Ryu,Heejin Do,Hyounghun Kim,Gary Geunbae Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Knowledge Tracing, predict learners' future, learners' future performance, Knowledge Components, aims to predict

备注: ACL Findings 2026

点击查看摘要

Abstract:Knowledge Tracing (KT) aims to predict learners' future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item's solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya's framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.

32. 【2604.08256】HyperMem: Hypergraph Memory for Long-Term Conversations

链接https://arxiv.org/abs/2604.08256

作者:Juwei Yue,Chuanrui Hu,Jiawei Sheng,Zuyi Zhou,Wenyuan Zhang,Tingwen Liu,Li Guo,Yafeng Deng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:track persistent tasks, provide personalized interactions, maintain coherence, track persistent, persistent tasks

备注: ACL 2026 Main

点击查看摘要

Abstract:Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Particularly, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.

33. 【2604.08243】Self-Debias: Self-correcting for Debiasing Large Language Models

链接https://arxiv.org/abs/2604.08243

作者:Xuan Feng,Shuai Zhao,Luwei Xiao,Tianlong Gu,Bo An

类目:Computation and Language (cs.CL)

关键词:Large Language Models, inherent social biases, Large Language, Bias Propagation, demonstrate remarkable reasoning

备注

点击查看摘要

Abstract:Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.

34. 【2604.08156】raining Data Size Sensitivity in Unsupervised Rhyme Recognition

链接https://arxiv.org/abs/2604.08156

作者:Petr Plecháč,Artjoms Šeļa,Silvie Cinková,Mirella De Sisto,Lara Nugues,Neža Kočnik,Antonina Martynenko,Ben Nagy,Luca Giovannini,Robert Kolár

类目:Computation and Language (cs.CL)

关键词:deceptively intuitive, constructed historically, people disagree, rhyme classification, scholars struggle

备注

点击查看摘要

Abstract:Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhymed recognition and evaluation, especially in multilingual context. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.

35. 【2604.08148】Clickbait detection: quick inference with maximum impact

链接https://arxiv.org/abs/2604.08148

作者:Soveatin Kuntur,Panggih Kusuma Ningrum,Anna Wróblewska,Maria Ganzha,Marcin Paprzycki

类目:Computation and Language (cs.CL)

关键词:lightweight hybrid approach, combines OpenAI semantic, OpenAI semantic embeddings, heuristic features capturing, features capturing stylistic

备注: Accepted Student competition ICCS 2026

点击查看摘要

Abstract:We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC--AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.

36. 【2604.08133】Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

链接https://arxiv.org/abs/2604.08133

作者:Baihui Liu,Kaiyuan Tian,Wei Wang,Zhaoning Zhang,Linbo Qiao,Dongsheng Li

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:scaling large language, sparse activation mechanism, language models due, large language models, expert activations

备注: ACL 2026 main

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to their sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations potentially lead to severe model performance degradation. In this work, we introduce the concept of \emph{activation budget} as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that optimizes budget allocation coordinately at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Especially, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.

37. 【2604.08131】Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

链接https://arxiv.org/abs/2604.08131

作者:Soveatin Kuntur,Maciej Krzywda,Anna Wróblewska,Marcin Paprzycki,Maria Ganzha,Szymon Łukasik,Amir H. Gandomi

类目:Computation and Language (cs.CL)

关键词:including large language, large language models, including large, rapid spread, spread of online

备注: Accepted at Computational Modeling and Artificial Intelligence for Social Systems Track in ICCS 2026

点击查看摘要

Abstract:The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.

38. 【2604.08126】LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

链接https://arxiv.org/abs/2604.08126

作者:Tian Huang,Tom Bourgeade,Irina Illina

类目:Computation and Language (cs.CL)

关键词:Objective Structured Clinical, Structured Clinical Examinations, Clinical Examinations, medical students' clinical, Objective Structured

备注: 11 pages, 2 figures, to be published in LREC 2026 proceedings

点击查看摘要

Abstract:Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models ($\le$32B parameters) achieve accuracies comparable to GPT-4o ($\sim$90\%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.

39. 【2604.08120】Small Vision-Language Models are Smart Compressors for Long Video Understanding

链接https://arxiv.org/abs/2604.08120

作者:Junjie Fei,Jun Chen,Zechun Liu,Yunyang Xiong,Chong Zhou,Wei Wen,Junlin Han,Mingchen Zhuge,Saksham Suri,Qi Qian,Shuming Liu,Lemeng Wu,Raghuraman Krishnamoorthi,Vikas Chandra,Mohamed Elhoseiny,Chenchen Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Adapting Multimodal Large, Multimodal Large Language, Large Language Models, Adapting Multimodal, Multimodal Large

备注: Project page and demo are available at [this https URL](https://FeiElysia.github.io/tempo-page/)

点击查看摘要

Abstract:Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

40. 【2604.08118】Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

链接https://arxiv.org/abs/2604.08118

作者:Ian W. Kennedy,Nafise Sadat Moosavi

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Additive quantization enables, enables extreme LLM, Additive quantization, quantization enables extreme, extreme LLM compression

备注: 9 pages (+ references and appendix). Under review at ACL Rolling Review

点击查看摘要

Abstract:Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio \r{ho} = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with \r{ho}: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.

41. 【2604.08104】Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection

链接https://arxiv.org/abs/2604.08104

作者:Khalid Zaman,Melike Sah,Anuwat Chaiwongyenc,Cem Direkoglu

类目:Computation and Language (cs.CL)

关键词:propose Quantum Vision, propose Quantum, information waves, deep learning-based audio, theory

备注

点击查看摘要

Abstract:We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVSpoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.

42. 【2604.08075】Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

链接https://arxiv.org/abs/2604.08075

作者:Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

类目:Computation and Language (cs.CL)

关键词:substantial KV-cache over-allocation, Production vLLM fleets, worst-case context length, Production vLLM, vLLM fleets typically

备注

点击查看摘要

Abstract:Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.

43. 【2604.08052】Efficient Provably Secure Linguistic Steganography via Range Coding

链接https://arxiv.org/abs/2604.08052

作者:Ruiyi Yan,Yugo Murawaki

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:enable covert communication, seemingly innocuous texts, involves embedding secret, embedding secret messages, covert communication

备注: ACL2026 Main

点击查看摘要

Abstract:Linguistic steganography involves embedding secret messages within seemingly innocuous texts to enable covert communication. Provable security, which is a long-standing goal and key motivation, has been extended to language-model-based steganography. Previous provably secure approaches have achieved perfect imperceptibility, measured by zero Kullback-Leibler (KL) divergence, but at the expense of embedding capacity. In this paper, we attempt to directly use a classic entropy coding method (range coding) to achieve secure steganography, and then propose an efficient and provably secure linguistic steganographic method with a rotation mechanism. Experiments across various language models show that our method achieves around 100% entropy utilization (embedding efficiency) for embedding capacity, outperforming the existing baseline methods. Moreover, it achieves high embedding speeds (up to 1554.66 bits/s on GPT-2). The code is available at this http URL.

44. 【2604.08046】Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

链接https://arxiv.org/abs/2604.08046

作者:Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Yuxi Zhang,Huimin Wang,Yutian Zhao,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu

类目:Computation and Language (cs.CL)

关键词:enhances Large Language, significantly enhances Large, Large Language Models, Large Language, enhances Large

备注: Accepted by ACL'26

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical ''integration bottleneck'': even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an ''Inner-Answer'' based solely on parametric knowledge to capture the model's reasoning flow. Second, to guarantee faithful evidence extraction, we generate a ''Refer-Answer'' using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.

45. 【2604.08000】PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

链接https://arxiv.org/abs/2604.08000

作者:Zhifei Xie,Zongzheng Hu,Fangda Ye,Xin Zhang,Haobo Chai,Zihang Liu,Pengcheng Wu,Guibin Zhang,Yue Liao,Xiaobin Hu,Deheng Ye,Chunyan Miao,Shuicheng Yan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

关键词:expectation for AGI, core expectation, Proactive Agent System, Proactivity, AGI

备注: Technical report; Work in progress

点击查看摘要

Abstract:Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent. We instantiate this paradigm in Pask, with streaming IntentFlow model for DD, a hybrid memory (workspace, user, global) for long-term MM, PAS infra framework and introduce how these components form a closed loop. We also introduce LatentNeeds-Bench, a real-world benchmark built from user-consented data and refined through thousands of rounds of human editing. Experiments show that IntentFlow matches leading Gemini3-Flash models under latency constraints, while identifying deeper user intent.

46. 【2604.07985】Rag Performance Prediction for Question Answering

链接https://arxiv.org/abs/2604.07985

作者:Or Dado,David Carmel. Oren Kurland

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:retrieval augmented generation, augmented generation, address the task, task of predicting, predicting the gain

备注: 12 pages. 2 figures. 1 table

点击查看摘要

Abstract:We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.

47. 【2604.07981】A Decomposition Perspective to Long-context Reasoning for LLMs

链接https://arxiv.org/abs/2604.07981

作者:Yanling Xiao,Huaibing Xie,Guoliang Zhao,Shihan Dou,Shaolei Wang,Yiting Liu,Nantao Zheng,Cheng Zhang,Pluto Zhou,Zhisong Zhang,Lemao Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, complex real-world applications, challenge for Large, Long-context reasoning

备注

点击查看摘要

Abstract:Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7\% (improving from 46.3\% to 54.0\%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.

48. 【2604.07969】Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

链接https://arxiv.org/abs/2604.07969

作者:George Fountzoulas

类目:Computation and Language (cs.CL)

关键词:text classification architecture, requiring no tokenizer, FFT-Rotate Wavetable Encoder, directly on raw, attention mechanism

备注: 12 pages, 6 tables

点击查看摘要

Abstract:We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and only 733K parameters. Kathleen introduces three novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats), replacing conventional embedding tables (65K parameters) while improving accuracy; (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters that our ablation identifies as the single most impactful component (+2.6% accuracy, 0.001% of model parameters). Through comprehensive ablation of a 1.8M-parameter predecessor, we show that frequency-domain components systematically outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only -0.2%, while removing the 6-parameter PhaseHarmonics costs -2.6%. The resulting Kathleen-Clean achieves 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2 -- outperforming a tokenized counterpart with 16x more parameters on IMDB (+1.6%) and AG News (+2.1%). Kathleen processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.

49. 【2604.07967】AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

链接https://arxiv.org/abs/2604.07967

作者:Hongyi Cen,Mingxin Wang,Yule Liu,Jingyi Zheng,Hanze Jia,Tan Tang,Yingcai Wu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:test fact-checking systems, standard metrics fail, capture truth-conditional consistency, label semantically corrupted, Atomic Validity Scoring

备注

点击查看摘要

Abstract:Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals in our experiments. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.

50. 【2604.07963】Rethinking Data Mixing from the Perspective of Large Language Models

链接https://arxiv.org/abs/2604.07963

作者:Yuanjian Xu,Tianze Sun,Changwei Xu,XinLong Zhao,Jianing Hao,Ran Chen,Yang Liu,Ruijie Xu,Stephen Chen,Guang Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Data mixing strategy, large language model, mixing strategy, strategy is essential, essential for large

备注

点击查看摘要

Abstract:Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.

51. 【2604.07960】OOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

链接https://arxiv.org/abs/2604.07960

作者:Yifei Gong,Xing Wu,Wenda Liu,Kang Tu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Computer-Aided Design, coherent modeling actions, CAD, CAD tool-using agents, relies on long-horizon

备注

点击查看摘要

Abstract:Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to rollout reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into proficient CAD tool-using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.

52. 【2604.07941】Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

链接https://arxiv.org/abs/2604.07941

作者:Shiwan Zhao,Zhihu Wang,Xuyang Zhao,Jiaming Zhou,Caiyue Xu,Chenfei Liu,Liting Zhang,Yuhang Jia,Yanzhe Zhang,Hualong Yu,Zichen Xu,Qicheng Li,Yong Qin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:turning pretrained large, pretrained large language, central to turning, turning pretrained, pretrained large

备注: 38 pages, 1 figure, 8 tables

点击查看摘要

Abstract:Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.

Comments:
38 pages, 1 figure, 8 tables

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as:
arXiv:2604.07941 [cs.CL]

(or
arXiv:2604.07941v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.07941

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Shiwan Zhao Mr [view email] [v1]
Thu, 9 Apr 2026 08:00:37 UTC (104 KB)

53. 【2604.07937】HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy

链接https://arxiv.org/abs/2604.07937

作者:Guoqi Ma,Liang Zhang,Hongyao Tu,Hao Fu,Hui Li,Yujie Lin,Longyue Wang,Weihua Luo,Jinsong Su

类目:Computation and Language (cs.CL)

关键词:tail entities located, Small Language Model, Large Language Models, Cross-document relation extraction, aims to identify

备注: ACL 2026 Findings

点击查看摘要

Abstract:Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of ``\textit{Small Language Model (SLM) + Classifier}''. However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based \underline{H}ierarchical \underline{C}lassification model for cross-document \underline{RE} (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a \textit{hierarchical relation tree} derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a \textit{prediction-then-verification} inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.

54. 【2604.07922】SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

链接https://arxiv.org/abs/2604.07922

作者:Weiyang Huang,Xuefeng Bai,Kehai Chen,Xinyang Chen,Yibin Chen,Weili Guan,Min Zhang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:revolutionized complex problem-solving, generating unnecessarily long, Large Reasoning Models, long reasoning chains, unnecessarily long reasoning

备注: accepted to ACL2026 main conference

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit a pervasive "overthinking", generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framework that performs step-level, difficulty-aware pruning while preserving the core reasoning structure. SAT formulates reasoning as a Finite-State Machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip). It navigates these states dynamically using a lightweight Process Reward Model (PRM), compressing easy steps while preserving depth for hard ones. Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.

55. 【2604.07894】SUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

链接https://arxiv.org/abs/2604.07894

作者:Xinliang Frederick Zhang,Lu Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Personalized large language, garnered significant attention, Personalized large, large language models, large language

备注

点击查看摘要

Abstract:Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individual's needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user's extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirms that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.

56. 【2604.07892】Data Selection for Multi-turn Dialogue Instruction Tuning

链接https://arxiv.org/abs/2604.07892

作者:Bo Li,Shikun Zhang,Wei Ye

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Instruction-tuned language models, language models increasingly, models increasingly rely, mismatched answer formats, multi-turn dialogue corpora

备注: Project: [this https URL](https://github.com/pkuserc/MDS)

点击查看摘要

Abstract:Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose \textbf{MDS} (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.

57. 【2604.07886】Linear Representations of Hierarchical Concepts in Language Models

链接https://arxiv.org/abs/2604.07886

作者:Masaki Sakata,Benjamin Heinzerling,Takumi Ito,Sho Yokoi,Kentaro Inui

类目:Computation and Language (cs.CL)

关键词:Eastern Asia, Asia, subset, extent hierarchical relations, Japan

备注: 27 pages, 18 figures, 11 tables

点击查看摘要

Abstract:We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.

58. 【2604.07885】Contextualising (Im)plausible Events Triggers Figurative Language

链接https://arxiv.org/abs/2604.07885

作者:Annerose Eichel,Tonmoy Rakshit,Sabine Schulte im Walde

类目:Computation and Language (cs.CL)

关键词:work explores, explores the connection, events in English, English, implausible event triples

备注

点击查看摘要

Abstract:This work explores the connection between (non-)literalness and plausibility at the example of subject-verb-object events in English. We design a systematic setup of plausible and implausible event triples in combination with abstract and concrete constituent categories. Our analysis of human and LLM-generated judgments and example contexts reveals substantial differences between assessments of plausibility. While humans excel at nuanced detection and contextualization of (non-)literal vs. implausible events, LLM results reveal only shallow contextualization patterns with a bias to trade implausibility for non-literal, plausible interpretations.

59. 【2604.07883】An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

链接https://arxiv.org/abs/2604.07883

作者:Gabriel Stefan,Adrian-Marius Dumitran

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

关键词:nationalist framing, implicit biases, audit at scale, selective omissions, difficult to audit

备注: Accepted for ITS(Intelligent Tutoring Systems) 2026 Full Paper

点击查看摘要

Abstract:History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3\% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8\% of cases over both a heuristic variant and the zero-shot baseline. At approximately \$2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.

Comments:
Accepted for ITS(Intelligent Tutoring Systems) 2026 Full Paper

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

Cite as:
arXiv:2604.07883 [cs.AI]

(or
arXiv:2604.07883v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.07883

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
60. 【2604.07877】MemReader: From Passive to Active Extraction for Long-Term Agent Memory

链接https://arxiv.org/abs/2604.07877

作者:Jingyi Kang,Chunyu Li,Ding Chen,Bo Tang,Feiyu Xiong,Zhiyu Li

类目:Computation and Language (cs.CL)

关键词:long-term memory extraction, Long-term memory, Relative Policy Optimization, Group Relative Policy, memory extraction

备注

点击查看摘要

Abstract:Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.

61. 【2604.07834】Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers

链接https://arxiv.org/abs/2604.07834

作者:Michelle Damin Kim,Ellie S. Paek,Yufen Lin,Emily Mroz,Jane Chung,Jinho D. Choi

类目:Computation and Language (cs.CL)

关键词:social media, social media datasets, loneliness, constructing diverse social, social media text

备注

点击查看摘要

Abstract:This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers' loneliness were predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.

62. 【2604.07831】Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

链接https://arxiv.org/abs/2604.07831

作者:Wenkui Yang,Chao Jin,Haisu Zhu,Weilin Luo,Derek Yuen,Kun Shao,Huaibo Huang,Junxian Duan,Jie Cao,Ran He

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing red-teaming studies, studies on GUI, Existing red-teaming, GUI agents, important limitations

备注: 44 pages, 10 figures, public code will be available at [this https URL](https://github.com/HashTAG00002/UI-Injection)

点击查看摘要

Abstract:Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness under a more practical threat model, we propose Semantic-level UI Element Injection, a red-teaming setting that overlays safety-aligned and harmless UI elements onto screenshots to misdirect the agent's visual grounding. Our method uses a modular Editor-Overlapper-Victim pipeline and an iterative search procedure that samples multiple candidate edits, keeps the best cumulative overlay, and adapts future prompt strategies based on previous failures. Across five victim models, our optimized attacks improve attack success rate by up to 4.4x over random injection on the strongest victims. Moreover, elements optimized on one source model transfer effectively to other target models, indicating model-agnostic vulnerabilities. After the first successful attack, the victim still clicks the attacker-controlled element in more than 15% of later independent trials, versus below 1% for random injection, showing that the injected element acts as a persistent attractor rather than simple visual clutter.

63. 【2604.07822】Loop, Think, Generalize: Implicit Reasoning in Recurrent-Depth Transformers

链接https://arxiv.org/abs/2604.07822

作者:Harsh Kohli,Srinivasan Parthasarathy,Huan Sun,Yuekun Yao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:single forward pass, forward pass, generalization, single forward, knowledge

备注: 19 pages, 18 figures. Under review

点击查看摘要

Abstract:We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enables iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively make such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.

64. 【2604.07821】More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

链接https://arxiv.org/abs/2604.07821

作者:Advait Yadav,Sid Black,Oliver Sourbut

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language model, Large language, agents increasingly coordinate, increasingly coordinate, lack an understanding

备注: Accepted at ICLR 2026 Workshop on Agents in the Wild. 24 pages, 5 figures

点击查看摘要

Abstract:Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, while being given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.

65. 【2604.07816】ool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model

链接https://arxiv.org/abs/2604.07816

作者:Kunfeng Chen,Luyao Zhuang,Fei Liao,Juhua Liu,Jian Wang,Bo Du

类目:Computation and Language (cs.CL)

关键词:tool retrieval, address real-world challenges, Tool, vague instructions, large language models

备注: 14 pages, 6 figures

点击查看摘要

Abstract:Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Due to the extensive and irregularly updated number of tools, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy would hinder the tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate human vague instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever this http URL conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at this https URL.

66. 【2604.07815】AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

链接https://arxiv.org/abs/2604.07815

作者:Yuxuan Hu,Jianchao Tan,Jiaqi Zhang,Wen Zan,Pingwei Sun,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai,Jing Zhang

类目:Computation and Language (cs.CL)

关键词:Long-context inference, quadratic attention complexity, inference in LLMs, LLMs faces, faces the dual

备注

点击查看摘要

Abstract:Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA, and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.

67. 【2604.07808】GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

链接https://arxiv.org/abs/2604.07808

作者:Kaiyuan Tian,Yu Tang,Gongqingjian Jiang,Baihui Liu,Yifu Gao,Xialin Su,Linbo Qiao,Dongsheng Li

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:GPU memory requirements, substantial GPU memory, substantial GPU, large language models, GPU memory

备注: Accepted by ACL 2026 Findings

点击查看摘要

Abstract:Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97\%.

68. 【2604.07801】EMPER: Testing Emotional Perturbation in Quantitative Reasoning

链接https://arxiv.org/abs/2604.07801

作者:Atahan Dokme,Benjamin Reichman,Larry Heck

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, emotionally neutral language, quantitative reasoning tasks, reasoning tasks written

备注: 25 pages, 8 figures. Preprint. Under review

点击查看摘要

Abstract:Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion--neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.

69. 【2604.07789】ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents

链接https://arxiv.org/abs/2604.07789

作者:Kenan Li,Qirui Jin,Liao Zhu,Xiaosong Huang,Yijia Wu,Yikai Zhang,Xin Zhang,Zijian Jin,Yufan Huang,Elsie Nallipogu,Chaoyun Zhang,Yu Kang,Saravan Rajmohan,Qingwei Lin,Wenke Lee,Dongmei Zhang

类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:automated software engineering, significantly improved automated, improved automated software, Recent advances, language model

备注: Under peer review; 37 pages, 10 figures, 5 tables

点击查看摘要

Abstract:Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.

70. 【2604.07788】PeReGrINE: Evaluating Personalized Review Fidelity with User Item Graph Context

链接https://arxiv.org/abs/2604.07788

作者:Steven Au,Baihan Lin

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:grounded in graph-structured, personalized review generation, review generation grounded, User Style Parameter, restructures Amazon Reviews

备注

点击查看摘要

Abstract:We introduce PeReGrINE, a benchmark and evaluation framework for personalized review generation grounded in graph-structured user--item evidence. PeReGrINE restructures Amazon Reviews 2023 into a temporally consistent bipartite graph, where each target review is conditioned on bounded evidence from user history, item context, and neighborhood interactions under explicit temporal cutoffs. To represent persistent user preferences without conditioning directly on sparse raw histories, we compute a User Style Parameter that summarizes each user's linguistic and affective tendencies over prior reviews. This setup supports controlled comparison of four graph-derived retrieval settings: product-only, user-only, neighbor-only, and combined evidence. Beyond standard generation metrics, we introduce Dissonance Analysis, a macro-level evaluation framework that measures deviation from expected user style and product-level consensus. We also study visual evidence as an auxiliary context source and find that it can improve textual quality in some settings, while graph-derived evidence remains the main driver of personalization and consistency. Across product categories, PeReGrINE offers a reproducible way to study how evidence composition affects review fidelity, personalization, and grounding in retrieval-conditioned language models.

71. 【2604.07775】ACIArena: Toward Unified Evaluation for Agent Cascading Injection

链接https://arxiv.org/abs/2604.07775

作者:Hengyu An,Minxi Li,Jinghuai Zhang,Naen Xu,Chunyi Zhou,Changjiang Li,Xiaogang Xu,Tianyu Du,Shouling Ji

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词:Agent Cascading Injection, empower Multi-Agent Systems, sharing empower Multi-Agent, critical security risk, Cascading Injection

备注: ACL 2026

点击查看摘要

Abstract:Collaboration and information sharing empower Multi-Agent Systems (MAS) but also introduce a critical security risk known as Agent Cascading Injection (ACI). In such attacks, a compromised agent exploits inter-agent trust to propagate malicious instructions, causing cascading failures across the system. However, existing studies consider only limited attack strategies and simplified MAS settings, limiting their generalizability and comprehensive evaluation. To bridge this gap, we introduce ACIArena, a unified framework for evaluating the robustness of MAS. ACIArena offers systematic evaluation suites spanning multiple attack surfaces (i.e., external inputs, agent profiles, inter-agent messages) and attack objectives (i.e., instruction hijacking, task disruption, information exfiltration). Specifically, ACIArena establishes a unified specification that jointly supports MAS construction and attack-defense modules. It covers six widely used MAS implementations and provides a benchmark of 1,356 test cases for systematically evaluating MAS robustness. Our benchmarking results show that evaluating MAS robustness solely through topology is insufficient; robust MAS require deliberate role design and controlled interaction patterns. Moreover, defenses developed in simplified environments often fail to transfer to real-world settings; narrowly scoped defenses may even introduce new vulnerabilities. ACIArena aims to provide a solid foundation for advancing deeper exploration of MAS design principles.

72. 【2604.07766】Sensitivity-Positional Co-Localization in GQA Transformers

链接https://arxiv.org/abs/2604.07766

作者:Manoj Chandrashekar Rao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Grouped Query Attention, Query Attention, Grouped Query, fundamental structural question, task correctness coincide

备注: 8 pages, 5 figures

点击查看摘要

Abstract:We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce \LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.

73. 【2604.07755】An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

链接https://arxiv.org/abs/2604.07755

作者:Clarissa Miranda-Pena,Andrew Reeson,Cécile Paris,Josiah Poon,Jonathan K. Kummerfeld

类目:Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:Large Language Models, Language Models continue, Large Language, Language Models, Models continue

备注

点击查看摘要

Abstract:Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of this http URL intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.

74. 【2604.07754】he Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

链接https://arxiv.org/abs/2604.07754

作者:Rui Zhang,Hongwei Li,Yun Shen,Xinyue Shen,Wenbo Jiang,Guowen Xu,Yang Liu,Michael Backes,Yang Zhang

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:raises significant ethical, large language models, raises significant, large language, significant ethical

备注: Accepted by ACL Findings 2026

点击查看摘要

Abstract:The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in \emph{misalignment}. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as \emph{realignment}, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at this https URL.

75. 【2604.07753】Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

链接https://arxiv.org/abs/2604.07753

作者:Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Empowering Large Multimodal, Large Multimodal Models, Empowering Large, Multimodal Models, Large Multimodal

备注

点击查看摘要

Abstract:Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.

76. 【2604.07749】Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

链接https://arxiv.org/abs/2604.07749

作者:Steven Au,Sujit Noronha

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, reflect accommodation, Large, Philosophical Pressure Taxonomy

备注

点击查看摘要

Abstract:Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce \textbf{PPT-Bench}, a diagnostic benchmark for evaluating \textit{epistemic attack}, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.

77. 【2604.07747】Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

链接https://arxiv.org/abs/2604.07747

作者:Pei-Xi Xie,Che-Yu Lin,Cheng-Lin Yang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:narrowing solution coverage, Reinforcement learning, verifiable rewards, reasoning accuracy, learning with verifiable

备注

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using $\texttt{Qwen3-1.7B-Base}$ and $\texttt{Llama-3.2-1B-Instruct}$. On $\texttt{Qwen3-1.7B-Base}$, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On $\texttt{Llama-3.2-1B-Instruct}$, the gains are concentrated in the large-$k$ regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.

78. 【2604.07737】SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

链接https://arxiv.org/abs/2604.07737

作者:Jie Sun,Yu Liu,Lu Han,Qiwen Deng,Xiang Shu,Yang Xiao,Xingyu Lu,Jun Zhou,Pengfei Liu,Lintao Ma,Jiancan Wu,Xiang Wang

类目:Computation and Language (cs.CL)

关键词:transformer-based Large Language, Large Language Models, Large Language, theoretically support massive, severe performance degradation

备注: 16 pages, 4 figures, 5 tables

点击查看摘要

Abstract:While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.

79. 【2604.07729】Emotion Concepts and their Function in a Large Language Model

链接https://arxiv.org/abs/2604.07729

作者:Nicholas Sofroniew,Isaac Kauvar,William Saunders,Runjin Chen,Tom Henighan,Sasha Hydrie,Craig Citro,Adam Pearce,Julius Tarng,Wes Gurnee,Joshua Batson,Sam Zimmerman,Kelley Rivoire,Kyle Fish,Chris Olah,Jack Lindsey

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, exhibit emotional reactions, Large language, emotional reactions, exhibit emotional

备注

点击查看摘要

Abstract:Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model's behavior.

80. 【2604.07725】Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution

链接https://arxiv.org/abs/2604.07725

作者:Monishwaran Maheswaran,Leon Lakhani,Zhongzhu Zhou,Shijia Yang,Junxiong Wang,Coleman Hooper,Yuezhou Hu,Rishabh Tiwari,Jue Wang,Harman Singh,Qingyang Wu,Yuqing Jian,Ce Zhang,Kurt Keutzer,Tri Dao,Xiaoxia Wu,Ben Athiwaratkun,James Zou,Chenfeng Xu

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Squeeze Evolve, repeated evolution accelerates, high-cost model wastes, model wastes compute, evolution accelerates collapse

备注: 40 Pages, Project Page: [this https URL](https://squeeze-evolve.github.io/)

点击查看摘要

Abstract:We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while the uniform use of a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the other stages at much lower costs. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks, such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to $\sim$3$\times$ and increases fixed-budget serving throughput by up to $\sim$10$\times$. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.

81. 【2604.07717】Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models

链接https://arxiv.org/abs/2604.07717

作者:Ziyi Chen,Yasir Khan,Mengyuan Zhang,Cheng Peng,Mengxian Lyu,Yiyang Liu,Krishna Vaddiparti,Robert L Cook,Mattia Prosperi,Yonghui Wu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Human immunodeficiency virus, critical psychosocial determinant, influencing mental health, Human immunodeficiency, identifying HIV stigma

备注

点击查看摘要

Abstract:Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.

82. 【2604.07709】IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

链接https://arxiv.org/abs/2604.07709

作者:David Gringras

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:Ashton Manual taper, textbook Ashton Manual, milligrams of alprazolam, ten days, pills left

备注: 30 pages, 3 figures, 11 tables. Pre-registered on OSF (DOI: [https://doi.org/10.17605/OSF.IO/G6VMZ](https://doi.org/10.17605/OSF.IO/G6VMZ) ). Code and data: [this https URL](https://github.com/davidgringras/iatrobench)

点击查看摘要

Abstract:Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p 0.0001, while non-colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH = 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.

83. 【2604.07659】Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction

链接https://arxiv.org/abs/2604.07659

作者:Mingchen Li,Jiatan Huang,Zonghai Yao,Hong yu

类目:Computation and Language (cs.CL)

关键词:granular medical context, hold significant promise, Large language models, Large language, high-stakes clinical settings

备注: ACL 2026 (Findings), reviewer score: 3.5,3.5,4

点击查看摘要

Abstract:Large language models (LLMs) hold significant promise for healthcare, yet their reliability in high-stakes clinical settings is often compromised by hallucinations and a lack of granular medical context. While Retrieval Augmented Generation (RAG) can mitigate these issues, standard supervised pipelines require computationally intensive searches over massive external knowledge bases, leading to high latency that is impractical for time-sensitive care. To address this, we introduce Keys to Knowledge (K2K), a novel framework that replaces external retrieval with internal, key-based knowledge access. By encoding essential clinical information directly into the model's parameter space, K2K enables rapid retrieval from internal key-value memory without inference-time overhead. We further enhance retrieval quality through activation-guided probe construction and cross-attention reranking. Experimental results demonstrate that K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.

84. 【2604.07658】Optimal Decay Spectra for Linear Recurrences

链接https://arxiv.org/abs/2604.07658

作者:Yang Cao

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:suboptimal long-range memory, recurrent models offer, models offer linear-time, offer linear-time sequence, linear-time sequence processing

备注

点击查看摘要

Abstract:Linear recurrent models offer linear-time sequence processing but often suffer from suboptimal long-range memory. We trace this to the decay spectrum: for $N$ channels, random initialization collapses the minimum spectral gap to $O(N^{-2})$, yielding sub-exponential error $\exp(-\Omega(N/\log N))$; linear spacing avoids collapse but degrades to $\exp(-O(N/\sqrt{T}))$, practically algebraic over long contexts. We introduce Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log-decay rates, proven minimax optimal at rate $O(\exp(-cN/\log T))$; and (2) Position-Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only $N\log t/\log T$ of $N$ channels are effective at position $t$) by stretching the spectrum to the actual dependency range, sharpening the rate to $O(\exp(-cN/\log t))$. This scaling natively induces fractional invariance: the impulse response becomes scale-free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre-training at 180M-440M scales shows consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: this https URL.

85. 【2604.07655】Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

链接https://arxiv.org/abs/2604.07655

作者:Yue Huang,Haomin Zhuang,Jiayi Ye,Han Bao,Yanbo Wang,Hang Hua,Siyuan Wu,Pin-Yu Chen,Xiangliang Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Hard-gated safety checkers, vendor model spec, prevailing taxonomies, checkers often over-refuse, over-refuse and misalign

备注

点击查看摘要

Abstract:Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.

86. 【2604.07650】How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

链接https://arxiv.org/abs/2604.07650

作者:Chenchen Kuai,Jiwan Jiang,Zihao Zhu,Hao Wang,Keshu Wu,Zihao Li,Yunlong Zhang,Chenxi Liu,Zhengzhong Tu,Zhiwen Fan,Yang Zhou

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language model, seemingly diverse models, ecosystem raises, critical question, rapid growth

备注: 9 pages, 4 figures

点击查看摘要

Abstract:The rapid growth of the large language model (LLM) ecosystem raises a critical question: are seemingly diverse models truly independent? Shared pretraining data, distillation, and alignment pipelines can induce hidden behavioral dependencies, latent entanglement, that undermine multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, which implicitly assume independent signals. In practice, this manifests as correlated reasoning patterns and synchronized failures, where apparent agreement reflects shared error modes rather than independent validation. To address this, we develop a statistical framework for auditing behavioral entanglement among black-box LLMs. Our approach introduces a multi-resolution hierarchy that characterizes the joint failure manifold through two information-theoretic metrics: (i) a Difficulty-Weighted Behavioral Entanglement Index, which amplifies synchronized failures on easy tasks, and (ii) a Cumulative Information Gain (CIG) metric, which captures directional alignment in erroneous responses. Through extensive experiments on 18 LLMs from six model families, we identify widespread behavioral entanglement and analyze its impact on LLM-as-a-judge evaluation. We find that CIG exhibits a statistically significant association with degradation in judge precision, with Spearman coefficient of 0.64 (p 0.001) for GPT-4o-mini and 0.71 (p 0.01) for Llama3-based judges, indicating that stronger dependency corresponds to increased over-endorsement bias. Finally, we demonstrate a practical use case of entanglement through de-entangled verifier ensemble reweighting. By adjusting model contributions based on inferred independence, the proposed method mitigates correlated bias and improves verification performance, achieving up to a 4.5% accuracy gain over majority voting.

87. 【2604.07622】DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

链接https://arxiv.org/abs/2604.07622

作者:Ziyi Wang,Siva Rajesh Kasa,Ankith M S,Santhosh Kumar Kasa,Jiaru Zou,Sumit Negi,Ruqi Zhang,Nan Jiang,Qifan Song

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:accelerating large language, large language model, drafting multiple tokens, effective technique, technique for accelerating

备注: 35 pages, 9 figures, accepted at AISTATS 2026

点击查看摘要

Abstract:Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: this https URL.

88. 【2604.07615】ADAG: Automatically Describing Attribution Graphs

链接https://arxiv.org/abs/2604.07615

作者:Aryaman Arora,Zhengxuan Wu,Jacob Steinhardt,Sarah Schwettmann

类目:Computation and Language (cs.CL)

关键词:model interpretability research, language model interpretability, internal features causally, features causally contributed, interpretability research

备注

点击查看摘要

Abstract:In language model interpretability research, \textbf{circuit tracing} aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce \textbf{ADAG}, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce \textit{attribution profiles} which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer--simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.

89. 【2604.07595】Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback

链接https://arxiv.org/abs/2604.07595

作者:Matthew Penaroza

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Language model agents, similar query starts, Language model, agent retrieves evidence, reason from scratch

备注: 15 pages including appendix, 2 figures, 3 algorithms, framework paper with evaluation protocol

点击查看摘要

Abstract:Language model agents reason from scratch on every query: each time an agent retrieves evidence and deliberates, the chain of thought is discarded and the next similar query starts with no prior insight. This produces lower accuracy and high variance, as the same type of query can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists an agent's per-evidence chain of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that store distilled strategies as flat records indexed by query similarity or appended by recency, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. This backward traversal from evidence inward is a structurally different capability from query-similarity retrieval, because the feedback is tied to the specific evidence the agent is currently examining, not to the query. We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. Together, both graphs form a self-improving feedback loop: accuracy rises and variance collapses over successive runs, with every decision fully traceable through the graph. This improvement requires no retraining; the base model remains frozen and all gains come from context engineering via graph traversal. We formalize the graph structure, traversal algorithms, and feedback mechanisms, and describe a sequential cluster evaluation protocol for measuring accuracy convergence and variance collapse on multi-hop question answering benchmarks.

90. 【2604.07583】CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

链接https://arxiv.org/abs/2604.07583

作者:Mohamed Ehab(1),Ali Hamdi(1),Khaled Shaban(2) ((1) Faculty of Computer Science, October University for Modern Science amp; Arts, Giza, Egypt, (2) Department of Computer Science and Engineering, Qatar University, Doha, Qatar)

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:lowers minority performance, favor majority classes, traditional ensembles favor, ensembles favor majority, Real-world categorization

备注

点击查看摘要

Abstract:Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized).Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority this http URL verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings .With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties .This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

91. 【2604.07569】Learning is Forgetting: LLM Training As Lossy Compression

链接https://arxiv.org/abs/2604.07569

作者:Henry C. Conklin,Tom Hosking,Tan Yi-Chern,Julian Gold,Jonathan D. Cohen,Thomas L. Griffiths,Max Bartolo,Seraphina Goldfarb-Tarrant

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)

关键词:large language models, spaces are structured, increasing prevalence, prevalence of large, large language

备注: 12 page core paper, 16 page Appendix - A shorter version with fewer visuals appears at ICLR 2026

点击查看摘要

Abstract:Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However even across different families of LLMs the optimality of a model's compression, and the information present in it, can predict downstream performance on across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.

92. 【2604.07562】Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

链接https://arxiv.org/abs/2604.07562

作者:Tunazzina Islam

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:induce latent semantic, poorly grounded clusters, large text collections, labeled data, latent semantic structure

备注: Accepted to the Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering this http URL framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.

93. 【2604.07553】R-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

链接https://arxiv.org/abs/2604.07553

作者:Figen Eğin,Aytuğ Onan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Turkish educational videos, multiple human summaries, human summaries, gold-standard summary fully, summary fully automatically

备注: 8 pages, 2 figures, 3 tables. Accepted at the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026), EACL 2026, Rabat, Morocco

点击查看摘要

Abstract:This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.

94. 【2604.07549】EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

链接https://arxiv.org/abs/2604.07549

作者:Xueren Ge,Sahil Murtaza,Anthony Cortez,Homa Alemzadeh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:track evolving evidence, streaming clinical conversations, prediction requires models, diagnosis prediction requires, Conversational diagnosis prediction

备注: Accepted by ACL Findings 2026

点击查看摘要

Abstract:Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction.

95. 【2604.07518】Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

链接https://arxiv.org/abs/2604.07518

作者:Mengdan Zhu,Senhao Cheng,Liang Zhao

类目:Computation and Language (cs.CL)

关键词:Vision-Language Models, Models often struggle, visual information loss, struggle with complex, information loss

备注

点击查看摘要

Abstract:Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose \emph{"Decompose, Look, and Reason" (DLR)}, a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.

96. 【2604.07513】SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation

链接https://arxiv.org/abs/2604.07513

作者:Grace Jiarui Fan,Chengpiao Huang,Tianyi Peng,Kaizheng Wang,Yuhang Wu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:digital twin simulation, Calibrated DIGItal Twin, AI-based persona simulation, recommender systems, market research

备注

点击查看摘要

Abstract:AI-based persona simulation -- often referred to as digital twin simulation -- is increasingly used for market research, recommender systems, and social sciences. Despite their flexibility, large language models (LLMs) often exhibit systematic bias and miscalibration relative to real human behavior, limiting their reliability. Inspired by synthetic control methods from causal inference, we propose SYN-DIGITS (SYNthetic Control Framework for Calibrated DIGItal Twin Simulation), a principled and lightweight calibration framework that learns latent structure from digital-twin responses and transfers it to align predictions with human ground truth. SYN-DIGITS operates as a post-processing layer on top of any LLM-based simulator and thus is model-agnostic. We develop a latent factor model that formalizes when and why calibration succeeds through latent space alignment conditions, and we systematically evaluate ten calibration methods across thirteen persona constructions, three LLMs, and two datasets. SYN-DIGITS supports both individual-level and distributional simulation for previously unseen questions and unobserved populations, with provable error guarantees. Experiments show that SYN-DIGITS achieves up to 50% relative improvements in individual-level correlation and 50--90% relative reductions in distributional discrepancy compared to uncalibrated baselines.

97. 【2604.07506】ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

链接https://arxiv.org/abs/2604.07506

作者:Kai Qin,Liangxin Liu,Yu Liang,Longzheng Wang,Yan Wang,Yueyang Zhang,Long Xia,Zhiyuan Sun,Houde Liu,Daiting Shi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Human Feedback, Reinforcement Learning, Learning from Human, Large Language

备注: Preprint

点击查看摘要

Abstract:Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.

98. 【2604.07490】Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma

链接https://arxiv.org/abs/2604.07490

作者:Xuechen Zhang,Aviv Slobodkin,Joydeep Paul,Mandar Sharma,Samet Oymak,Shravya Shetty,Gautam Prasad

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Dynamics Foundation Model, Population Dynamics Foundation, Large Language Models, enabling general-purpose geospatial, geospatial foundation models

备注

点击查看摘要

Abstract:Representation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs, provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.

99. 【2604.07484】ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

链接https://arxiv.org/abs/2604.07484

作者:Yu Liang,Liangxin Liu,Longzheng Wang,Yan Wang,Yueyang Zhang,Long Xia,Zhiyuan Sun,Daiting Shi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:aligning Large Language, Large Language Models, Large Language, offering greater representational, greater representational capacity

备注: Preprint

点击查看摘要

Abstract:Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocates fine-grained and differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.

100. 【2604.07467】Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

链接https://arxiv.org/abs/2604.07467

作者:Opeyemi Osakuade,Simon King

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Discrete speech units, Discrete speech, derived by quantising, models trained, trained using self-supervised

备注: Accepted at Speech Prosody 2026

点击查看摘要

Abstract:Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.

Comments:
Accepted at Speech Prosody 2026

Subjects:

Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2604.07467 [cs.CL]

(or
arXiv:2604.07467v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.07467

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
101. 【2604.07466】Cross-Tokenizer LLM Distillation through a Byte-Level Interface

链接https://arxiv.org/abs/2604.07466

作者:Avyav Kumar Singh,Yen-Chen Wu,Alexandru Cioba,Alberto Bernacchia,Davide Buffelli

类目:Computation and Language (cs.CL)

关键词:largely unsolved problem, largely unsolved, student language model, CTD, unsolved problem

备注

点击查看摘要

Abstract:Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with--and on several benchmarks surpasses--significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.

102. 【2604.07415】SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

链接https://arxiv.org/abs/2604.07415

作者:Roxana Petcu,Evangelos Kanoulas,Maarten de Rijke

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:probabilistic in nature, nature and perform, perform more reliably, reliably when augmented, Large language models

备注

点击查看摘要

Abstract:Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi-step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model's outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome-only supervision to intermediate reward signals that incentivize planning high-quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally-derived rewards, eliminating the need for external supervision, and moving towards autonomous information-intensive reasoning. Experiments on seven benchmarks show that rewarding intermediate reasoning steps with intrinsic rewards leads to more robust reasoning traces in both QA and multi-hop QA datasets over using only outcome rewards. SubSearch can help in building reasoning traces that allow agents to better integrate search engines for complex query answering, while offering a data-efficient alternative to supervised process modeling.

103. 【2604.07394】Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

链接https://arxiv.org/abs/2604.07394

作者:Quantong Qiu,Zhiyi Hong,Yi Yang,Haitian Wang,Kebin Liu,Qingqing Dang,Juntao Li,Min Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:severe scalability bottleneck, standard attention mechanisms, attention mechanisms presents, quadratic computational complexity, mechanisms combining Full

备注

点击查看摘要

Abstract:The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8$\times$A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to $2.8\times$ and $2.0\times$ in the prefill and decode stages.

104. 【2604.07357】Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

链接https://arxiv.org/abs/2604.07357

作者:Youcef Soufiane Gheffari,Oussama Mustapha Benouddane,Samiya Silarbi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

关键词:building human-centered applications, active research area, research area due, Recognizing emotions, Arabic Speech Emotion

备注: 7 pages, 4 figures. Master's thesis work, University of Science and Technology of Oran - Mohamed Boudiaf (USTO-MB)

点击查看摘要

Abstract:Recognizing emotions from speech using machine learning has become an active research area due to its importance in building human-centered applications. However, while many studies have been conducted in English, German, and other European and Asian languages, research in Arabic remains scarce because of the limited availability of annotated datasets. In this paper, we present an Arabic Speech Emotion Recognition (SER) system based on a hybrid CNN-Transformer architecture. The model leverages convolutional layers to extract discriminative spectral features from Mel-spectrogram inputs and Transformer encoders to capture long-range temporal dependencies in speech. Experiments were conducted on the EYASE (Egyptian Arabic speech emotion) corpus, and the proposed model achieved 97.8% accuracy and a macro F1-score of 0.98. These results demonstrate the effectiveness of combining convolutional feature extraction with attention-based modeling for Arabic SER and highlight the potential of Transformer-based approaches in low-resource languages.

105. 【2604.07354】Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

链接https://arxiv.org/abs/2604.07354

作者:Berkin Durmus,Chen Cen,Eduardo Pacheco,Arda Okan,Atila Orhon

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

关键词:high-stakes domains suggest, adoption in high-stakes, high-stakes domains, domains suggest, industrial benchmarks

备注

点击查看摘要

Abstract:The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks.1 In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.

106. 【2604.05702】Dialogue Act Patterns in GenAI-Mediated L2 Oral Practice: A Sequential Analysis of Learner-Chatbot Interactions

链接https://arxiv.org/abs/2604.05702

作者:Liqun He,Shijun(Cindy)Chen,Mutlu Cukurova,Manolis Mavrikis

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:gains remain underexplored, offer scalable opportunities, interactional processes related, learners' gains remain, chatbots offer scalable

备注: Accepted for publication as a full paper (Main Track) at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

点击查看摘要

Abstract:While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners' gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.

107. 【2604.04988】Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

链接https://arxiv.org/abs/2604.04988

作者:Longsheng Zhou,Yu Shen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:wall-clock inference time, reliably predict wall-clock, predict wall-clock inference, common compression proxies, requires trading accuracy

备注: 7 pages, submitted to IJCNN

点击查看摘要

Abstract:Modern deployment often requires trading accuracy for efficiency under tight CPU and memory constraints, yet common compression proxies such as parameter count or FLOPs do not reliably predict wall-clock inference time. In particular, unstructured sparsity can reduce model storage while failing to accelerate (and sometimes slightly slowing down) standard CPU execution due to irregular memory access and sparse kernel overhead. Motivated by this gap between compression and acceleration, we study a practical, ordered pipeline that targets measured latency by combining three widely used techniques: unstructured pruning, INT8 quantization-aware training (QAT), and knowledge distillation (KD). Empirically, INT8 QAT provides the dominant runtime benefit, while pruning mainly acts as a capacity-reduction pre-conditioner that improves the robustness of subsequent low-precision optimization; KD, applied last, recovers accuracy within the already constrained sparse INT8 regime without changing the deployment form. We evaluate on CIFAR-10/100 using three backbones (ResNet-18, WRN-28-10, and VGG-16-BN). Across all settings, the ordered pipeline achieves a stronger accuracy-size-latency frontier than any single technique alone, reaching 0.99-1.42 ms CPU latency with competitive accuracy and compact checkpoints. Controlled ordering ablations with a fixed 20/40/40 epoch allocation further confirm that stage order is consequential, with the proposed ordering generally performing best among the tested permutations. Overall, our results provide a simple guideline for edge deployment: evaluate compression choices in the joint accuracy-size-latency space using measured runtime, rather than proxy metrics alone.

108. 【2604.08504】Differentially Private Language Generation and Identification in the Limit

链接https://arxiv.org/abs/2604.08504

作者:Anay Mehrotra,Grigoris Velegkas,Xifan Yu,Felix Zhou

类目:Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

关键词:Kleinberg and Mullainathan, introduced by Kleinberg, model recently introduced, initiate the study, recently introduced

备注

点击查看摘要

Abstract:We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings while protecting the privacy of the entire input sequence. Our first main result is that for countable collections of languages, privacy comes at no qualitative cost: we provide an $\varepsilon$-differentially-private algorithm that generates in the limit from any countable collection. This stands in contrast to many learning settings where privacy renders learnability impossible. However, privacy does impose a quantitative cost: there are finite collections of size $k$ for which uniform private generation requires $\Omega(k/\varepsilon)$ samples, whereas just one sample suffices non-privately. We then turn to the harder problem of language identification in the limit. Here, we show that privacy creates fundamental barriers. We prove that no $\varepsilon$-DP algorithm can identify a collection containing two languages with an infinite intersection and a finite set difference, a condition far stronger than the classical non-private characterization of identification. Next, we turn to the stochastic setting where the sample strings are sampled i.i.d. from a distribution (instead of being generated by an adversary). Here, we show that private identification is possible if and only if the collection is identifiable in the adversarial model. Together, our results establish new dimensions along which generation and identification differ and, for identification, a separation between adversarial and stochastic settings induced by privacy constraints.

Subjects:

Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Cite as:
arXiv:2604.08504 [stat.ML]

(or
arXiv:2604.08504v1 [stat.ML] for this version)

https://doi.org/10.48550/arXiv.2604.08504

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
109. 【2604.08003】Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

链接https://arxiv.org/abs/2604.08003

作者:Yuan Xie,Jiaqi Song,Guang Qiu,Xianliang Wang,Ming Lei,Jie Gao,Jie Wu

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词:Integrating large language, large language models, Integrating large, automatic speech recognition, LLM-based ASR models

备注

点击查看摘要

Abstract:Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.

110. 【2604.07591】From Ground Truth to Measurement: A Statistical Framework for Human Labeling

链接https://arxiv.org/abs/2604.07591

作者:Robert Chew,Stephanie Eckman,Christoph Kern,Frauke Kreuter

类目:Methodology (stat.ME); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:Supervised machine learning, labeled data provide, data provide accurate, Supervised machine, machine learning assumes

备注

点击查看摘要

Abstract:Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.

信息检索

1. 【2604.08079】Search Changes Consumers' Minds: How Recognizing Gaps Drives Sustainable Choices

链接https://arxiv.org/abs/2604.08079

作者:Frans van der Sluis,Leif Azzopardi

类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

关键词:behaviour remains challenging, shop responsibly, remains challenging, http URL, growing desire

备注: 17 pages, 5 figures, supplementary appendix. Accepted at CHIIR '25 (2025 ACM SIGIR Conference on Human Information Interaction and Retrieval). Peer reviewed

点击查看摘要

Abstract:Despite a growing desire among consumers to shop responsibly, translating this intention into behaviour remains challenging. Previous work has identified that information seeking (or lack thereof) is a contributing factor to this intention-behaviour this http URL this paper, we hypothesize that searching can bridge this gap - helping consumers to make purchasing decisions that are better aligned with their values. We conducted a task-based study with 308 participants, asking them to search for information on one of eight ethical aspects regarding a product they were actively shopping for. Our findings show that actively searching for such information led to an overall increase in the importance participants' assigned to ethical this http URL, it was the recognition and understanding of ethical considerations, rather than ethical intentions or search activity, that drove shifts towards more responsible purchasing decisions. Participants who acknowledged and filled knowledge gaps in their decision making showed significant behaviour change, including increased searching and a stronger desire to alter their future shopping habits. We conclude that responsible consumption can be considered a partial information problem, where awareness of one's own knowledge limitations may be the catalyst needed for meaningful consumer behaviour change.

2. 【2604.08011】Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation

链接https://arxiv.org/abs/2604.08011

作者:Yantao Yu,Sen Qiao,Lei Shen,Bing Wang,Xiaoyi Zeng

类目:Information Retrieval (cs.IR)

关键词:motivated recommender systems, leverage massive behavioral, Recent progress, massive behavioral data, increase model depth

备注: Accepted as a full paper at SIGIR 2026. 11 pages, 6 figures

点击查看摘要

Abstract:Recent progress in scaling large models has motivated recommender systems to increase model depth and capacity to better leverage massive behavioral data. However, recommendation inputs are high-dimensional and extremely sparse, and simply scaling dense backbones (e.g., deep MLPs) often yields diminishing returns or even performance degradation. Our analysis of industrial CTR models reveals a phenomenon of implicit connection sparsity: most learned connection weights tend towards zero, while only a small fraction remain prominent. This indicates a structural mismatch between dense connectivity and sparse recommendation data; by compelling the model to process vast low-utility connections instead of valid signals, the dense architecture itself becomes the primary bottleneck to effective pattern modeling. We propose \textbf{SSR} (Explicit \textbf{S}parsity for \textbf{S}calable \textbf{R}ecommendation), a framework that incorporates sparsity explicitly into the architecture. SSR employs a multi-view "filter-then-fuse" mechanism, decomposing inputs into parallel views for dimension-level sparse filtering followed by dense fusion. Specifically, we realize the sparsity via two strategies: a Static Random Filter that achieves efficient structural sparsity via fixed dimension subsets, and Iterative Competitive Sparse (ICS), a differentiable dynamic mechanism that employs bio-inspired competition to adaptively retain high-response dimensions. Experiments on three public datasets and a billion-scale industrial dataset from AliExpress (a global e-commerce platform) show that SSR outperforms state-of-the-art baselines under similar budgets. Crucially, SSR exhibits superior scalability, delivering continuous performance gains where dense models saturate.

3. 【2604.07992】Context-Aware Disentanglement for Cross-Domain Sequential Recommendation: A Causal View

链接https://arxiv.org/abs/2604.07992

作者:Xingzi Wang,Qingtian Bian,Hui Fang

类目:Information Retrieval (cs.IR)

关键词:Cross-Domain Sequential Recommendation, en-hance recommendation quality, offering effective solutions, Sequential Recommendation, Cross-Domain Sequential

备注

点击查看摘要

Abstract:Cross-Domain Sequential Recommendation (CDSR) aims to en-hance recommendation quality by transferring knowledge across domains, offering effective solutions to data sparsity and cold-start issues. However, existing methods face three major limitations: (1) they overlook varying contexts in user interaction sequences, resulting in spurious correlations that obscure the true causal relationships driving user preferences; (2) the learning of domain- shared and domain-specific preferences is hindered by gradient conflicts between domains, leading to a seesaw effect where performance in one domain improves at the expense of the other; (3) most methods rely on the unrealistic assumption of substantial user overlap across domains. To address these issues, we propose CoDiS, a context-aware disentanglement framework grounded in a causal view to accurately disentangle domain-shared and domain-specific preferences. Specifically, Our approach includes a variational context adjustment method to reduce confounding effects of contexts, expert isolation and selection strategies to resolve gradient conflict, and a variational adversarial disentangling module for the thorough disentanglement of domain-shared and domain-specific representations. Extensive experiments on three real-world datasets demonstrate that CoDiS consistently outperforms state-of-the-art CDSR baselines with statistical significance. Code is available at:this https URL.

4. 【2604.07989】Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support

链接https://arxiv.org/abs/2604.07989

作者:Jing Xu,Jiarui Hu,Zhihao Shuai,Yiyun Chen,Weikai Yang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:communicating data-driven stories, scratch remains challenging, data-driven stories, remains challenging, powerful medium

备注: Project homepage: [this https URL](https://infographicretrieval.github.io/)

点击查看摘要

Abstract:While infographics have become a powerful medium for communicating data-driven stories, authoring them from scratch remains challenging, especially for novice users. Retrieving relevant exemplars from a large corpus can provide design inspiration and promote reuse, substantially lowering the barrier to infographic authoring. However, effective retrieval is difficult because users often express design intent in ambiguous natural language, while infographics embody rich and multi-faceted visual designs. As a result, keyword-based search often fails to capture design intent, and general-purpose vision-language retrieval models trained on natural images are ill-suited to the text-heavy, multi-component nature of infographics. To address these challenges, we develop an intent-aware infographic retrieval framework that better aligns user queries with infographic designs. We first conduct a formative study of how people describe infographics and derive an intent taxonomy spanning content and visual design facets. This taxonomy is then leveraged to enrich and refine free-form user queries, guiding the retrieval process with intent-specific cues. Building on the retrieved exemplars, users can adapt the designs to their own data with high-level edit intents, supported by an interactive agent that performs low-level adaptation. Both quantitative evaluations and user studies are conducted to demonstrate that our method improves retrieval quality over baseline methods while better supporting intent satisfaction and efficient infographic authoring.

5. 【2604.07985】Rag Performance Prediction for Question Answering

链接https://arxiv.org/abs/2604.07985

作者:Or Dado,David Carmel. Oren Kurland

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:retrieval augmented generation, augmented generation, address the task, task of predicting, predicting the gain

备注: 12 pages. 2 figures. 1 table

点击查看摘要

Abstract:We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.

6. 【2604.07930】Unified Supervision for Walmarts Sponsored Search Retrieval via Joint Semantic Relevance and Behavioral Engagement Modeling

链接https://arxiv.org/abs/2604.07930

作者:Shasvat Desai,Md Omar Faruk Rokon,Jhalak Nilesh Acharya,Isha Shah,Hong Yao,Utkarsh Porwal,Kuang-chih Lee

类目:Information Retrieval (cs.IR)

关键词:Modern search systems, Modern search, fast first stage, massive catalog, Modern

备注: Accepted to SIGIR 2026, Industry Track

点击查看摘要

Abstract:Modern search systems rely on a fast first stage retriever to fetch relevant items from a massive catalog of items. Deployed search systems often use user engagement signals to supervise bi-encoder retriever training at scale, because these signals are continuously logged from real traffic and require no additional annotation effort. However, engagement is an imperfect proxy for semantic relevance. Items may receive interactions due to popularity, promotion, attractive visuals, titles, or price, despite weak query-item relevance. These limitations are further accentuated in Walmart's e-commerce sponsored search. User engagement on ad items is often structurally sparse because the frequency with which an ad is shown depends on factors beyond relevance such as whether the advertiser is currently running that ad, the outcome of the auction for available ad slots, bid competitiveness, and advertiser budget. Thus, even highly relevant query ad pairs can have limited engagement signals simply due to limited impressions. We propose a bi-encoder training framework for Walmart's sponsored search retrieval in e-commerce that uses semantic relevance as the primary supervision signal, with engagement used only as a preference signal among relevant items. Concretely, we construct a context-rich training target by combining 1. graded relevance labels from a cascade of cross-encoder teacher models, 2. a multichannel retrieval prior score derived from the rank positions and cross-channel agreement of retrieval systems running in production, and 3. user engagement applied only to semantically relevant items to refine preferences. Our approach outperforms the current production system in both offline evaluation and online AB tests, yielding consistent gains in average relevance and NDCG.

7. 【2604.07929】Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems

链接https://arxiv.org/abs/2604.07929

作者:Maria Movin,Claudia Hauff,Aron Henriksson,Panagiotis Papapetrou

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:LLM-driven GUI agents, automate workflows, workflows and simulate, LLM-driven GUI, GUI agents

备注

点击查看摘要

Abstract:LLM-driven GUI agents are increasingly used in production systems to automate workflows and simulate users for evaluation and optimization. Yet most GUI-agent evaluations emphasize task success and provide limited evidence on whether agents interact in human-like ways. We present a trace-level evaluation framework that compares human and agent behavior across (i) task outcome and effort, (ii) query formulation, and (iii) navigation across interface states. We instantiate the framework in a controlled study in a production audio-streaming search application, where 39 participants and a state-of-the-art GUI agent perform ten multi-hop search tasks. The agent achieves task success comparable to participants and generates broadly aligned queries, but follows systematically different navigation strategies: participants exhibit content-centric, exploratory behavior, while the agent is more search-centric and low-branching. These results show that outcome and query alignment do not imply behavioral alignment, motivating trace-level diagnostics when deploying GUI agents as proxies for users in production search systems.

8. 【2604.07869】Ensembles at Any Cost? Accuracy-Energy Trade-offs in Recommender Systems

链接https://arxiv.org/abs/2604.07869

作者:Jannik Nitschke,Lukas Wegmeth,Joeran Beel

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:combining multiple models, Top Performers, Top Performers ensemble, methods are frequently, combining multiple

备注

点击查看摘要

Abstract:Ensemble methods are frequently used in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize primarily for accuracy and robustness rather than for energy efficiency. This paper measures accuracy energy trade offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: 1. explicit rating prediction with Surprise (RMSE) and 2. implicit feedback ranking with LensKit (NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole system energy is measured with EMERS using a smart plug and converted to CO2 equivalents. Across settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% at an 18.8% energy overhead over SVD++. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% with 103% additional energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), increasing emissions from 2.6 to 53.8 mg CO2 equivalents, and LensKit ensembles fail due to memory limits. Overall, selective ensembles are more energy efficient than exhaustive averaging,

9. 【2604.07863】ask-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory

链接https://arxiv.org/abs/2604.07863

作者:Saman Forouzandeh,Kamal Berahmand,Mahdi Jalili

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Retrieving relevant observations, Retrieving relevant, evolving task state, web interaction histories, long multi-modal web

备注: The 49th International ACM SIGIR Conference on Research and Development in Information Retrieval

点击查看摘要

Abstract:Retrieving relevant observations from long multi-modal web interaction histories is challenging because relevance depends on the evolving task state, modality (screenshots, HTML text, structured signals), and temporal distance. Prior approaches typically rely on static similarity thresholds or fixed-capacity buffers, which fail to adapt relevance to the current task context. We propose \textbf{ACGM}, a learned graph-memory retriever that constructs \emph{task-adaptive} relevance graphs over agent histories using policy-gradient optimization from downstream task success. ACGM captures heterogeneous temporal dynamics with modality-specific decay (visual decays $4.3\times$ faster than text: $\lambda_v{=}0.47$ vs.\ $\lambda_x{=}0.11$) and learns sparse connectivity (3.2 edges/node), enabling efficient $O(\log T)$ retrieval. Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to \textbf{82.7 nDCG@10} (+9.3 over GPT-4o, $p{}0.001$) and \textbf{89.2\% Precision@10} (+7.7), outperforming 19 strong dense, re-ranking, multi-modal, and graph-based baselines. Code to reproduce our results is available at{\color{blue}\href{this https URL}{Saman Forouzandeh}}.

10. 【2604.07851】ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

链接https://arxiv.org/abs/2604.07851

作者:Jiani Huang,Shijie Wang,Liangbo Ning,Wenqi Fan,Qing Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:handle complex queries, intelligent recommendation assistants, Reasoning-aware Advantage Estimation, Preference Alignment Scores, Online Curriculum Scheduler

备注: Accepted by ACL 2026

点击查看摘要

Abstract:With the rise of LLMs, there is an increasing need for intelligent recommendation assistants that can handle complex queries and provide personalized, reasoning-driven recommendations. LLM-based recommenders show potential but face challenges in multi-step reasoning, underscoring the need for reasoning-augmented systems. To address this gap, we propose ReRec, a novel reinforcement fine-tuning (RFT) framework designed to improve LLM reasoning in complex recommendation tasks. Our framework introduces three key components: (1) Dual-Graph Enhanced Reward Shaping, integrating recommendation metrics like NDCG@K with Query Alignment and Preference Alignment Scores to provide fine-grained reward signals for LLM optimization; (2) Reasoning-aware Advantage Estimation, which decomposes LLM outputs into reasoning segments and penalizes incorrect steps to enhance reasoning of recommendation; and (3) Online Curriculum Scheduler, dynamically assess query difficulty and organize training curriculum to ensure stable learning during RFT. Experiments demonstrate that ReRec outperforms state-of-the-art baselines and preserves core abilities like instruction-following and general knowledge. Our codes are available at this https URL.

11. 【2604.07825】Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders

链接https://arxiv.org/abs/2604.07825

作者:Jaehyun Lee,Sanghwan Jang,SeongKu Kang,Hwanjo Yu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Large language models, powerful training-free recommenders, Large language, training-free recommenders, recently emerged

备注: SIGIR 2026 Accept

点击查看摘要

Abstract:Large language models (LLMs) have recently emerged as powerful training-free recommenders. However, their knowledge of individual items is inevitably uneven due to imbalanced information exposure during pretraining, a phenomenon we refer to as knowledge gap problem. To address this, most prior methods have employed a naive uniform augmentation that appends external information for every item in the input prompt. However, this approach not only wastes limited context budget on redundant augmentation for well-known items but can also hinder the model's effective reasoning. To this end, we propose KnowSA_CKP (Knowledge-aware Selective Augmentation with Comparative Knowledge Probing) to mitigate the knowledge gap problem. KnowSA_CKP estimates the LLM's internal knowledge by evaluating its capability to capture collaborative relationships and selectively injects additional information only where it is most needed. By avoiding unnecessary augmentation for well-known items, KnowSA_CKP focuses on items that benefit most from knowledge supplementation, thereby making more effective use of the context budget. KnowSA_CKP requires no fine-tuning step, and consistently improves both recommendation accuracy and context efficiency across four real-world datasets.

12. 【2604.07788】PeReGrINE: Evaluating Personalized Review Fidelity with User Item Graph Context

链接https://arxiv.org/abs/2604.07788

作者:Steven Au,Baihan Lin

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:grounded in graph-structured, personalized review generation, review generation grounded, User Style Parameter, restructures Amazon Reviews

备注

点击查看摘要

Abstract:We introduce PeReGrINE, a benchmark and evaluation framework for personalized review generation grounded in graph-structured user--item evidence. PeReGrINE restructures Amazon Reviews 2023 into a temporally consistent bipartite graph, where each target review is conditioned on bounded evidence from user history, item context, and neighborhood interactions under explicit temporal cutoffs. To represent persistent user preferences without conditioning directly on sparse raw histories, we compute a User Style Parameter that summarizes each user's linguistic and affective tendencies over prior reviews. This setup supports controlled comparison of four graph-derived retrieval settings: product-only, user-only, neighbor-only, and combined evidence. Beyond standard generation metrics, we introduce Dissonance Analysis, a macro-level evaluation framework that measures deviation from expected user style and product-level consensus. We also study visual evidence as an auxiliary context source and find that it can improve textual quality in some settings, while graph-derived evidence remains the main driver of personalization and consistency. Across product categories, PeReGrINE offers a reproducible way to study how evidence composition affects review fidelity, personalization, and grounding in retrieval-conditioned language models.

13. 【2604.07739】Efficient Dataset Selection for Continual Adaptation of Generative Recommenders

链接https://arxiv.org/abs/2604.07739

作者:Cathy Jiao,Juan Elenter,Praveen Ravichandran,Bernd Huber,Joseph Cauteruccio,Todd Wasson,Timothy Heath,Chenyan Xiong,Mounia Lalmas,Paul Bennett

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:full retraining impractical, large-scale streaming environments, streaming environments makes, environments makes frequent, makes frequent full

备注: ICLR 2026 CAO Workshop (Oral)

点击查看摘要

Abstract:Recommendation systems must continuously adapt to evolving user behavior, yet the volume of data generated in large-scale streaming environments makes frequent full retraining impractical. This work investigates how targeted data selection can mitigate performance degradation caused by temporal distributional drift while maintaining scalability. We evaluate a range of representation choices and sampling strategies for curating small but informative subsets of user interaction data. Our results demonstrate that gradient-based representations, coupled with distribution-matching, improve downstream model performance, achieving training efficiency gains while preserving robustness to drift. These findings highlight data curation as a practical mechanism for scalable monitoring and adaptive model updates in production-scale recommendation systems.

14. 【2604.07649】LitXBench: A Benchmark for Extracting Experiments from Scientific Literature

链接https://arxiv.org/abs/2604.07649

作者:Curtis Chong,Jorge Colindres

类目:Information Retrieval (cs.IR)

关键词:facilitate scientific discovery, Aggregating experimental data, Aggregating experimental, property prediction models, scientific discovery

备注

点击查看摘要

Abstract:Aggregating experimental data from papers enables materials scientists to build better property prediction models and to facilitate scientific discovery. Recently, interest has grown in extracting not only single material properties but also entire experimental measurements. To support this shift, we introduce LitXBench, a framework for benchmarking methods that extract experiments from literature. We also present LitXAlloy, a dense benchmark comprising 1426 total measurements from 19 alloy papers. By storing the benchmark's entries as Python objects, rather than text-based formats such as CSV or JSON, we improve auditability and enable programmatic data validation. We find that frontier language models, such as Gemini 3.1 Pro Preview, outperform existing multi-turn extraction pipelines by up to 0.37 F1. Our results suggest that this performance gap arises because extraction pipelines associate measurements with compositions rather than the processing steps that define a material.

15. 【2604.07590】DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

链接https://arxiv.org/abs/2604.07590

作者:Valeriy Kovalskiy,Nikita Belov,Nikita Miteyko,Igor Reshetnikov,Max Maximov

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:external knowledge sources, ground large language, Naive RAG pipelines, large language models, ground large

备注: 11 pages, 4 figures, 2 links, link to HF [this https URL](https://huggingface.co/datasets/redmadrobot-rnd/dcd) , link to GIT [this https URL](https://github.com/redmadrobot-rnd/dcd)

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, Naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.

16. 【2604.07585】Don't Measure Once: Measuring Visibility in AI Search (GEO)

链接https://arxiv.org/abs/2604.07585

作者:Julius Schulte,Malte Bleeker,Philipp Kaufmann

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:large language model-based, language model-based chat, model-based chat systems, generative engine optimization, access and retrieval

备注: 19 pages, 7 figures, 17 tables. Comments welcome!

点击查看摘要

Abstract:As large language model-based chat systems become increasingly widely used, generative engine optimization (GEO) has emerged as an important problem for information access and retrieval. In classical search engines, results are comparatively transparent and stable: a single query often provides a representative snapshot of where a page or brand appears relative to competitors. The inherent probabilistic nature of AI search changes this paradigm. Answers can vary across runs, prompts, and time, making one-off observations unreliable. Drawing on empirical studies, our findings underscore the need for repeated measurements to assess a brand's GEO performance and to characterize visibility as a distribution rather than a single-point outcome.

17. 【2604.07572】HiMARS: Hybrid multi-objective algorithms for recommender systems

链接https://arxiv.org/abs/2604.07572

作者:Elaheh Lotfian,Alireza Kabgani

类目:Information Retrieval (cs.IR)

关键词:generating high-quality recommendation, Non-dominated Neighbor Immune, Non-dominated Sorting Genetic, crucial for generating, generating high-quality

备注

点击查看摘要

Abstract:In recommender systems, it is well-established that both accuracy and diversity are crucial for generating high-quality recommendation lists. However, achieving a balance between these two typically conflicting objectives remains a significant challenge. In this work, we address this challenge by proposing four novel hybrid multi-objective algorithms inspired by the Non-dominated Neighbor Immune Algorithm (NNIA), Archived Multi-Objective Simulated Annealing (AMOSA), and Non-dominated Sorting Genetic Algorithm-II (NSGA-II), aimed at simultaneously enhancing both accuracy and diversity through multi-objective optimization. Our approach follows a three-stage process: First, we generate an initial top-$k$ list using item-based collaborative filtering for a given user. Second, we solve a bi-objective optimization problem to identify Pareto-optimal top-$s$ recommendation lists, where $s \ll k$, using the proposed hybrid algorithms. Finally, we select an optimal personalized top-$s$ list from the Pareto-optimal solutions. We evaluate the performance of the proposed algorithms on real-world datasets and compare them with existing methods using conventional metrics in recommender systems such as accuracy, diversity, and novelty. Additionally, we assess the quality of the Pareto frontiers using metrics including the spacing metric, mean ideal distance, diversification metric, and spread of non-dominated solutions. Results demonstrate that some of our proposed algorithms significantly improve both accuracy and diversity, offering a novel contribution to multi-objective optimization in recommender systems.

18. 【2604.07420】Dual-Rerank: Fusing Causality and Utility for Industrial Generative Reranking

链接https://arxiv.org/abs/2604.07420

作者:Chao Zhang,Shuai Lin,ChengLei Dai,Ye Qian,Fan Mingyang,Yi Zhang,Yi Wang,Jingwei Zhuo

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:million daily active, search queries daily, daily active users, Kuaishou serves, million daily

备注

点击查看摘要

Abstract:Kuaishou serves over 400 million daily active users, processing hundreds of millions of search queries daily against a repository of tens of billions of short videos. As the final decision layer, the reranking stage determines user experience by optimizing whole-page utility. While traditional score-and-sort methods fail to capture combinatorial dependencies, Generative Reranking offers a superior paradigm by directly modeling the permutation probability. However, deploying Generative Reranking in such a high-stakes environment faces a fundamental dual dilemma: 1) the structural trade-off where Autoregressive (AR) models offer superior Sequential modeling but suffer from prohibitive latency, versus Non-Autoregressive (NAR) models that enable efficiency but lack dependency capturing; 2) the optimization gap where Supervised Learning faces challenges in directly optimizing whole-page utility, while Reinforcement Learning (RL) struggles with instability in high-throughput data streams. To resolve this, we propose Dual-Rerank, a unified framework designed for industrial reranking that bridges the structural gap via Sequential Knowledge Distillation and addresses the optimization gap using List-wise Decoupled Reranking Optimization (LDRO) for stable online RL. Extensive A/B testing on production traffic demonstrates that Dual-Rerank achieves State-of-the-Art performance, significantly improving User satisfaction and Watch Time while drastically reducing inference latency compared to AR baselines.

19. 【2604.07419】ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

链接https://arxiv.org/abs/2604.07419

作者:Hao Yang,Yifan Ji,Zhipeng Xu,Zhenghao Liu,Yukun Yan,Zulong Chen,Shuo Wang,Yu Gu,Ge Yu

类目:Information Retrieval (cs.IR)

关键词:Visual document retrieval, Visual document, Visual, document, document pages relevant

备注

点击查看摘要

Abstract:Visual document retrieval aims to retrieve a set of document pages relevant to a query from visually rich collections. Existing methods often employ Vision-Language Models (VLMs) to encode queries and visual pages into a shared embedding space, which is then optimized via contrastive training. However, during visual document representation, localized evidence is usually scattered across complex document layouts, making it difficult for retrieval models to capture crucial cues for effective embedding learning. In this paper, we propose Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. Specifically, ReAlign employs a superior VLM to identify query-related regions on a page and then generates a query-aware description grounding the cropped visual regions. The retriever is then trained using these region-focused descriptions to align the semantics between queries and visual documents by encouraging the document ranking distribution induced by the region-focused descriptions to match that induced by the original query. Experiments on diverse visually rich document retrieval benchmarks demonstrate that ReAlign consistently improves visual document retrieval performance on both in-domain and out-of-domain datasets, achieving up to 2% relative improvements. Moreover, the advantages of ReAlign generalize across different VLM backbones by guiding models to better focus their attention on critical visual cues for document representation. All code and datasets are available at this https URL.

20. 【2604.07415】SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

链接https://arxiv.org/abs/2604.07415

作者:Roxana Petcu,Evangelos Kanoulas,Maarten de Rijke

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:probabilistic in nature, nature and perform, perform more reliably, reliably when augmented, Large language models

备注

点击查看摘要

Abstract:Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi-step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model's outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome-only supervision to intermediate reward signals that incentivize planning high-quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally-derived rewards, eliminating the need for external supervision, and moving towards autonomous information-intensive reasoning. Experiments on seven benchmarks show that rewarding intermediate reasoning steps with intrinsic rewards leads to more robust reasoning traces in both QA and multi-hop QA datasets over using only outcome rewards. SubSearch can help in building reasoning traces that allow agents to better integrate search engines for complex query answering, while offering a data-efficient alternative to supervised process modeling.

21. 【2604.07392】Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making

链接https://arxiv.org/abs/2604.07392

作者:Fan Zhaowen

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR); Robotics (cs.RO)

关键词:Autonomous agents operating, Autonomous agents, safety-critical environments require, physically grounded, agents operating

备注: This is the initial version (v1) released to establish priority for the proposed framework. Subsequent versions will include expanded experimental validation and exhaustive hardware benchmarking

点击查看摘要

Abstract:Autonomous agents operating in dynamic and safety-critical environments require decision-making frameworks that are both computationally efficient and physically grounded. However, many existing approaches rely on end-to-end learning, which often lacks interpretability and explicit mechanisms for ensuring consistency with physical constraints. In this work, we propose an event-centric world modeling framework with memory-augmented retrieval for embodied decision-making. The framework represents the environment as a structured set of semantic events, which are encoded into a permutation-invariant latent representation. Decision-making is performed via retrieval over a knowledge bank of prior experiences, where each entry associates an event representation with a corresponding maneuver. The final action is computed as a weighted combination of retrieved solutions, providing a transparent link between decision and stored experiences. The proposed design enables structured abstraction of dynamic environments and supports interpretable decision-making through case-based reasoning. In addition, incorporating physics-informed knowledge into the retrieval process encourages the selection of maneuvers that are consistent with observed system dynamics. Experimental evaluation in UAV flight scenarios demonstrates that the framework operates within real-time control constraints while maintaining interpretable and consistent behavior.

22. 【2604.07364】Improving Search Suggestions for Alphanumeric Queries

链接https://arxiv.org/abs/2604.07364

作者:Samarth Agrawal,Jayanth Yetukuri,Diptesh Kanojia,Qunzhi Zhou,Zhe Wu

类目:Information Retrieval (cs.IR)

关键词:manufacturer part numbers, part numbers, manufacturer part, codes are ubiquitous, ubiquitous in e-commerce

备注: Published in Advances in Information Retrieval, 48th European Conference on Information Retrieval, ECIR 2026

点击查看摘要

Abstract:Alphanumeric identifiers such as manufacturer part numbers (MPNs), SKUs, and model codes are ubiquitous in e-commerce catalogs and search. These identifiers are sparse, non linguistic, and highly sensitive to tokenization and typographical variation, rendering conventional lexical and embedding based retrieval methods ineffective. We propose a training free, character level retrieval framework that encodes each alphanumeric sequence as a fixed length binary vector. This representation enables efficient similarity computation via Hamming distance and supports nearest neighbor retrieval over large identifier corpora. An optional re-ranking stage using edit distance refines precision while preserving latency guarantees. The method offers a practical and interpretable alternative to learned dense retrieval models, making it suitable for production deployment in search suggestion generation systems. Significant gains in business metrics in the A/B test further prove utility of our approach.

23. 【2604.07351】FedUTR: Federated Recommendation with Augmented Universal Textual Representation for Sparse Interaction Scenarios

链接https://arxiv.org/abs/2604.07351

作者:Kang Fu,Honglei Zhang,Zikai Zhang,Jundong Chen,Xin Zhou,Zhiqi Shen,Dusit Niyato,Yidong Li

类目:Information Retrieval (cs.IR)

关键词:on-device privacy-preserving paradigm, attracting considerable attention, considerable attention driven, Federated recommendations, privacy-preserving paradigm

备注

点击查看摘要

Abstract:Federated recommendations (FRs) have emerged as an on-device privacy-preserving paradigm, attracting considerable attention driven by rising demands for data security. Existing FRs predominantly adapt ID embeddings to represent items, making the quality of item embeddings entirely dependent on users' historical behaviors. However, we empirically observe that this pattern leads to suboptimal recommendation performance under high data sparsity scenarios, due to its strong reliance on historical interactions. To address this issue, we propose a novel method named FedUTR, which incorporates item textual representations as a complement to interaction behaviors, aiming to enhance model performance under high data sparsity. Specifically, we utilize textual modality as the universal representation to capture generic item knowledge, and design a Collaborative Information Fusion Module (CIFM) to complement each user's personalized interaction information. Besides, we introduce a Local Adaptation Module (LAM) that adaptively exploits the off-the-shelf local model to efficiently preserve client-specific personalized preferences. Moreover, we propose a variant of FedUTR, termed FedUTR-SAR, which incorporates a sparsity-aware resnet component to granularly balance universal and personalized information. The convergence analysis proves theoretical guarantees for the effectiveness of FedUTR. Extensive experiments on four real-world datasets show that our method achieves superior performance, with improvements of up to 59% across all datasets compared to the SOTA baselines.

计算机视觉

1. 【2604.08548】ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets

链接https://arxiv.org/abs/2604.08548

作者:Xiaoben Li,Jingyi Wu,Zeyu Cai,Yu Siyuan,Boqian Li,Yuliang Xiu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human body fitting, aligns parametric body, SMPL to raw, clothed humans, parametric body models

备注: Page: [this https URL](https://xiaobenli00.github.io/ETCH-X/) , Code: [this https URL](https://github.com/XiaobenLi00/ETCH-X)

点击查看摘要

Abstract:Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive-capturing fine details such as hands and facial features-and globally robust to handle real-world challenges, including clothing dynamics, pose variations, and noisy or partial inputs. Existing approaches typically excel in only one aspect, lacking an all-in-one this http URL upgrade ETCH to ETCH-X, which leverages a tightness-aware fitting paradigm to filter out clothing dynamics ("undress"), extends expressiveness with SMPL-X, and replaces explicit sparse markers (which are highly sensitive to partial data) with implicit dense correspondences ("dense fit") for more robust and fine-grained body fitting. Our disentangled "undress" and "dense fit" modular stages enable separate and scalable training on composable data sources, including diverse simulated garments (CLOTH3D), large-scale full-body motions (AMASS), and fine-grained hand gestures (InterHand2.6M), improving outfit generalization and pose robustness of both bodies and hands. Our approach achieves robust and expressive fitting across diverse clothing, poses, and levels of input completeness, delivering a substantial performance improvement over ETCH on both: 1) seen data, such as 4D-Dress (MPJPE-All, 33.0% ) and CAPE (V2V-Hands, 35.8% ), and 2) unseen data, such as BEDLAM2.0 (MPJPE-All, 80.8% ; V2V-All, 80.5% ). Code and models will be released at this https URL.

2. 【2604.08547】GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics

链接https://arxiv.org/abs/2604.08547

作者:Jiaxin Wang,Dongxin Lyu,Zeyu Cai,Zhiyang Dou,Cheng Lin,Anpei Chen,Yuliang Xiu

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:effectively capture non-rigid, capture non-rigid deformations, Free-form bones, non-rigid surface deformations, partwise motion matching

备注: Page: [this https URL](https://cookmaker.cn/gaussianimate)

点击查看摘要

Abstract:Free-form bones, that conform closely to the surface, can effectively capture non-rigid deformations, but lack a kinematic structure necessary for intuitive control. Thus, we propose a Scaffold-Skin Rigging System, termed "Skelebones", with three key steps: (1) Bones: compress temporally-consistent deformable Gaussians into free-form bones, approximating non-rigid surface deformations; (2) Skeleton: extract a Mean Curvature Skeleton from canonical Gaussians and refine it temporally, ensuring a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) Binding: bind the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Collectively, these three steps enable us to compress the Level of Dynamics of 4D shapes into compact skelebones that are both controllable and expressive. We validate our approach on both synthetic and real-world datasets, achieving significant improvements in reanimation performance across unseen poses-with 17.3% PSNR gains over Linear Blend Skinning (LBS) and 21.7% over Bag-of-Bones (BoB)-while maintaining excellent reconstruction fidelity, particularly for characters exhibiting complex non-rigid surface dynamics. Our Partwise Motion Matching algorithm demonstrates strong generalization to both Gaussian and mesh representations, especially under low-data regime (~1000 frames), achieving 48.4% RMSE improvement over robust LBS and outperforming GRU- and MLP-based learning methods by 20%. Code will be made publicly available for research purposes at this http URL.

3. 【2604.08546】When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

链接https://arxiv.org/abs/2604.08546

作者:Zhengyang Sun,Yu Chen,Xin Zhou,Xiaofan Li,Xiwu Chen,Dingkang Liang,Xiang Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:open-ended video synthesis, enabled open-ended video, video synthesis, enabled open-ended, open-ended video

备注: Accepted by CVPR 2026. Project page: [this https URL](https://h-embodvis.github.io/NUMINA)

点击查看摘要

Abstract:Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA , a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at this https URL.

4. 【2604.08545】Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

链接https://arxiv.org/abs/2604.08545

作者:Shilin Yan,Jintao Tong,Hongwei Xue,Xiaojun Tang,Yangyang Wang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:agentic multimodal models, advent of agentic, agentic multimodal, empowered systems, systems to actively

备注: Project Page: [this https URL](https://Accio-Lab.github.io/Metis)

点击查看摘要

Abstract:The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum-compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.

5. 【2604.08544】SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

链接https://arxiv.org/abs/2604.08544

作者:Yunsong Zhou,Hangxu Liu,Xuekun Jiang,Xing Shen,Yuanzhen Zhou,Hui Wang,Baole Fang,Yang Tian,Mulin Yu,Qiaojun Yu,Li Ma,Hengjie Li,Hanqing Wang,Jia Zeng,Jiangmiao Pang

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:deformable objects represents, variability of rigids, Robotic manipulation, objects represents, represents a data-intensive

备注: Website: [this https URL](https://internrobotics.github.io/sim1.github.io/)

点击查看摘要

Abstract:Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.

6. 【2604.08543】E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation

链接https://arxiv.org/abs/2604.08543

作者:Mayur Deshmukh,Hiroyasu Akada,Helge Rhodin,Christian Theobalt,Vladislav Golyanik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:cameras offer multiple, offer multiple advantages, millisecond temporal resolution, Event cameras offer, negligible motion blur

备注: 20 pages; 14 figures and 14 tables; CVPR 2026; project page: [this https URL](https://4dqv.mpi-inf.mpg.de/E-3DPSM/)

点击查看摘要

Abstract:Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x. See our project page for the source code and trained models.

7. 【2604.08542】Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

链接https://arxiv.org/abs/2604.08542

作者:Tao Xie,Peishan Yang,Yudong Jin,Yingfeng Cai,Wei Yin,Weiqiang Ren,Qian Zhang,Wei Hua,Sida Peng,Xiaoyang Guo,Xiaowei Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:long video sequences, paper addresses, addresses the task, reconstruction accuracy, long video

备注: Project page: [this https URL](https://zju3dv.github.io/scal3r)

点击查看摘要

Abstract:This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry~\cite{Geiger2012CVPR} and Oxford Spires~\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at this https URL.

8. 【2604.08541】Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

链接https://arxiv.org/abs/2604.08541

作者:Haolei Xu,Haiwen Hong,Hongxing Li,Rui Zhou,Yang Zhang,Longtao Huang,Hui Xue,Yongliang Shen,Weiming Lu,Yueting Zhuang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:achieved remarkable performance, achieved remarkable, remarkable performance, performance on vision-language, domain

备注

点击查看摘要

Abstract:Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

9. 【2604.08540】AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

链接https://arxiv.org/abs/2604.08540

作者:Ziwei Zhou,Zeyuan Lai,Rui Wang,Yifan Yang,Zhen Xing,Yuqing Yang,Qi Dai,Lili Qiu,Chong Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:evaluation remains fragmented, media creation, remains fragmented, core interface, interface for media

备注

点击查看摘要

Abstract:Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at this http URL.

10. 【2604.08539】OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

链接https://arxiv.org/abs/2604.08539

作者:Wenbo Hu,Xin Chen,Yan Gao-Tian,Yihe Deng,Nanyun Peng,Kai-Wei Chang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Group Relative Policy, Relative Policy Optimization, facto Reinforcement Learning, Multimodal Large Language, Large Language Models

备注: code at: [this https URL](https://github.com/uclanlp/openvlthinker)

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric update for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforce direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.

11. 【2604.08538】ParseBench: A Document Parsing Benchmark for AI Agents

链接https://arxiv.org/abs/2604.08538

作者:Boyang Zhang,Sebastián G. Acosta,Preston Carlson,Sacha Bron,Pierre-Loïc Doulcet,Simon Suo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:agents are changing, changing the requirements, visual grounding, document parsing, semantically meaningful formatting

备注

点击查看摘要

Abstract:AI agents are changing the requirements for document parsing. What matters is \emph{semantic correctness}: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce \textbf{ParseBench}, a benchmark of ${\sim}2{,}000$ human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on \href{this https URL}{HuggingFace} and \href{this https URL}{GitHub}.

12. 【2604.08536】RewardFlow: Generate Images by Optimizing What You Reward

链接https://arxiv.org/abs/2604.08536

作者:Onkar Susladkar,Dong-Hwan Jang,Tushar Prakash,Adheesh Juvekar,Vedant Shah,Ayush Barik,Nabeel Bashir,Muntasir Wahed,Ritish Shrirao,Ismini Lourentzou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词

备注: CVPR 2026. Project page: [this https URL](https://plan-lab.github.io/rewardflow)

点击查看摘要

None

13. 【2604.08535】Fail2Drive: Benchmarking Closed-Loop Driving Generalization

链接https://arxiv.org/abs/2604.08535

作者:Simon Gerstenecker,Andreas Geiger,Katrin Renz

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:distribution shift remains, remains a central, central bottleneck, closed-loop autonomous driving, CARLA enable safe

备注

点击查看摘要

Abstract:Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8\%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at this https URL .

14. 【2604.08532】Self-Improving 4D Perception via Self-Distillation

链接https://arxiv.org/abs/2604.08532

作者:Nan Huang,Pengcheng Yu,Weijia Zeng,James M. Rehg,Angjoo Kanazawa,Haiwen Feng,Qianqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:made remarkable progress, Large-scale multi-view reconstruction, Large-scale multi-view, fully supervised training, remarkable progress

备注

点击查看摘要

Abstract:Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $\pi^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: this https URL.

15. 【2604.08526】FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

链接https://arxiv.org/abs/2604.08526

作者:Johanna Karras,Yuanhao Wang,Yingwei Li,Ira Kemelmacher-Shlizerman

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:aims to synthesize, original pose, VTO, VTO methods, garment

备注: SIGGRAPH 2026

点击查看摘要

Abstract:Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit -- for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set the new state-of-the-art for fit-aware virtual try-on, as well as offer a robust benchmark for future research. We will make all data and code publicly available on our project page: this https URL.

Comments:
SIGGRAPH 2026

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Cite as:
arXiv:2604.08526 [cs.CV]

(or
arXiv:2604.08526v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.08526

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
16. 【2604.08522】UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

链接https://arxiv.org/abs/2604.08522

作者:Joungbin An,Agrim Jain,Kristen Grauman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video temporal grounding, typically tackled, tackled with dataset-specific, poorly across domains, VTG

备注: Project Page: [this https URL](https://vision.cs.utexas.edu/projects/universalvtg)

点击查看摘要

Abstract:Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks-GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions-one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.

17. 【2604.08516】MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

链接https://arxiv.org/abs/2604.08516

作者:Tanmay Gupta,Piper Wolters,Zixian Ma,Peter Sushko,Rock Yuren Pang,Diego Llanes,Yue Yang,Taira Anderson,Boyuan Zheng,Zhongzheng Ren,Harsh Trivedi,Taylor Blanton,Caleb Ouellette,Winson Han,Ali Farhadi,Ranjay Krishna

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Web agents, autonomous systems, behalf of users, digital world, systems that navigate

备注: [this https URL](https://allenai.org/blog/molmoweb)

点击查看摘要

Abstract:Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B size, on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb agents achieve state-of-the-art results outperforming similar scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.

Comments:
this https URL

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.08516 [cs.CV]

(or
arXiv:2604.08516v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.08516

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2604.08513】When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations

链接https://arxiv.org/abs/2604.08513

作者:Kabilan Elangovan,Daniel Ting

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:medical image classification, image classification due, widely adopted, adopted in medical, medical image

备注

点击查看摘要

Abstract:Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model's predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.

19. 【2604.08509】Visually-grounded Humanoid Agents

链接https://arxiv.org/abs/2604.08509

作者:Hang Ye,Xiaoxuan Ma,Fan Lu,Wayne Wu,Kwan-Yee Lin,Yizhou Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Digital human generation, studied for decades, decades and supports, supports a wide, wide range

备注: Project page: [this https URL](https://alvinyh.github.io/VGHuman/)

点击查看摘要

Abstract:Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environments with digital humans at scale that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI. Data, code, and models will be open-sourced.

20. 【2604.08503】Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

链接https://arxiv.org/abs/2604.08503

作者:Ying Shen,Jerry Xiong,Tianjiao Yu,Ismini Lourentzou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, generative video modeling, remarkable visual realism, yielded remarkable visual, physical dynamics

备注: 15 pages, 6 figures, CVPR 2026

点击查看摘要

Abstract:Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In his work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informaive embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physical-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

21. 【2604.08502】Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

链接https://arxiv.org/abs/2604.08502

作者:Kabilan Elangovan,Daniel Ting

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Class Activation Mapping, Class Activation, Activation Mapping, deep learning classifiers, generate visual explanations

备注

点击查看摘要

Abstract:Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent: whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques: GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++ across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability. ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse and yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.

22. 【2604.08500】Novel View Synthesis as Video Completion

链接https://arxiv.org/abs/2604.08500

作者:Qi Wu,Khiem Vuong,Minsik Jeon,Srinivasa Narasimhan,Deva Ramanan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:target camera pose, camera poses, camera pose, view synthesis, target camera

备注: Project page: [this https URL](https://frame-crafter.github.io/)

点击查看摘要

Abstract:We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: this https URL

23. 【2604.08494】What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

链接https://arxiv.org/abs/2604.08494

作者:Mohamed Amine Kerkouri,Marouane Tliba,Bin Wang,Aladine Chetouani,Ulas Bagci,Alessandro Bruno

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:attended image regions, existing methods predominantly, methods predominantly evaluate, neglecting semantic equivalence, predominantly evaluate spatial

备注: Accepted at ETRA 2026 GenAI workshop

点击查看摘要

Abstract:Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.

24. 【2604.08476】Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

链接https://arxiv.org/abs/2604.08476

作者:Sai Srinivas Kancheti,Aditya Kanade,Rohit Sinha,Vineeth N Balasubramanian,Tanuja Ganu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal reasoning models, Multimodal reasoning, verifiable rewards, reinforcement learning, learning with verifiable

备注

点击查看摘要

Abstract:Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.

25. 【2604.08475】LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

链接https://arxiv.org/abs/2604.08475

作者:Jingjing Wang,Zhengdong Hong,Chong Bao,Yuke Zhu,Junhan Sun,Guofeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human-like generalization, open-world manipulation, remains a fundamental, fundamental challenge, challenge for robotic

备注

点击查看摘要

Abstract:Human-like generalization in open-world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action-models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large-language-model (LLMs) and vision-language-model (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that \codename delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: this https URL.

26. 【2604.08461】OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

链接https://arxiv.org/abs/2604.08461

作者:Haoxi Zeng,Qiankun Liu,Yi Bin,Haiyue Zhang,Yujuan Ding,Guoqing Wang,Deqiang Ouyang,Heng Tao Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:predefined category sets, leveraging semantic descriptions, segment image regions, Vision Foundation Models, image regions

备注: 14 pages, 12 figures, 5 tables

点击查看摘要

Abstract:Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).

27. 【2604.08457】CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

链接https://arxiv.org/abs/2604.08457

作者:Rui Gan,Junyi Ma,Pei Li,Xingyou Yang,Kai Chen,Sikai Chen,Bin Ran

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:infrastructure perspectives, vehicle and infrastructure, driving requires traffic, requires traffic scene, requires traffic

备注

点击查看摘要

Abstract:Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present \textbf{CrashSight}, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at this https URL.

28. 【2604.08456】Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

链接https://arxiv.org/abs/2604.08456

作者:Marcel Gröpl,Jaewoo Jung,Seungryong Kim,Marc Pollefeys,Sunghwan Hong

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:combining clues spread, pretrained vision-language models, tiny visual details, rapid progress, pretrained vision-language

备注: Project Page : [this https URL](https://entropy-gradient-grounding.github.io/)

点击查看摘要

Abstract:Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.

29. 【2604.08435】HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

链接https://arxiv.org/abs/2604.08435

作者:Changdao Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:long-range temporal dependencies, subtle facial expressions, constrained computational budgets, assess driver fatigue, State Space Models

备注: 10 pages

点击查看摘要

Abstract:It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.

30. 【2604.08410】BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

链接https://arxiv.org/abs/2604.08410

作者:Fan Yang,Wenrui Chen,Guorun Yan,Ruize Liao,Wanjun Jia,Dongsheng Luo,Kailun Yang,Zhiyong Li,Yaonan Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:dexterous grasping calls, functional dexterous grasping, functional dexterous manipulation, functional dexterous, unstructured environments

备注: Code will be publicly available at [this https URL](https://github.com/PopeyePxx/BLaDA)

点击查看摘要

Abstract:In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic--pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at this https URL.

31. 【2604.08405】SyncBreaker:Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

链接https://arxiv.org/abs/2604.08405

作者:Wenli Zhang,Xianglong Shi,Sirui Zhao,Xinqi Chen,Guo Cheng,Yifan Xu,Tong Xu,Yong Liao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion-based audio-driven talking-head, Diffusion-based audio-driven, realistic portrait animation, audio-driven talking-head generation, risks of misuse

备注

点击查看摘要

Abstract:Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: this https URL.

32. 【2604.08395】Phantasia: Context-Adaptive Backdoors in Vision Language Models

链接https://arxiv.org/abs/2604.08395

作者:Nam Duong Tran,Phi Le Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:driving rapid progress, Recent advances, linguistic reasoning, driving rapid, multimodal understanding

备注: CVPR 2026 Findings

点击查看摘要

Abstract:Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.

33. 【2604.08370】SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction

链接https://arxiv.org/abs/2604.08370

作者:Chensheng Dai,Shengjun Zhang,Min Chen,Yueqi Duan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated impressive performance, Gaussian Splatting, Gaussian surfels, Gaussian, demonstrated impressive

备注: Code is available at [this https URL](https://github.com/Simon-Dcs/Surfel_Splat)

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extractions. However, these approaches typically require dense input views and high time consumption for per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial sampling rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves comparable results with state-of-the-art methods, and predict Gaussian surfels within 1 second, offering a 100x speedup without costly per-scene training.

34. 【2604.08368】SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization

链接https://arxiv.org/abs/2604.08368

作者:Seyed Mahmoud Sajjadi Mohammadabadi,Xiaolong Ma,Lei Yang,Feng Yan,Junshan Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:enable scalable adaptation, Parameter-efficient fine-tuning, injecting low-rank adapters, enable scalable, scalable adaptation

备注

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses each PEFT update as a linear combination of basis vectors formed from the foundation model's singular vectors with controlled random perturbations. By exploiting the subspace similarity (the alignment of principal directions) between the foundation model and task-specific fine-tuned updates, SOLAR decouples the adapter size from PEFT structure and ensures compact yet expressive representations. It is model-agnostic and compatible with existing PEFT methods, including LoRA, AdaLoRA, and other adapter modules. We theoretically establish a bound on the reconstruction error. Experiments on language and vision tasks using LLaMA, GPT, and ViT models demonstrate that SOLAR preserves task performance while significantly reducing model representation sizes, offering an effective and communication-efficient solution for deployment in distributed systems and edge devices.

35. 【2604.08366】Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

链接https://arxiv.org/abs/2604.08366

作者:Tolga Dimlioglu,Nadine Chang,Maying Shen,Rafid Mahmood,Jose M. Alvarez

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large-scale deep learning, Large-scale deep, deep learning models, data collection efforts, deep learning

备注: Accepted to CVPR 2026, 8 pages of main body and 10 pages of appendix

点击查看摘要

Abstract:Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models and correspondingly, the training data, must address different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80\% less data.

36. 【2604.08364】MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

链接https://arxiv.org/abs/2604.08364

作者:Junyao Gao,Sibo Liu,Jiaxing Li,Yanan Sun,Yuanpeng Tu,Fei Shen,Weidong Zhang,Cairong Zhao,Jun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:scalable data curation, data curation pipeline, style, introduce MegaStyle, scalable data

备注: project website [this https URL](https://jeoyal.github.io/MegaStyle/)

点击查看摘要

Abstract:In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset MegaStyle-1.4M via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder MegaStyle-Encoder for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model MegaStyle-FLUX. Extensive experiments demonstrate the importance of maintaining intra-style consistency, inter-style diversity and high-quality for style dataset, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website this https URL.

37. 【2604.08340】PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

链接https://arxiv.org/abs/2604.08340

作者:Ruizhi Zhang,Ye Huang,Yuangang Pan,Chuanfu Shen,Zhilin Liu,Ting Xie,Wen Li,Lixin Duan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:remains severely limited, achieved remarkable progress, embodied environments remains, environments remains severely, static visual understanding

备注: Tech report

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.

38. 【2604.08337】InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

链接https://arxiv.org/abs/2604.08337

作者:Ashutosh Kumar,Rajat Saini,Jingjing Pan,Mustafa Erdogan,Mingfang Zhang,Betty Le Dem,Norimasa Kobori,Quan Kong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Current vision-language pre-training, Current vision-language, instance-level reasoning due, paradigms excel, reasoning due

备注

点击查看摘要

Abstract:Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.

39. 【2604.08333】Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

链接https://arxiv.org/abs/2604.08333

作者:Xun Zhu,Fanbin Mo,Xi Chen,Kaili Zheng,Shaoshuai Yang,Yiming Shi,Jian Gao,Miao Li,Ji Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:multimodal large language, large language models, medical imaging analysis, imaging analysis, rise of multimodal

备注

点击查看摘要

Abstract:The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community-highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.

40. 【2604.08322】Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

链接https://arxiv.org/abs/2604.08322

作者:Yuchuan Deng,Qijie Wei,Kaiheng Qian,Jiazhen Liu,Zijie Xin,Bangxiang Lan,Jingyu Liu,Jianfeng Dong,Xirong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:OCT and UWF, UWF is crucial, Fundus imaging, anomalies and diseases, early detection

备注

点击查看摘要

Abstract:Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94\% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.

41. 【2604.08313】Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow

链接https://arxiv.org/abs/2604.08313

作者:Richard Petersen,Fredrik Kahl,Jennifer Alvén

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Dense annotations, expert voxel-wise labeling, time-consuming to obtain, labeling is required, expensive and time-consuming

备注: Submitted to MICCAI 2026

点击查看摘要

Abstract:Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying size and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.

42. 【2604.08301】GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis

链接https://arxiv.org/abs/2604.08301

作者:Yishen Liu,Hongcang Chen,Pengcheng Zhao,Yunfan Bao,Yuxi Tian,Jieming Zhang,Hao Chen,Zheng Zhi,Yongchun Liu,Ying Li,Dongpu Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real anomalous samples, industrial quality control, anomalous samples, industrial quality, scarcity of real

备注: 32 pages, 15 figures

点击查看摘要

Abstract:The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This carefully preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.

43. 【2604.08295】U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

链接https://arxiv.org/abs/2604.08295

作者:Angeliki Dimitriou,Nikolaos Chaidos,Maria Lymperaiou,Giorgos Filandrianos,Giorgos Stamou

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:concept-based counterfactual methods, Graph Edit Distance, grow more complex, explainability is essential, building trust

备注

点击查看摘要

Abstract:As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.

44. 【2604.08294】Can Vision Language Models Judge Action Quality? An Empirical Evaluation

链接https://arxiv.org/abs/2604.08294

作者:Miguel Monte e Freitas,Rui Henriques,Ricardo Rei,Pedro Henrique Martins

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Action Quality Assessment, Vision Language Models, sports coaching, Action Quality, physical therapy

备注

点击查看摘要

Abstract:Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

45. 【2604.08287】CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild

链接https://arxiv.org/abs/2604.08287

作者:Siyuan Yao,Hao Sun,Ruiqi Yu,Xiwei Jiang,Wenqi Ren,Xiaochun Cao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Discovering camouflaged objects, computer vision due, camouflaged object detection, Discovering camouflaged, camouflaged object

备注: Under review

点击查看摘要

Abstract:Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders the deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategy. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark covers a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edge, occlusion, motion blur, and shape complexity, etc. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses on the camouflaged object's motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in VCOD task. The benchmark is available at this https URL, we hope that our CAMotion can lead to further advancements in the research community.

46. 【2604.08282】Revisiting Radar Perception With Spectral Point Clouds

链接https://arxiv.org/abs/2604.08282

作者:Hamza Alsharif,Jing Gu,Pavol Jancura,Satish Ravindran,Gijs Dubbelman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:point clouds, sparse point clouds, point, spectral point cloud, clouds

备注: CVPR 2026 Workshop (PBVS 2026). Project page: [this https URL](https://www.tue-mps.org/Spectral-Point-Clouds-Radar/)

点击查看摘要

Abstract:Radar perception models are trained with different inputs, from range-Doppler spectra to sparse point clouds. Dense spectra are assumed to outperform sparse point clouds, yet they can vary considerably across sensors and configurations, which hinders transfer. In this paper, we provide alternatives for incorporating spectral information into radar point clouds and show that, point clouds need not underperform compared to spectra. We introduce the spectral point cloud paradigm, where point clouds are treated as sparse, compressed representations of the radar spectra, and argue that, when enriched with spectral information, they serve as strong candidates for a unified input representation that is more robust against sensor-specific differences. We develop an experimental framework that compares spectral point cloud (PC) models at varying densities against a dense range-Doppler (RD) benchmark, and report the density levels where the PC configurations meet the performance of the RD benchmark. Furthermore, we experiment with two basic spectral enrichment approaches, that inject additional target-relevant information into the point clouds. Contrary to the common belief that the dense RD approach is superior, we show that point clouds can do just as well, and can surpass the RD benchmark when enrichment is applied. Spectral point clouds can therefore serve as strong candidates for unified radar perception, paving the way for future radar foundation models.

47. 【2604.08272】Preventing Overfitting in Deep Image Prior for Hyperspectral Image Denoising

链接https://arxiv.org/abs/2604.08272

作者:Panagiotis Gkotsis,Athanasios A. Rontogiannis

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:unsupervised deep learning, deep learning framework, inverse imaging problems, Deep image prior, unsupervised deep

备注: 7 pages, 5 figures

点击查看摘要

Abstract:Deep image prior (DIP) is an unsupervised deep learning framework that has been successfully applied to a variety of inverse imaging problems. However, DIP-based methods are inherently prone to overfitting, which leads to performance degradation and necessitates early stopping. In this paper, we propose a method to mitigate overfitting in DIP-based hyperspectral image (HSI) denoising by jointly combining robust data fidelity and explicit sensitivity regularization. The proposed approach employs a Smooth $\ell_1$ data term together with a divergence-based regularization and input optimization during training. Experimental results on real HSIs corrupted by Gaussian, sparse, and stripe noise demonstrate that the proposed method effectively prevents overfitting and achieves superior denoising performance compared to state-of-the-art DIP-based HSI denoising methods.

48. 【2604.08266】Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

链接https://arxiv.org/abs/2604.08266

作者:Jing Gu,Niccolò Cavagnero,Gijs Dubbelman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Large Language, Leveraging the general, autonomous driving systems, general world knowledge

备注

点击查看摘要

Abstract:Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model \textbf{Orion-Lite} can even surpass the performance of its massive VLA teacher, ORION. Setting a new state-of-the-art on the rigorous Bench2Drive benchmark, with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.

49. 【2604.08261】DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

链接https://arxiv.org/abs/2604.08261

作者:Jiangbei Yue,Sharib Ali

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:reliable deep learning, dynamic real-world clinical, real-world clinical environment, clinical environment demands, environment demands reliable

备注

点击查看摘要

Abstract:The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome the challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%

50. 【2604.08238】$\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

链接https://arxiv.org/abs/2604.08238

作者:Arnav Devalapally,Poornima Jain,Kartik Srinivas,Vineeth N. Balasubramanian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:target domain, emerging privacy risk, sensitive source-domain, domain adaptation, source-domain specific information

备注: CVPR 2026

点击查看摘要

Abstract:The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) calls for an urgent need for machine unlearning (MU), where the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and to one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at this https URL

51. 【2604.08230】Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges

链接https://arxiv.org/abs/2604.08230

作者:Saniya M.Deshmukh,Kailash A. Hambarde,Hugo Proença

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibit significant performance, significant performance degradation, unseen target domains, detection models trained, kinds of variations

备注: 44 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Object detection models trained on a source domain often exhibit significant performance degradation when deployed in unseen target domains, due to various kinds of variations, such as sensing conditions, environments and data distributions. Hence, regardless the recent breakthrough advances in deep learning-based detection technology, cross-domain object detection (CDOD) remains a critical research area. Moreover, the existing literature remains fragmented, lacking a unified perspective on the structural challenges underlying domain shift and the effectiveness of adaptation strategies. This survey provides a comprehensive and systematic analysis of CDOD. We start upon a problem formulation that highlights the multi-stage nature of object detection under domain shift. Then, we organize the existing methods through a conceptual taxonomy that categorizes approaches based on adaptation paradigms, modeling assumptions, and pipeline components. Furthermore, we analyze how domain shift propagates across detection stages and discuss why adaptation in object detection is inherently more complex than in classification. In addition, we review commonly used datasets, evaluation protocols, and benchmarking practices. Finally, we identify the key challenges and outline promising future research directions. Cohesively, this survey aims to provide a unified framework for understanding CDOD and to guide the development of more robust detection systems.

52. 【2604.08213】EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

链接https://arxiv.org/abs/2604.08213

作者:Xiangyuan Wang,Honghao Cai,Yunhao Bai,Tianze Zhou,Haohua Chen,Yao Hu,Xu Tang,Yibo Chen,Wei Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:High-quality training triplets, High-quality training, scaling instruction-guided image, bottleneck for scaling, scaling instruction-guided

备注

点击查看摘要

Abstract:High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.

53. 【2604.08212】Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

链接https://arxiv.org/abs/2604.08212

作者:Blessing Agyei Kyem,Joshua Kofi Asamoah,Anthony Dontoh,Armstrong Aboah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requiring precise terminology, demonstrate strong performance, fields requiring precise, General-purpose vision-language models, General-purpose vision-language

备注

点击查看摘要

Abstract:General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

54. 【2604.08211】SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

链接https://arxiv.org/abs/2604.08211

作者:You Hu,Chenzhuo Zhao,Changfa Mo,Haotian Liu,Xiaobai Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Modern multimodal generators, Modern multimodal, produce scientific figures, near-publishable quality, challenge for visual

备注

点击查看摘要

Abstract:Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources and aligned real--synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at this https URL.

55. 【2604.08209】OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

链接https://arxiv.org/abs/2604.08209

作者:Yiduo Jia,Muzhi Zhu,Hao Zhong,Mingyu Liu,Yuling Xi,Hao Chen,Bin Qin,Yongjie Yang,Zhenbo Luo,Chunhua Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:concurrently bolstering video-audio, bolstering video-audio understanding, Sample-level Modality Selection, temporal reordering proxy, Clip-level Modality Masking

备注: Project page: [this https URL](https://aim-uofa.github.io/OmniJigsaw/)

点击查看摘要

Abstract:To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a ``bi-modal shortcut phenomenon'' in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.

56. 【2604.08203】MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

链接https://arxiv.org/abs/2604.08203

作者:Zheng Jiang,Heng Guo,Chengyu Fang,Changchen Xiao,Xinyang Hu,Lifeng Sun,Minfeng Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:hold immense promise, hold immense, immense promise, promise for complex, constrained by text-only

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.

57. 【2604.08192】Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

链接https://arxiv.org/abs/2604.08192

作者:Yunxiang Peng,Mengmeng Ma,Ziyu Yao,Xi Peng

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:machine learning models, machine learning, target data, evaluation of machine, generalization

备注: CVPR 2026(Highlight)

点击查看摘要

Abstract:Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4\% and 34.1\%, respectively. Our code is available at this https URL.

58. 【2604.08172】On the Global Photometric Alignment for Low-Level Vision

链接https://arxiv.org/abs/2604.08172

作者:Mingjia Li,Tianle Du,Hainuo Wang,Qiming Hu,Xiaojie Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Supervised low-level vision, paired training sets, low-level vision models, vision models rely, training sets exhibit

备注

点击查看摘要

Abstract:Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency, say, different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.

59. 【2604.08171】OceanMAE: A Foundation Model for Ocean Remote Sensing

链接https://arxiv.org/abs/2604.08171

作者:Viola-Joanna Stamer,Panagiotis Agrafiotis,Behnood Rasti,Begüm Demir

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Accurate ocean mapping, marine litter detection, Accurate ocean, seabed characterization, litter detection

备注

点击查看摘要

Abstract:Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large- scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre- training for ocean RS. Code and weights are publicly available at this https URL.

60. 【2604.08167】-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

链接https://arxiv.org/abs/2604.08167

作者:Pranjal Khadka

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:prohibitively expensive process, Medical image segmentation, segmentation traditionally relies, image segmentation traditionally, Medical image

备注: Accepted at the PHAROS-AIF-MIH Workshop at CVPR 2026

点击查看摘要

Abstract:Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.

61. 【2604.08159】Face-D(^2)CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection

链接https://arxiv.org/abs/2604.08159

作者:Yushuo Zhang,Yu Cheng,Yongkang Hu,Jiuan Zhou,Jiawei Chen,Yuan Xie,Zhaoxia Yin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:critical research priority, techniques poses severe, poses severe threats, facial DeepFake detection, making facial DeepFake

备注

点击查看摘要

Abstract:The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-D(^2)CL, a framework for facial DeepFake detection. It leverages multi-domain synergistic representation to fuse spatial and frequency-domain features for the comprehensive capture of diverse forgery traces, and employs a dual continual learning mechanism that combines Elastic Weight Consolidation (EWC), which distinguishes parameter importance for real versus fake samples, and Orthogonal Gradient Constraint (OGC), which ensures updates to task-specific adapters do not interfere with previously learned knowledge. This synergy enables the model to achieve a dynamic balance between robust anti-forgetting capabilities and agile adaptability to emerging facial forgery paradigms, all without relying on historical data replay. Extensive experiments demonstrate that our method surpasses current SOTA approaches in both stability and plasticity, achieving 60.7% relative reduction in average detection error rate, respectively. On unseen forgery domains, it further improves the average detection AUC by 7.9% compared to the current SOTA method.

62. 【2604.08147】Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

链接https://arxiv.org/abs/2604.08147

作者:Linge Wang,Yingying Chen,Bingke Zhu,Lu Zhou,Jinqiao Wang

类目:ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, combining contrastive alignment, Recent, combining contrastive, masked reconstruction

备注

点击查看摘要

Abstract:Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2\% to 37.4\% for video-to-audio retrieval and from 27.9\% to 37.1\% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at this https URL.

63. 【2604.08138】Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval

链接https://arxiv.org/abs/2604.08138

作者:Sharva Gogawale,Gal Grudka,Daria Vasyutinsky-Shapira,Omer Ventura,Berat Kurar-Barakat,Nachum Dershowitz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:manuscript fragments identified, identified as originally, originally emanating, study manuscript join, manuscript join retrieval

备注

点击查看摘要

Abstract:A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: Given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the global-level visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per image $k$-means, and compares images using set to set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz.\@ Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-$\chi^2$), a 6.1\% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.

64. 【2604.08125】PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

链接https://arxiv.org/abs/2604.08125

作者:Zhi-Yi Lin,Thomas Markhorst,Jouh Yeong Chew,Xucong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Human-like multimodal reaction, natural group interactions, Human-like multimodal, essential for natural, humans and embodied

备注

点击查看摘要

Abstract:Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

65. 【2604.08121】Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

链接https://arxiv.org/abs/2604.08121

作者:Luozheng Qin,Jia Gong,Qian Qiao,Tianjiao Li,Li Xu,Haoyu Pan,Chao Qu,Zhiyu Tan,Hao Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:models integrating visual, incurs substantially higher, substantially higher computational, higher computational costs, integrating visual understanding

备注: Page and Code: [this https URL](https://fr0zencrane.github.io/uni-vigu-page/)

点击查看摘要

Abstract:Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: this https URL.

66. 【2604.08120】Small Vision-Language Models are Smart Compressors for Long Video Understanding

链接https://arxiv.org/abs/2604.08120

作者:Junjie Fei,Jun Chen,Zechun Liu,Yunyang Xiong,Chong Zhou,Wei Wen,Junlin Han,Mingchen Zhuge,Saksham Suri,Qi Qian,Shuming Liu,Lemeng Wu,Raghuraman Krishnamoorthi,Vikas Chandra,Mohamed Elhoseiny,Chenchen Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Adapting Multimodal Large, Multimodal Large Language, Large Language Models, Adapting Multimodal, Multimodal Large

备注: Project page and demo are available at [this https URL](https://FeiElysia.github.io/tempo-page/)

点击查看摘要

Abstract:Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

67. 【2604.08111】Bias Redistribution in Visual Machine Unlearning: Does Forgetting One Group Harm Another?

链接https://arxiv.org/abs/2604.08111

作者:Yunusa Haruna,Adamu Lawan,Ibrahim Haruna Abdulhamid,Hamza Mohammed Dauda,Jiaquan Zhang,Chaoning Zhang,Shamsuddeen Hassan Muhammad

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:GDPR and CCPA, Machine unlearning enables, forget training data, selectively forget training, Machine unlearning

备注

点击查看摘要

Abstract:Machine unlearning enables models to selectively forget training data, driven by privacy regulations such as GDPR and CCPA. However, its fairness implications remain underexplored: when a model forgets a demographic group, does it neutralize that concept or redistribute it to correlated groups, potentially amplifying bias? We investigate this bias redistribution phenomenon on CelebA using CLIP models (ViT/B-32, ViT-L/14, ViT-B/16) under a zero-shot classification setting across intersectional groups defined by age and gender. We evaluate three unlearning methods, Prompt Erasure, Prompt Reweighting, and Refusal Vector using per-group accuracy shifts, demographic parity gaps, and a redistribution score. Our results show that unlearning does not eliminate bias but redistributes it primarily along gender rather than age boundaries. In particular, removing the dominant Young Female group consistently transfers performance to Old Female across all model scales, revealing a gender-dominant structure in CLIP's embedding space. While the Refusal Vector method reduces redistribution, it fails to achieve complete forgetting and significantly degrades retained performance. These findings highlight a fundamental limitation of current unlearning methods: without accounting for embedding geometry, they risk amplifying bias in retained groups.

68. 【2604.08110】OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

链接https://arxiv.org/abs/2604.08110

作者:Seungjae Moon,Seunghyun Oh,Youngmin Ro

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:requiring additional training, perform dense prediction, recently attracted attention, vision-language models, additional training

备注

点击查看摘要

Abstract:Training-free open-vocabulary semantic segmentation(TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union(mIoU) from 48.7 to 50.7 compared with prior training-free baselines.

69. 【2604.08106】EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition

链接https://arxiv.org/abs/2604.08106

作者:Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Xuecheng Wu,Kun Hu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current moment, obtain the real, real emotion, micro-expression representations, Efficient Patch tokenization

备注

点击查看摘要

Abstract:Micro-expression recognition can obtain the real emotion of the individual at the current moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)3. The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and 4.58% improvement on the SMIC dataset in terms of UAR metric.

70. 【2604.08088】Coordinate-Based Dual-Constrained Autoregressive Motion Generation

链接https://arxiv.org/abs/2604.08088

作者:Kang Ding,Hongsong Wang,Jie Gui,Liang Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:attracted increasing attention, research community recently, virtual reality, community recently, applications in animation

备注: Code is available at: [this https URL](https://github.com/fly-dk/CDAMD)

点击查看摘要

Abstract:Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.

71. 【2604.08084】DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

链接https://arxiv.org/abs/2604.08084

作者:Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Ya Jing,Xuecheng Wu,Jiangbin Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current video captioning, generate text autoregressively, Current video, video captioning methods, text autoregressively

备注

点击查看摘要

Abstract:Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.

72. 【2604.08077】AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

链接https://arxiv.org/abs/2604.08077

作者:Handong Li,Zikang Liu,Longteng Guo,Tongtian Yue,Yepeng Tang,Xinxin Zhu,Chuanyang Zheng,Ziming Wang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Jing Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Video Large Language, Large Language, Processing long-form videos, Processing long-form

备注: 8 pages, CVPR2026 Accept (Highlight)

点击查看摘要

Abstract:Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.

73. 【2604.08074】DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

链接https://arxiv.org/abs/2604.08074

作者:Christof Leitgeb,Thomas Puchleitner,Max Peter Ronecker,Daniel Watzenig

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:weather-robust perception systems, safe autonomous driving, typically employ multi-modal, employ multi-modal sensor, multi-modal sensor configurations

备注: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026

点击查看摘要

Abstract:Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available under this https URL.

74. 【2604.08072】nsor-Augmented Convolutional Neural Networks: Enhancing Expressivity with Generic Tensor Kernels

链接https://arxiv.org/abs/2604.08072

作者:Chia-Wei Hsing,Wei-Lin Tu

类目:Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

关键词:Convolutional Neural Networks, Convolutional Neural, Neural Networks, complex correlations hinges, correlations hinges heavily

备注: 8 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) excel at extracting local features hierarchically, but their performance in capturing complex correlations hinges heavily on deep architectures, which are usually computationally demanding and difficult to interpret. To address these issues, we propose a physically-guided shallow model: tensor-augmented CNN (TACNN), which replaces conventional convolution kernels with generic tensors to enhance representational capacity. This choice is motivated by the fact that an order-$N$ tensor naturally encodes an arbitrary quantum superposition state in the Hilbert space of dimension $d^N$, where $d$ is the local physical dimension, thus offering substantially richer expressivity. Furthermore, in our design the convolution output of each layer becomes a multilinear form capable of capturing high-order feature correlations, thereby equipping a shallow multilayer architecture with an expressive power competitive to that of deep CNNs. On the Fashion-MNIST benchmark, TACNN demonstrates clear advantages over conventional CNNs, achieving remarkable accuracies with only a few layers. In particular, a TACNN with only two convolution layers attains a test accuracy of 93.7$\%$, surpassing or matching considerably deeper models such as VGG-16 (93.5$\%$) and GoogLeNet (93.7$\%$). These findings highlight TACNN as a promising framework that strengthens model expressivity while preserving architectural simplicity, paving the way towards more interpretable and efficient deep learning models.

75. 【2604.08070】AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

链接https://arxiv.org/abs/2604.08070

作者:Imane Momayiz,Soufiane Ait Elaouad,Abdeljalil Elmajjodi,Haitame Bouanane

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Optical Character Recognition, specialized Optical Character, lacks specialized Optical, Moroccan Arabic dialect, Character Recognition

备注

点击查看摘要

Abstract:Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.

76. 【2604.08068】Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning

链接https://arxiv.org/abs/2604.08068

作者:Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently achieved promising, achieved promising results, information from electroencephalography, primarily focusing, reconstructing two-dimensional

备注: 17 pages, 2 figures

点击查看摘要

Abstract:Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.

77. 【2604.08063】EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience

链接https://arxiv.org/abs/2604.08063

作者:Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reconstructing visual stimuli, remains challenging due, realistic low-density electrode, low spatial resolution, low-density electrode configurations

备注: 17 pages, 5 figures

点击查看摘要

Abstract:Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality slight decreases (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-2-image applications using low-resolution EEG devices, potentially unlocking this type of applications outside laboratory settings.

78. 【2604.08050】ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

链接https://arxiv.org/abs/2604.08050

作者:Daichi Yashima,Shuhei Kurita,Yusuke Oda,Shuntaro Suzuki,Seitaro Otsuki,Komei Sugiura

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Aligned Hierarchical Bidirectional, Hierarchical Bidirectional Scan, open multimodal large, multimodal large language, Bidirectional Scan Mamba

备注: Accepted to ICPR 2026

点击查看摘要

Abstract:In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.

79. 【2604.08048】Guiding a Diffusion Model by Swapping Its Tokens

链接https://arxiv.org/abs/2604.08048

作者:Weijia Zhang,Yuehao Liu,Shanyan Guan,Wu Ran,Yanhao Ge,Wei Li,Chao Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unconditional generation, widely used inference-time, inference-time technique, technique to boost, Classifier-Free Guidance

备注: Accepted by CVPR 2026 (Oral)

点击查看摘要

Abstract:Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.

80. 【2604.08045】Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

链接https://arxiv.org/abs/2604.08045

作者:Francesca Fati,Alberto Rota,Adriana V. Gregory,Anna Catozzo,Maria C. Giuliano,Mrinal Dhar,Luigi De Vitis,Annie T. Packard,Francesco Multinu,Elena De Momi,Carrie L. Langstraat,Timothy L. Kline

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Adnexal mass evaluation, significant inter-observer variability, Adnexal mass, challenging clinical task, inter-observer variability

备注

点击查看摘要

Abstract:Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: this https URL

81. 【2604.08042】3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

链接https://arxiv.org/abs/2604.08042

作者:Hongcan Xiao,Xinyue Xiao,Yilin Wang,Yue Zhang,Yonggang Qi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:space enables expressive, natural language remains, enables expressive reasoning, space enables, major challenge

备注: CVPR 2026 Highlight

点击查看摘要

Abstract:Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.

82. 【2604.08039】LINE: LLM-based Iterative Neuron Explanations for Vision Models

链接https://arxiv.org/abs/2604.08039

作者:Vladimir Zaigrajew,Michał Piechota,Gaspar Sekula,Przemysław Biecek

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:deep neural networks, complex decision-making processes, ensuring AI safety, encoded by individual, deep neural

备注

点击查看摘要

Abstract:Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.

83. 【2604.08038】Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

链接https://arxiv.org/abs/2604.08038

作者:Jun Li,Yingying Shi,Zhixuan Ruan,Nan Guo,Jianhua Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:poses great challenges, Dilated Convolutions Network, Deformable Dilated Convolutions, Deformable Dilated Convolution, cluttered background

备注

点击查看摘要

Abstract:In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at this https URL.

84. 【2604.08037】PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation

链接https://arxiv.org/abs/2604.08037

作者:Soumya Mazumdar,Vineet Kumar Rakesh,Tapas Samanta

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:diffusion-based generative models, personalized talking-head generation, raising major privacy, major privacy concerns, Talking-head generation

备注: GitHub: [this https URL](https://github.com/mazumdarsoumya/PrivFedTalk)

点击查看摘要

Abstract:Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while suggesting that stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.

85. 【2604.08034】Rotation Equivariant Convolutions in Deformable Registration of Brain MRI

链接https://arxiv.org/abs/2604.08034

作者:Arghavan Rezvani,Kun Han,Anthony T. Wu,Pooya Khosravi,Xiaohui Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fundamental task, task that aligns, brain MRI, aligns anatomical structures, Image registration

备注: Accepted at the 2026 International Symposium on Biomedical Imaging (ISBI) Poster 4-page paper presentation

点击查看摘要

Abstract:Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.

Comments:
Accepted at the 2026 International Symposium on Biomedical Imaging (ISBI) Poster 4-page paper presentation

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.08034 [cs.CV]

(or
arXiv:2604.08034v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.08034

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
86. 【2604.08031】Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

链接https://arxiv.org/abs/2604.08031

作者:Jiawei Liu,Xun Gong,Fen Fang,Muli Yang,Bohao Qu,Yunfeng Hu,Hong Chen,Xulei Yang,Qing Guo

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Human-Machine Interaction, research overlooks, autonomous driving, overlooks the maneuvering, HMI

备注

点击查看摘要

Abstract:Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passenger open-ended instructions into control signals, without sacrificing interpretability and traceability, remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding.

87. 【2604.08015】Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI

链接https://arxiv.org/abs/2604.08015

作者:Minh Sao Khue Luu,Evgeniy N. Pavlovskiy,Bair N. Tuchinov

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Multiple Instance Learning, auxiliary supervision terms, supervision terms operating, augments the base, base segmentation loss

备注

点击查看摘要

Abstract:We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at \href{this https URL}{this url}.

88. 【2604.08014】Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

链接https://arxiv.org/abs/2604.08014

作者:Xuezhen Tu,Jingyu Wu,Fangyu Kang,Qingpeng Nong,Kaijin Zhang,Chaoyue Niu,Fan Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, existing Multimodal Large, Multimodal Large, Language Models

备注

点击查看摘要

Abstract:Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: \textit{entangled spatio-temporal alignment}, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and \textit{dual-domain visual token redundancy}, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose \textbf{Bridge-STG}, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the \textbf{Spatio-Temporal Semantic Bridging (STSB)} mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries as a robust semantic interface; and the \textbf{Query-Guided Spatial Localization (QGSL)} module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy. Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m\_vIoU from $26.4$ to $34.3$ on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.

89. 【2604.08008】SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

链接https://arxiv.org/abs/2604.08008

作者:Felix Embacher,Jonas Uhrig,Marius Cordts,Markus Enzweiler

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:safety-critical driving scenarios, robust autonomous driving, building robust autonomous, Retrieving rare, safety-critical driving

备注: To be published in CVPR 2026

点击查看摘要

Abstract:Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: this https URL

90. 【2604.08000】PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

链接https://arxiv.org/abs/2604.08000

作者:Zhifei Xie,Zongzheng Hu,Fangda Ye,Xin Zhang,Haobo Chai,Zihang Liu,Pengcheng Wu,Guibin Zhang,Yue Liao,Xiaobin Hu,Deheng Ye,Chunyan Miao,Shuicheng Yan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

关键词:expectation for AGI, core expectation, Proactive Agent System, Proactivity, AGI

备注: Technical report; Work in progress

点击查看摘要

Abstract:Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent. We instantiate this paradigm in Pask, with streaming IntentFlow model for DD, a hybrid memory (workspace, user, global) for long-term MM, PAS infra framework and introduce how these components form a closed loop. We also introduce LatentNeeds-Bench, a real-world benchmark built from user-consented data and refined through thousands of rounds of human editing. Experiments show that IntentFlow matches leading Gemini3-Flash models under latency constraints, while identifying deeper user intent.

91. 【2604.07997】Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

链接https://arxiv.org/abs/2604.07997

作者:Yun Zhu,Jianjun Qian,Jian Yang,Jin Xie,Na Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dynamic indoor environments, indoor environments, Few-shot Incremental, critical step, step toward embodied

备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at this https URL.

92. 【2604.07994】SAT: Selective Aggregation Transformer for Image Super-Resolution

链接https://arxiv.org/abs/2604.07994

作者:Dinh Phu Tran,Thao Do,Saad Wazir,Seongah Kim,Seon Kwon Kim,Daeyoung Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:revolutionized image super-resolution, Transformer-based approaches, modeling long-range dependencies, approaches have revolutionized, revolutionized image

备注: Accepted to CVPR2026 (Findings Track)

点击查看摘要

Abstract:Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to compromises between efficiency and global context exploitation. Recent window-based attention methods mitigate this by localizing computations, but they often yield restricted receptive fields. To mitigate these limitations, we propose Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies, leading to an enlarged model receptive field by selectively aggregating key-value matrices (reducing the number of tokens by 97\%) via our Density-driven Token Aggregation algorithm while maintaining the full resolution of the query matrix. This design significantly reduces computational costs, resulting in lower complexity and enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies and represents each cluster with a single aggregation token, utilizing density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22dB, while the total number of FLOPs can be reduced by up to 27\%.

93. 【2604.07991】MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

链接https://arxiv.org/abs/2604.07991

作者:Zile Guo,Zhan Chen,Enze Zhu,Kan Wei,Yongkang Zou,Xiaoxuan Liu,Lei Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:demonstrated strong capabilities, increasingly important foundation, Recent advances, simulating physical reality, embodied intelligence

备注

点击查看摘要

Abstract:Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at this https URL

94. 【2604.07990】SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

链接https://arxiv.org/abs/2604.07990

作者:Yunnan Wang,Kecheng Zheng,Jianyuan Wang,Minghao Chen,David Novotny,Christian Rupprecht,Yinghao Xu,Xing Zhu,Wenjun Zeng,Xin Jin,Yujun Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:geometric perception, spatio-temporal information, created an unprecedented, unprecedented demand, semantic and spatio-temporal

备注: Accepted by CVPR 2026

点击查看摘要

Abstract:The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

95. 【2604.07986】DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction

链接https://arxiv.org/abs/2604.07986

作者:Tingxi Chen,Zhengxue Cheng,Houqiang Zhong,Su Wang,Rong Xie,Li Song

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:crucial for next-generation, video is crucial, Egocentric video, dynamic first-person scenes, reconstructing dynamic first-person

备注

点击查看摘要

Abstract:Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.

96. 【2604.07980】Object-Centric Stereo Ranging for Autonomous Driving: From Dense Disparity to Census-Based Template Matching

链接https://arxiv.org/abs/2604.07980

作者:Qihao Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate depth estimation, long range vehicle, Semi Global Matching, range vehicle detection, Accurate depth

备注: 10 pages, 4 figures

点击查看摘要

Abstract:Accurate depth estimation is critical for autonomous driving perception systems, particularly for long range vehicle detection on highways. Traditional dense stereo matching methods such as Block Matching (BM) and Semi Global Matching (SGM) produce per pixel disparity maps but suffer from high computational cost, sensitivity to radiometric differences between stereo cameras, and poor accuracy at long range where disparity values are small. In this report, we present a comprehensive stereo ranging system that integrates three complementary depth estimation approaches: dense BM/SGM disparity, object centric Census based template matching, and monocular geometric priors, within a unified detection ranging tracking pipeline. Our key contribution is a novel object centric Census based template matching algorithm that performs GPU accelerated sparse stereo matching directly within detected bounding boxes, employing a far close divide and conquer strategy, forward backward verification, occlusion aware sampling, and robust multi block aggregation. We further describe an online calibration refinement framework that combines auto rectification offset search, radar stereo voting based disparity correction, and object level radar stereo association for continuous extrinsic drift compensation. The complete system achieves real time performance through asynchronous GPU pipeline design and delivers robust ranging across diverse driving conditions including nighttime, rain, and varying illumination.

97. 【2604.07966】Lighting-grounded Video Generation with Renderer-based Agent Reasoning

链接https://arxiv.org/abs/2604.07966

作者:Ziqi Cai,Taoyu Yang,Zheng Chang,Si Li,Han Jiang,Shuchen Weng,Boxin Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable progress, major limitation, video generation, achieved remarkable, remarkable progress

备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

98. 【2604.07965】DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

链接https://arxiv.org/abs/2604.07965

作者:Gyanendra Das,Sai Satyam Jena

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Model editing aims, change relevant information, Vision Language Models, information without retraining, aims to update

备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA) which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision language representations. This process structurally isolates concepts, enabling precise, non interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi term loss function for maintaining task fidelity, edit locality, and cross modal alignment. With the base model frozen, our method achieves 98 percent single edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA state of the art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.

99. 【2604.07960】OOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

链接https://arxiv.org/abs/2604.07960

作者:Yifei Gong,Xing Wu,Wenda Liu,Kang Tu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Computer-Aided Design, coherent modeling actions, CAD, CAD tool-using agents, relies on long-horizon

备注

点击查看摘要

Abstract:Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to rollout reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into proficient CAD tool-using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.

100. 【2604.07958】ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

链接https://arxiv.org/abs/2604.07958

作者:Jiayang Xu,Fan Zhuo,Majun Zhang,Changhao Pan,Zehan Wang,Siyu Chen,Xiaoda Yang,Tao Jin,Zhou Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current video editing, paired video data, expensive paired video, Current video, practical scalability

备注

点击查看摘要

Abstract:Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.

101. 【2604.07957】WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

链接https://arxiv.org/abs/2604.07957

作者:Hongjin Chen,Shangyun Jiang,Tonghua Su,Chen Gao,Xinlei Chen,Yong Li,Zhibo Chen

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:opening new opportunities, world models, generative world models, models, generative world

备注

点击查看摘要

Abstract:Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the grounded signals needed for navigation learning. This raises a central question: how can generated futures be turned into supervision for grounded trajectory prediction? We present WorldMAP, a teacher--student framework that converts world-model-generated futures into persistent semantic-spatial structure and planning-derived supervision. Its world-model-driven teacher builds semantic-spatial memory from generated videos, grounds task-relevant targets and obstacles, and produces trajectory pseudo-labels through explicit planning. A lightweight student with a multi-hypothesis trajectory head is then trained to predict navigation trajectories directly from vision-language inputs. On Target-Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, while lifting a small open-source VLM to DTW performance competitive with proprietary models. More broadly, the results suggest that, in embodied navigation, the value of world models may lie less in supplying action-ready imagined evidence than in synthesizing structured supervision for navigation learning.

102. 【2604.07936】Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps

链接https://arxiv.org/abs/2604.07936

作者:Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Hien Nguyen,Chandra Mohan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Stain, pervasive source, source of distribution, distribution shift, renal pathology

备注: Accepted at IEEE ISBI 2026. Hien Nguyen and Chandra Mohan jointly supervised this work

点击查看摘要

Abstract:Stain variability is a pervasive source of distribution shift and potential shortcut learning in renal pathology AI. We ask whether lupus nephritis glomerular lesion classifiers exploit stain as a shortcut, and how to mitigate such bias without stain or site labels. We curate a multi-center, multi-stain dataset of 9{,}674 glomerular patches (224$\times$224) from 365 WSIs across three centers and four stains (PAS, H\E, Jones, Trichrome), labeled as proliferative vs.\ non-proliferative. We evaluate Bayesian CNN and ViT backbones with Monte Carlo dropout in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with supervised stain loss; and (3) a dual-head model with label-free stain regularization via entropy maximization on the stain head. In (1), stain identity is trivially learnable, confirming a strong candidate shortcut. In (2), varying the strength and sign of stain supervision strongly modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this multi-stain, multi-center dataset, while overly adversarial stain penalties inflate predictive uncertainty. In (3), entropy-based regularization holds stain predictions near chance without degrading lesion accuracy or calibration. Overall, a carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization offers a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI.

103. 【2604.07928】Generative 3D Gaussian Splatting for Arbitrary-ResolutionAtmospheric Downscaling and Forecasting

链接https://arxiv.org/abs/2604.07928

作者:Tao Hana,Zhibin Wen,Zhenghao Chen,Fenghua Lin,Junyu Gao,Song Guo,Lei Bai

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:inefficient data representations, AI-based numerical weather, outputs remains computationally, remains computationally demanding, computationally demanding due

备注: 20 pages, 13 figures

点击查看摘要

Abstract:While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: this https URL.

104. 【2604.07923】Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation

链接https://arxiv.org/abs/2604.07923

作者:Hina Kogure,Kei Katsumata,Taiki Miyanishi,Komei Sugiura

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:spatially separated locations, captured by cameras, spatially separated, separated locations, sparse

备注

点击查看摘要

Abstract:Dynamic urban environments are often captured by cameras placed at spatially separated locations with little or no view overlap. However, most existing 4D reconstruction methods assume densely overlapping views. When applied to such sparse observations, these methods fail to reconstruct intermediate regions and often introduce temporal artifacts. To address this practical yet underexplored sparse multi-location setting, we propose Stitch4D, a unified 4D reconstruction framework that explicitly compensates for missing spatial coverage in sparse observations. Stitch4D (i) synthesizes intermediate bridge views to densify spatial constraints and improve spatial coverage, and (ii) jointly optimizes real and synthesized observations within a unified coordinate frame under explicit inter-location consistency constraints. By restoring intermediate coverage before optimization, Stitch4D prevents geometric collapse and reconstructs coherent geometry and smooth scene dynamics even in sparsely observed environments. To evaluate this setting, we introduce Urban Sparse 4D (U-S4D), a CARLA-based benchmark designed to assess spatiotemporal alignment under sparse multi-location configurations. Experimental results on U-S4D show that Stitch4D surpasses representative 4D reconstruction baselines and achieves superior visual quality. These results indicate that recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments.

105. 【2604.07916】arot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

链接https://arxiv.org/abs/2604.07916

作者:Weiming Zhang,Dingwen Xiao,Songyue Guo,Guangyu Xiang,Shiqi Wen,Minwei Zhao,Lei Chen,Lin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring Expression, Referring Expression Segmentation, Promptable Concept Segmentation, bridge between vision, RES

备注: Under review

点击查看摘要

Abstract:Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.

106. 【2604.07914】Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

链接https://arxiv.org/abs/2604.07914

作者:Yuanhong Zhang,Zhaoyang Wang,Xin Zhang,Weizhan Zhang,Joey Tianyi Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, achieved remarkable success, producing textual outputs, Large Vision-Language, textual outputs inconsistent

备注

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disrupts the model's intrinsic generation behavior. To address this, we propose MESA, an effective plug-and-play framework that performs controlled and selective latent intervention for hallucination mitigation. Specifically, MESA targets hallucination-relevant responses while preserving the model's original token distribution, enabling effective hallucination reduction without compromising generation behavior. Extensive experiments across diverse generative and discriminative benchmarks demonstrate that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.

107. 【2604.07912】ParkSense: Where Should a Delivery Driver Park? Leveraging Idle AV Compute and Vision-Language Models

链接https://arxiv.org/abs/2604.07912

作者:Die Hu,Henan Li

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Finding parking consumes, food delivery time, system addresses precise, addresses precise parking-spot, precise parking-spot selection

备注: 7 pages, 3 tables. No university resources were used for this work

点击查看摘要

Abstract:Finding parking consumes a disproportionate share of food delivery time, yet no system addresses precise parking-spot selection relative to merchant entrances. We propose ParkSense, a framework that repurposes idle compute during low-risk AV states -- queuing at red lights, traffic congestion, parking-lot crawl -- to run a Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying entrances and legal parking zones. We formalize the Delivery-Aware Precision Parking (DAPP) problem, show that a quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware, and estimate annual per-driver income gains of 3,000-8,000 USD in the U.S. Five open research directions are identified at this unexplored intersection of autonomous driving, computer vision, and last-mile logistics.

108. 【2604.07904】Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency

链接https://arxiv.org/abs/2604.07904

作者:Mingqing Xiao,Yansen Wang,Dongqi Han,Caihua Shan,Dongsheng Li

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:support flexible coordination, biological information processing, Spatiotemporal neural dynamics, feature binding, Spatiotemporal neural

备注

点击查看摘要

Abstract:Spatiotemporal neural dynamics and oscillatory synchronization are widely implicated in biological information processing and have been hypothesized to support flexible coordination such as feature binding. By contrast, most deep learning architectures represent and propagate information through activation values, neglecting the joint dynamics of rate and phase. In this work, we introduce Kuramoto oscillatory Phase Encoding (KoPE) as an additional, evolving phase state to Vision Transformers, incorporating a neuro-inspired synchronization mechanism to advance learning efficiency. We show that KoPE can improve training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning. Moreover, KoPE benefits tasks requiring structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE can accelerate attention concentration for learning efficiency. These results indicate that synchronization can serve as a scalable, neuro-inspired mechanism for advancing state-of-the-art neural network models.

109. 【2604.07901】PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation

链接https://arxiv.org/abs/2604.07901

作者:Dingwen Xiao,Weiming Zhang,Shiqi Wen,Lin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offering full-scene coverage, predict temporally-consistent masks, benefiting applications, video object segmentation, aims to predict

备注

点击查看摘要

Abstract:360 video object segmentation (360VOS) aims to predict temporally-consistent masks in 360 videos, offering full-scene coverage, benefiting applications, such as VR/AR and embodied AI. Learning 360VOS model is nontrivial due to the lack of high-quality labeled dataset. Recently, Segment Anything Models (SAMs), especially SAM2 -- with its design of memory module -- shows strong, promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results as 360 videos suffer from the projection distortion, semantic inconsistency of left-right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework based on our lightweight distortion- and memory-aware adaptation strategies of SAM2 to achieve reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module to maintain a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence. Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.

110. 【2604.07900】AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

链接https://arxiv.org/abs/2604.07900

作者:Jiaming Su,Tengchao Yang,Ruikang Zhang,Zhengan Yan,Haoyu Sun,Linfeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Industrial anomaly generation, data scarcity problem, Industrial anomaly, knowledge retrieval, scarcity problem

备注

点击查看摘要

Abstract:Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms, lacking complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model's ability to improve anomaly synthesis prompt; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods. The code and data will be made publicly available.

111. 【2604.07890】Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging

链接https://arxiv.org/abs/2604.07890

作者:Ido Harlev,Tamar Oukhanov,Raz Ben-Uri,Leeat Keren,Shai Bagon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:three-dimensional tissue organization, inherently three-dimensional tissue, Highly multiplexed microscopy, rich spatial characterization, microscopy enables rich

备注: Accepted to The 11th IEEE Workshop on Computer Vision for Multimodal Microscopy Image Analysis (CVMI), a CVPR 2026 workshop

点击查看摘要

Abstract:Highly multiplexed microscopy enables rich spatial characterization of tissues at single-cell resolution, yet most analyses rely on two-dimensional sections despite inherently three-dimensional tissue organization. Acquiring dense volumetric data in spatial proteomics remains costly and technically challenging, leaving practitioners to choose between 2D sections or 3D serial sections under limited imaging budgets. In this work, we study how sampling geometry impacts the stability of commonly used spatial statistics, and we introduce a geometry-aware reconstruction module that enables sparse yet consistent 3D analysis from serial sections. Using controlled simulations, we show that planar sampling reliably recovers global cell-type abundance but exhibits high variance for local statistics such as cell clustering and cell-cell interactions, particularly for rare or spatially localized populations. We observe consistent behavior in real multiplexed datasets, where interaction metrics and neighborhood relationships fluctuate substantially across individual sections. To support sparse 3D analysis in practice, we present a reconstruction approach that links cell projections across adjacent sections using phenotype and proximity constraints and recovers single-cell 3D centroids using cell-type-specific shape priors. We further analyze the trade-off between section spacing, coverage, and redundancy, identifying acquisition regimes that maximize reconstruction utility under fixed imaging budgets. We validate the reconstruction module on a public imaging mass cytometry dataset with dense axial sampling and demonstrate its downstream utility on an in-house CODEX dataset by enabling structure-level 3D analyses that are unreliable in 2D. Together, our results provide diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is warranted.

112. 【2604.07884】Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition

链接https://arxiv.org/abs/2604.07884

作者:Xuemei Jia,Jiawei Du,Hui Wei,Jun Chen,Joey Tianyi Zhou,Zheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:High-fidelity generative models, severely restricted due, High-fidelity generative, generative models, copyright constraints

备注

点击查看摘要

Abstract:High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.

113. 【2604.07882】ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

链接https://arxiv.org/abs/2604.07882

作者:Boyuan Wang,Xiaofeng Wang,Yongkang Li,Zheng Zhu,Yifan Chang,Angen Ye,Guosheng Zhao,Chaojun Ni,Guan Huang,Yijie Ren,Yueqi Duan,Xingang Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reconstructing non-rigid objects, Reconstructing non-rigid, physical plausibility remains, significant challenge, non-rigid objects

备注

点击查看摘要

Abstract:Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.

114. 【2604.07879】FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

链接https://arxiv.org/abs/2604.07879

作者:Jinghan Yang,Yihe Fan,Xudong Pan,Min Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:safety risk due, image generation models, potential to generate, models have advanced, advanced rapidly

备注

点击查看摘要

Abstract:Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score while delivering transformative efficiency gains, including slashing peak GPU memory demand by over 97% and projection time from 8.1 seconds to 0.2 seconds compared to standard VAE decoding.

115. 【2604.07831】Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

链接https://arxiv.org/abs/2604.07831

作者:Wenkui Yang,Chao Jin,Haisu Zhu,Weilin Luo,Derek Yuen,Kun Shao,Huaibo Huang,Junxian Duan,Jie Cao,Ran He

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing red-teaming studies, studies on GUI, Existing red-teaming, GUI agents, important limitations

备注: 44 pages, 10 figures, public code will be available at [this https URL](https://github.com/HashTAG00002/UI-Injection)

点击查看摘要

Abstract:Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness under a more practical threat model, we propose Semantic-level UI Element Injection, a red-teaming setting that overlays safety-aligned and harmless UI elements onto screenshots to misdirect the agent's visual grounding. Our method uses a modular Editor-Overlapper-Victim pipeline and an iterative search procedure that samples multiple candidate edits, keeps the best cumulative overlay, and adapts future prompt strategies based on previous failures. Across five victim models, our optimized attacks improve attack success rate by up to 4.4x over random injection on the strongest victims. Moreover, elements optimized on one source model transfer effectively to other target models, indicating model-agnostic vulnerabilities. After the first successful attack, the victim still clicks the attacker-controlled element in more than 15% of later independent trials, versus below 1% for random injection, showing that the injected element acts as a persistent attractor rather than simple visual clutter.

116. 【2604.07823】LPM 1.0: Video-based Character Performance Model

链接https://arxiv.org/abs/2604.07823

作者:Ailing Zeng,Casper Yang,Chauncey Ge,Eddie Zhang,Garvey Xu,Gavin Lin,Gilbert Gu,Jeremy Pi,Leo Li,Mingyi Shi,Sheng Bi,Steven Tang,Thorn Hang,Tobey Guo,Vincent Li,Xin Tong,Yikang Li,Yuchen Sun, Yue (R)Zhao,Yuhan Lu,Yuwei Li,Zane Zhang,Zeshi Yang,Zi Ye

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:externalization of intent, temporal behavior, LPM, Performance, Large Performance Model

备注: 43 pages, 15 figures, 2 tables. Project page: [this https URL](https://large-performance-model.github.io)

点击查看摘要

Abstract:Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

117. 【2604.07814】AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models

链接https://arxiv.org/abs/2604.07814

作者:Hazza Mahmood,Yongqiang Yu,Rao Anwer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:interpretable plant disease, plant disease diagnosis, disease diagnosis remains, Accurate and interpretable, interpretable plant

备注: 9 pages

点击查看摘要

Abstract:Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: this https URL

118. 【2604.07812】HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

链接https://arxiv.org/abs/2604.07812

作者:Qihui Zhu,Tao Zhang,Yuchen Wang,Zijian Wen,Mengjie Zhang,Shuangwu Chen,Xiaobin Tan,Jian Yang,Yang Liu,Zhenhua Dong,Xianzhi Yu,Yinfei Pan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal large language, tokens significantly increases, large language models, visual, visual tokens

备注: CVPR 2026

点击查看摘要

Abstract:In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at this https URL.

119. 【2604.07803】he Weaponization of Computer Vision: Tracing Military-Surveillance Ties through Conference Sponsorship

链接https://arxiv.org/abs/2604.07803

作者:Noa Garcia,Amelia Katirai

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:artificial intelligence, visual data, enables the computational, generation of visual, Computer vision

备注: FAccT 2026

点击查看摘要

Abstract:Computer vision, a core domain of artificial intelligence (AI), is the field that enables the computational analysis, understanding, and generation of visual data. Despite being historically rooted in military funding and increasingly deployed in warfare, the field tends to position itself as a neutral, purely technical endeavor, failing to engage in discussions about its dual-use applications. Yet it has been reported that computer vision systems are being systematically weaponized to assist in technologies that inflict harm, such as surveillance or warfare. Expanding on these concerns, we study the extent to which computer vision research is being used in the military and surveillance domains. We do so by collecting a dataset of tech companies with financial ties to the field's central research exchange platform: conferences. Conference sponsorship, we argue, not only serves as strong evidence of a company's investment in the field but also provides a privileged position for shaping its trajectory. By investigating sponsors' activities, we reveal that 44% of them have a direct connection with military or surveillance applications. We extend our analysis through two case studies in which we discuss the opportunities and limitations of sponsorship as a means for uncovering technological weaponization.

120. 【2604.07802】Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

链接https://arxiv.org/abs/2604.07802

作者:Shaotian Li,Shangze Li,Chuancheng Shi,Wenhua Wu,Yanqiu Wu,Xiaohan Yu,Fei Shen,Tat-Seng Chua

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large-scale vision-language models, exhibit remarkable zero-shot, remarkable zero-shot capabilities, internal mechanisms driving, Large-scale vision-language

备注

点击查看摘要

Abstract:Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.

121. 【2604.07795】Image-Guided Geometric Stylization of 3D Meshes

链接https://arxiv.org/abs/2604.07795

作者:Changwoon Choi,Hyunsoo Lee,Clément Jambon,Yael Vinker,Young Min Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Recent generative models, Recent generative, create visually plausible, visually plausible, representations of objects

备注

点击查看摘要

Abstract:Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations. Project page: this https URL

122. 【2604.07786】Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

链接https://arxiv.org/abs/2604.07786

作者:Chanhyuk Choi,Taesoo Kim,Donggyu Lee,Siyeol Jung,Taehwan Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:gained significant attention, Talking face generation, Talking face, generative models, generation has gained

备注: Accepted to CVPR 2026. Project Page: [this https URL](https://chanhyeok-choi.github.io/C-MET/)

点击查看摘要

Abstract:Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speeches. Images-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speeches by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at this https URL

123. 【2604.07779】Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models

链接https://arxiv.org/abs/2604.07779

作者:Gexin Huang,Anqi Li,Yusheng Tan,Beidi Zhao,Gang Wang,Gaozu Hua,Xiaoxiao Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offering strong transfer, Pathology foundation models, Pathology foundation, strong transfer performance, computational histopathology

备注: 10 pages, 2 figures

点击查看摘要

Abstract:Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on \textbf{22} benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with $\sim$12$\times$ lower training cost than feature-fusion alternatives.

124. 【2604.07774】RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

链接https://arxiv.org/abs/2604.07774

作者:Peiran Xu,Jiaqi Zheng,Yadong Mu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:agent acquires visual, acquires visual observations, executes atomic actions, embodied task planning, paper focuses

备注: CVPR 2026

点击查看摘要

Abstract:This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context, and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model's performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach. Our codes will be available at this https URL.

125. 【2604.07772】ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

链接https://arxiv.org/abs/2604.07772

作者:Zihao Liu,Xiaoyu Wu,Wenna Li,Jianqin Wu,Linlin Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:live-streaming content moderation, explain abnormal events, aims to detect, content moderation, detect and explain

备注

点击查看摘要

Abstract:Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at this https URL.

126. 【2604.07765】RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

链接https://arxiv.org/abs/2604.07765

作者:Liang Yao,Shengxiang Xu,Fan Liu,Chuanyi Zhang,Bishun Yao,Rui Min,Yongjun Li,Chaoqian Ouyang,Shimin Di,Min-Ling Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Earth Observation, support domain experts, vague natural language, essentially designed, designed to support

备注

点击查看摘要

Abstract:Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.

127. 【2604.07763】Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

链接https://arxiv.org/abs/2604.07763

作者:Jingtong Dou,Chuancheng Shi,Jian Wang,Fei Shen,Zhiyong Wang,Tat-Seng Chua

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:artificial intelligence evolves, generative artificial intelligence, intelligence evolves, deepfake attacks, manipulations to complex

备注

点击查看摘要

Abstract:As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.

128. 【2604.07759】WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

链接https://arxiv.org/abs/2604.07759

作者:Junxiong Liang,Mengwei Bao,Tianxiang Wang,Xinggang Wang,An-An Liu,Ryan Wen Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:waterway transportation systems, fundamental perception task, intelligent waterway transportation, Ship detection, transportation systems

备注

点击查看摘要

Abstract:Ship detection for navigation is a fundamental perception task in intelligent waterway transportation systems. However, existing public ship detection datasets remain limited in terms of scale, the proportion of small-object instances, and scene diversity, which hinders the systematic evaluation and generalization study of detection algorithms in complex maritime environments. To this end, we construct WUTDet, a large-scale ship detection dataset. WUTDet contains 100,576 images and 381,378 annotated ship instances, covering diverse operational scenarios such as ports, anchorages, navigation, and berthing, as well as various imaging conditions including fog, glare, low-lightness, and rain, thereby exhibiting substantial diversity and challenge. Based on WUTDet, we systematically evaluate 20 baseline models from three mainstream detection architectures, namely CNN, Transformer, and Mamba. Experimental results show that the Transformer architecture achieves superior overall detection accuracy (AP) and small-object detection performance (APs), demonstrating stronger adaptability to complex maritime scenes; the CNN architecture maintains an advantage in inference efficiency, making it more suitable for real-time applications; and the Mamba architecture achieves a favorable balance between detection accuracy and computational efficiency. Furthermore, we construct a unified cross-dataset test set, Ship-GEN, to evaluate model generalization. Results on Ship-GEN show that models trained on WUTDet exhibit stronger generalization under different data distributions. These findings demonstrate that WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenarios. The dataset is publicly available at: this https URL.

129. 【2604.07758】DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics

链接https://arxiv.org/abs/2604.07758

作者:Hang Zhang,Qijian Tian,Jingyu Gong,Daoguo Dong,Xuhong Wang,Yuan Xie,Xin Tan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:closed-state image remains, image remains challenging, crucial motion cues, world models, single closed-state image

备注

点击查看摘要

Abstract:Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at this https URL.

130. 【2604.07753】Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

链接https://arxiv.org/abs/2604.07753

作者:Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Empowering Large Multimodal, Large Multimodal Models, Empowering Large, Multimodal Models, Large Multimodal

备注

点击查看摘要

Abstract:Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.

131. 【2604.07741】MSCT: Differential Cross-Modal Attention for Deepfake Detection

链接https://arxiv.org/abs/2604.07741

作者:Fangda Wei,Miao Liu,Yingxue Wang,Jing Wang,Shenghui Zhao,Nan Li

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:detection typically employs, complementary multi-modal model, deepfake detection typically, Audio-visual deepfake detection, typically employs

备注: Accpeted by ICASSP2026

点击查看摘要

Abstract:Audio-visual deepfake detection typically employs a complementary multi-modal model to check the forgery traces in the video. These methods primarily extract forgery traces through audio-visual alignment, which results from the inconsistency between audio and video modalities. However, the traditional multi-modal forgery detection method has the problem of insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.

132. 【2604.07740】Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

链接https://arxiv.org/abs/2604.07740

作者:Shogo Hamano,Shunya Wakasugi,Tatsuhito Sato,Sayaka Nakamura

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:video-based person Re-Identification, recent years, video-based person, person Re-Identification, non-overlapping cameras

备注

点击查看摘要

Abstract:In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.

133. 【2604.07728】GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting

链接https://arxiv.org/abs/2604.07728

作者:Jialin Li,Bin Fu,Ruiping Wang,Xilin Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

关键词:High-fidelity interactive digital, interactive digital assets, coupled geometry-motion relationships, High-fidelity interactive, objects remain challenging

备注: Accepted to CVPRF2026

点击查看摘要

Abstract:High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, yet articulated objects remain challenging to reconstruct due to their complex structures and coupled geometry-motion relationships. Existing methods suffer from instability in geometry-motion joint optimization, while their generalization remains limited on complex multi-joint or out-of-distribution objects. To address these challenges, we propose GEAR, an EM-style alternating optimization framework that jointly models geometry and motion as interdependent components within a Gaussian Splatting representation. GEAR treats part segmentation as a latent variable and joint motion parameters as explicit variables, alternately refining them for improved convergence and geometric-motion consistency. To enhance part segmentation quality without sacrificing generalization, we leverage a vanilla 2D segmentation model to provide multi-view part priors, and employ a weakly supervised constraint to regularize the latent variable. Experiments on multiple benchmarks and our newly constructed dataset GEAR-Multi demonstrate that GEAR achieves state-of-the-art results in geometric reconstruction and motion parameters estimation, particularly on complex articulated objects with multiple movable parts.

134. 【2604.07723】Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

链接https://arxiv.org/abs/2604.07723

作者:Jiahao Li,Yang Lu,Yachao Zhang,Fangyong Wang,Yuan Xie,Yanyun Qu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:existing methods possess, methods possess pixel-level, possess pixel-level vision-language, pixel-level vision-language alignment, vision-language alignment capability

备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, \ie, logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.

135. 【2604.07722】Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology

链接https://arxiv.org/abs/2604.07722

作者:Swarnadip Chatterjee,Vladimir Basic,Arrigo Capitanio,Orcun Goksel,Joakim Lindblad

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:vanishingly rare amid, malignant cells, images is difficult, morphologically diverse, diverse yet vanishingly

备注: 15 pages, 7 figures

点击查看摘要

Abstract:In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1\%$) and, in some cases, even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning. These findings highlight one-class representation learning as a robust and interpretable superior choice to MIL for malignant cell detection under extreme rarity.

136. 【2604.07675】FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction

链接https://arxiv.org/abs/2604.07675

作者:Jinzhen Han,JinByeong Lee,Hak Han,YeonJu Na,Jae-Joon Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:next-day wildfire spread, Toggle, wildfire spread, Accurate prediction, Wildfire Spread Prediction

备注

点击查看摘要

Abstract:Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures -- spanning pure CNNs, Vision Transformers, and hybrid designs -- on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8* more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset's coarse temporal resolution. We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.07675 [cs.CV]

(or
arXiv:2604.07675v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.07675

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Han Jinzhen [view email] [v1]
Thu, 9 Apr 2026 00:39:03 UTC (478 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction, by Jinzhen Han and 4 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: cs.CV

prev

|
next

new
|
recent
| 2026-04

Change to browse by:

cs

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

137. 【2604.07674】Weight Group-wise Post-Training Quantization for Medical Foundation Model

链接https://arxiv.org/abs/2604.07674

作者:Yineng Chen,Peng Huang,Aozhong Zhang,Hui Guo,Penghang Yin,Shu Hu,Shao Lin,Xin Li,Tzu-Jen Kao,Balakrishnan Prabhakaran,MingChing Chang,Xin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:medical image analysis, achieved remarkable results, Foundation models, image analysis, achieved remarkable

备注

点击查看摘要

Abstract:Foundation models have achieved remarkable results in medical image analysis. However, its large network architecture and high computational complexity significantly impact inference speed, limiting its application on terminal medical devices. Quantization, a technique that compresses models into low-bit versions, is a solution to this challenge. In this paper, we propose a post-training quantization algorithm, Permutation-COMQ. It eliminates the need for backpropagation by using simple dot products and rounding operations, thereby removing hyperparameter tuning and simplifying the process. Additionally, we introduce a weight-aware strategy that reorders the weight within each layer to address the accuracy degradation induced by channel-wise scaling during quantization, while preserving channel structure. Experiments demonstrate that our method achieves the best results in 2-bit, 4-bit, and 8-bit quantization.

138. 【2604.07665】Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation

链接https://arxiv.org/abs/2604.07665

作者:Yanbo Gao,Huibin Bai,Huasong Zhou,Xingyu Gao,Shuai Li,Xun Cai,Hui Yuan,Wei Hua,Tian Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Self-supervised monocular depth, received increasing interests, monocular depth estimation, Self-supervised monocular, enhanced monocular depth

备注: Accepted by IEEE Transactions on Circuits and Systems for Video Technology

点击查看摘要

Abstract:Self-supervised monocular depth estimation (MDE) has received increasing interests in the last few years. The objects in the scene, including the object size and relationship among different objects, are the main clues to extract the scene structure. However, previous works lack the explicit handling of the changing sizes of the object due to the change of its depth. Especially in a monocular video, the size of the same object is continuously changed, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, by incorporating the prior relationship between the object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark and our method achieves the best results with an improvement up to 11.6% in terms of SqRel reduction. Ablation study also validates the effectiveness of each proposed module.

139. 【2604.07664】Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach

链接https://arxiv.org/abs/2604.07664

作者:Huibin Bai,Shuai Li,Hanxiao Zhai,Yanbo Gao,Chong Lv,Yibo Wang,Haipeng Ping,Wei Hua,Xingyu Gao

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:fundamental computer vision, computer vision task, Monocular Depth Estimation, Monocular Depth, current mainstream MDE

备注: Accepted by IEEE TMM

点击查看摘要

Abstract:Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets. Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at this https URL.

140. 【2604.07656】MVOS_HSI: A Python Library for Preprocessing Agricultural Crop Hyperspectral Data

链接https://arxiv.org/abs/2604.07656

作者:Rishik Aggarwal,Krisha Joshi,Pappu Kumar Yadav,Jianwei Qin,Thomas F. Burks,Moon S. Kim

类目:oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)

关键词:Hyperspectral imaging, plant traits non-destructively, study plant traits, traits non-destructively, Hyperspectral

备注: 11 pages

点击查看摘要

Abstract:Hyperspectral imaging (HSI) allows researchers to study plant traits non-destructively. By capturing hundreds of narrow spectral bands per pixel, it reveals details about plant biochemistry and stress that standard cameras miss. However, processing this data is often challenging. Many labs still rely on loosely organized collections of lab-specific MATLAB or Python scripts, which makes workflows difficult to share and results difficult to reproduce. MVOS_HSI is an open-source Python library that provides an end-to-end workflow for processing leaf-level HSI data. The software handles everything from calibrating raw ENVI files to detecting and clipping individual leaves based on multiple vegetation indices (NDVI, CIRedEdge and GCI). It also includes tools for data augmentation to create training-time variations for machine learning and utilities to visualize spectral profiles. MVOS_HSI can be used as an importable Python library or run directly from the command line. The code and documentation are available on GitHub. By consolidating these common tasks into a single package, MVOS_HSI helps researchers produce consistent and reproducible results in plant phenotyping

141. 【2604.07634】VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

链接https://arxiv.org/abs/2604.07634

作者:Pavan Kumar Anasosalu Vasu,Cem Koc,Fartash Faghri,Chun-Liang Li,Bo Feng,Zhengfeng Lai,Meng Cao,Oncel Tuzel,Hadi Pouransari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:continuously generate responses, continuously generate, instruction prompt, online stream, Streaming

备注: CVPR Findings 2026

点击查看摘要

Abstract:Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at this https URL.

142. 【2604.07607】EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

链接https://arxiv.org/abs/2604.07607

作者:Ryan Punamiya,Simar Kareer,Zeyi Liu,Josh Citron,Ri-Zhao Qiu,Xiongyi Cai,Alexey Gavryushin,Jiaqi Chen,Davide Liconti,Lawrence Y. Zhu,Patcharapong Aphiwetsa,Baoyu Li,Aniketh Cheluva,Pranav Kuppili,Yangcen Liu,Dhruv Patel,Aidan Gao,Hye-Young Chung,Ryan Co,Renee Zbizika,Jeff Liu,Xiaomeng Xu,Haoyu Xiong,Geng Chen,Sebastiano Oliani,Chenyu Yang,Xi Wang,James Fort,Richard Newcombe,Josh Gao,Jason Chong,Garrett Matsuda,Aseem Doriwala,Marc Pollefeys,Robert Katzschmann,Xiaolong Wang,Shuran Song,Judy Hoffman,Danfei Xu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:collection remains expensive, data collection remains, Robot learning increasingly, Robot learning, large and diverse

备注

点击查看摘要

Abstract:Robot learning increasingly depends on large and diverse data, yet robot data collection remains expensive and difficult to scale. Egocentric human data offer a promising alternative by capturing rich manipulation behavior across everyday environments. However, existing human datasets are often limited in scope, difficult to extend, and fragmented across institutions. We introduce EgoVerse, a collaborative platform for human data-driven robot learning that unifies data collection, processing, and access under a shared framework, enabling contributions from individual researchers, academic labs, and industry partners. The current release includes 1,362 hours (80k episodes) of human demonstrations spanning 1,965 tasks, 240 scenes, and 2,087 unique demonstrators, with standardized formats, manipulation-relevant annotations, and tooling for downstream learning. Beyond the dataset, we conduct a large-scale study of human-to-robot transfer with experiments replicated across multiple labs, tasks, and robot embodiments under shared protocols. We find that policy performance generally improves with increased human data, but that effective scaling depends on alignment between human data and robot learning objectives. Together, the dataset, platform, and study establish a foundation for reproducible progress in human data-driven robot learning. Videos and additional information can be found at this https URL

143. 【2604.07606】Bootstrapping Sign Language Annotations with Sign Language Models

链接https://arxiv.org/abs/2604.07606

作者:Colin Lea,Vasileios Baltatzis,Connor Gillis,Raja Kushalnagar,Lorna Quandt,Leah Findlater

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ASL STEM Wiki, AI-driven sign language, sign language interpretation, high-quality annotated data, including ASL STEM

备注: Accepted to CVPR Findings 2026

点击查看摘要

Abstract:AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL contain professional interpreters and 100s of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive costs of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and on ASL Citizen datasets (74% top-1 accuracy). To validate and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.

144. 【2604.07578】MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition

链接https://arxiv.org/abs/2604.07578

作者:Muhammad Imran Sharif,Doina Caragea

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:behavioral mechanisms, important for understanding, understanding neural, neural and behavioral, Recognition of rodent

备注: 25 pages, 10 figures, submitted to Scientific Reports

点击查看摘要

Abstract:Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.

145. 【2604.07577】Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

链接https://arxiv.org/abs/2604.07577

作者:Katerina Katsarou,George Zountsas,Karam Tomotaki-Dawoud,Alexander Ehrenhoefer,Paul Chojecki,David Przewozny,Igor Maximilian Sauer,Amira Mouakher,Sebastian Bosse

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reliable monitoring, surgical instrument exchanges, maintaining procedural efficiency, operating room, exchanges is essential

备注: 12 Pages, 6 figures, CVPR 2026 Workshop AI4RWC

点击查看摘要

Abstract:Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.

146. 【2604.07574】Mathematical Analysis of Image Matching Techniques

链接https://arxiv.org/abs/2604.07574

作者:Oleh Samoilenko

类目:Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

关键词:geospatial data analysis, Computer Vision, problem in Computer, Vision with direct, Scale-Invariant Feature Transform

备注: 16 pages, 5 figures, 1 table

点击查看摘要

Abstract:Image matching is a fundamental problem in Computer Vision with direct applications in robotics, remote sensing, and geospatial data analysis. We present an analytical and experimental evaluation of classical local feature-based image matching algorithms on satellite imagery, focusing on the Scale-Invariant Feature Transform (SIFT) and the Oriented FAST and Rotated BRIEF (ORB). Each method is evaluated through a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation. Matching quality is assessed using the Inlier Ratio - the fraction of correspondences consistent with the estimated homography. The study uses a manually constructed dataset of GPS-annotated satellite image tiles with intentional overlaps. We examine the impact of the number of extracted keypoints on the resulting Inlier Ratio.

147. 【2604.07563】On the Uphill Battle of Image frequency Analysis

链接https://arxiv.org/abs/2604.07563

作者:Nader Bazyari,Hedieh Sajedi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:newly proposed clustering, Inverse Square, Square Mean Shift, proposed clustering algorithm, clustering algorithm called

备注: paper was accepted to IPCV 2021 track in CSCE 2021 cogress in a peer review process but was not published. [this https URL](https://www.american-cse.org/csce2021/publisher)

点击查看摘要

Abstract:This work is a follow up on the newly proposed clustering algorithm called The Inverse Square Mean Shift Algorithm. In this paper a special case of algorithm for dealing with non-homogenous data is formulated and the three dimensional Fast Fourier Transform of images is investigated with the aim of finding hidden patterns.

148. 【2604.07522】raining-free Spatially Grounded Geometric Shape Encoding (Technical Report)

链接https://arxiv.org/abs/2604.07522

作者:Yuhang He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:discrete point-wise positions, achieved remarkable success, grounding deep neural, deep neural networks, Positional encoding

备注: Training-Free 2D Geometric Shape Encoding

点击查看摘要

Abstract:Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.

149. 【2604.07477】SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces

链接https://arxiv.org/abs/2604.07477

作者:Abduz Zami

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:medical imaging diagnostics, computer vision allowing, forensic analysis, photographic improvement, imaging diagnostics

备注: BSc thesis

点击查看摘要

Abstract:For applications including facial identification, forensic analysis, photographic improvement, and medical imaging diagnostics, facial image deblurring is an essential chore in computer vision allowing the restoration of high-quality images from blurry inputs. Often based on general picture priors, traditional deblurring techniques find it difficult to capture the particular structural and identity-specific features of human faces. We present SMFD-UNet (Semantic Mask Fusion Deblurring UNet), a new lightweight framework using semantic face masks to drive the deblurring process, therefore removing the need for high-quality reference photos in order to solve these difficulties. First, our dual-step method uses a UNet-based semantic mask generator to directly extract detailed facial component masks (e.g., eyes, nose, mouth) straight from blurry photos. Sharp, high-fidelity facial images are subsequently produced by integrating these masks with the blurry input using a multi-stage feature fusion technique within a computationally efficient UNet framework. We created a randomized blurring pipeline that roughly replicates real-world situations by simulating around 1.74 trillion deterioration scenarios, hence guaranteeing resilience. Examined on the CelebA dataset, SMFD-UNet shows better performance than state-of-the-art models, attaining higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) while preserving satisfactory naturalness measures, including NIQE, LPIPS, and FID. Powered by Residual Dense Convolution Blocks (RDC), a multi-stage feature fusion strategy, efficient and effective upsampling techniques, attention techniques like CBAM, post-processing techniques, and the lightweight design guarantees scalability and efficiency, enabling SMFD-UNet to be a flexible solution for developing facial image restoration research and useful applications.

150. 【2604.07430】HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

链接https://arxiv.org/abs/2604.07430

作者:Tencent Robotics X,HY Vision Team:Xumin Yu,Zuyan Liu,Ziyi Wang,He Zhang,Yongming Rao,Fangfu Liu,Yani Zhang,Ruowen Zhao,Oran Wang,Yves Liang,Haitao Lin,Minghui Wang,Yubo Dong,Kevin Cheng,Bolin Ni,Rui Huang,Han Hu,Zhengyou Zhang,Linus,Shunyu Yao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:models specifically designed, embodied agents, embodied, models, real-world embodied agents

备注

点击查看摘要

Abstract:We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at this https URL.

151. 【2604.07429】GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

链接https://arxiv.org/abs/2604.07429

作者:Mingyu Ouyang,Siyuan Hu,Kevin Qinghong Lin,Hwee Tou Ng,Mike Zheng Shou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词:Large Language Model, Multimodal Large Language, Language Model, Large Language, Multimodal Large

备注: 23 pages, 8 figures

点击查看摘要

Abstract:Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at this https URL.

152. 【2604.07427】Personalizing Text-to-Image Generation to Individual Taste

链接https://arxiv.org/abs/2604.07427

作者:Anne-Sofie Maerten,Juliane Verwiebe,Shyamgopal Karthik,Ameya Prabhu,Johan Wagemans,Matthias Bethge

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:models generate high-fidelity, generate high-fidelity visuals, generate high-fidelity, remain indifferent, Modern

备注

点击查看摘要

Abstract:Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models optimize for "average" human appeal, they fail to capture the inherent subjectivity of aesthetic judgment. In this work, we introduce a novel dataset and predictive framework, called PAMELA, designed to model personalized image evaluations. Our dataset comprises 70,000 ratings across 5,000 diverse images generated by state-of-the-art models (Flux 2 and Nano Banana). Each image is evaluated by 15 unique users, providing a rich distribution of subjective preferences across domains such as art, design, fashion, and cinematic photography. Leveraging this data, we propose a personalized reward model trained jointly on our high-quality annotations and existing aesthetic assessment subsets. We demonstrate that our model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences. Using our personalized predictor, we demonstrate how simple prompt optimization methods can be used to steer generations towards individual user preferences. Our results highlight the importance of data quality and personalization to handle the subjectivity of user preferences. We release our dataset and model to facilitate standardized research in personalized T2I alignment and subjective visual quality assessment.

153. 【2604.07413】FORGE:Fine-grained Multimodal Evaluation for Manufacturing Scenarios

链接https://arxiv.org/abs/2604.07413

作者:Xiangru Jian,Hao Xu,Wei Pang,Xinjian Zhao,Chengyu Tao,Qixin Zhang,Xikun Zhang,Chao Zhang,Guanzhi Deng,Alex Xue,Juan Du,Tianshu Yu,Garth Tarr,Linqi Song,Qiuzhuang Sun,Dacheng Tao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, adopting Multimodal Large, Large Language, increasingly adopting Multimodal

备注: Project Page: [this https URL](https://ai4manufacturing.github.io/forge-web)

点击查看摘要

Abstract:The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. Wefirst construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at this https URL.

154. 【2604.07395】A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

链接https://arxiv.org/abs/2604.07395

作者:Wenze Wang,Mehdi Hosseinzadeh,Feras Dayoub

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Robotic manipulation systems, largely single-shot manner, follow language instructions, semantically wrong grasps, Robotic manipulation

备注: Project page: [this https URL](https://wenzewwz123.github.io/Agentic-Loop/)

点击查看摘要

Abstract:Robotic manipulation systems that follow language instructions often execute grasp primitives in a largely single-shot manner: a model proposes an action, the robot executes it, and failures such as empty grasps, slips, stalls, timeouts, or semantically wrong grasps are not surfaced to the decision layer in a structured way. Inspired by agentic loops in digital tool-using agents, we reformulate language-guided grasping as a bounded embodied agent operating over grounded execution states, where physical actions expose an explicit tool-state stream. We introduce a physical agentic loop that wraps an unmodified learned manipulation primitive (grasp-and-lift) with (i) an event-based interface and (ii) an execution monitoring layer, Watchdog, which converts noisy gripper telemetry into discrete outcome labels using contact-aware fusion and temporal stabilization. These outcome events, optionally combined with post-grasp semantic verification, are consumed by a deterministic bounded policy that finalizes, retries, or escalates to the user for clarification, guaranteeing finite termination. We validate the resulting loop on a mobile manipulator with an eye-in-hand D405 camera, keeping the underlying grasp model unchanged and evaluating representative scenarios involving visual ambiguity, distractors, and induced execution failures. Results show that explicit execution-state monitoring and bounded recovery enable more robust and interpretable behavior than open-loop execution, while adding minimal architectural overhead. For the source code and demo refer to our project page: this https URL

155. 【2604.08305】HistDiT: A Structure-Aware Latent Conditional Diffusion Model for High-Fidelity Virtual Staining in Histopathology

链接https://arxiv.org/abs/2604.08305

作者:Aasim Bin Saleem,Amr Ahmed,Ardhendu Behera,Hafeezullah Amin,Iman Yi Liao,Mahmoud Khattab,Pan Jia Wern,Haslina Makmur

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

关键词:Epidermal growth-factor Receptor, Human Epidermal growth-factor, assessing specific immune, specific immune biomarkers, Human Epidermal

备注: Accepted to ICPR 2026

点击查看摘要

Abstract:Immunohistochemistry (IHC) is essential for assessing specific immune biomarkers like Human Epidermal growth-factor Receptor 2 (HER2) in breast cancer. However, the traditional protocols of obtaining IHC stains are resource-intensive, time-consuming, and prone to structural damages. Virtual staining has emerged as a scalable alternative, but it faces significant challenges in preserving fine-grained cellular structures while accurately translating biochemical expressions. Current state-of-the-art methods still rely on Generative Adversarial Networks (GANs) or standard convolutional U-Net diffusion models that often struggle with "structure and staining trade-offs". The generated samples are either structurally relevant but blurry, or texturally realistic but have artifacts that compromise their diagnostic use. In this paper, we introduce HistDiT, a novel latent conditional Diffusion Transformer (DiT) architecture that establishes a new benchmark for visual fidelity in virtual histological staining. The novelty introduced in this work is, a) the Dual-Stream Conditioning strategy that explicitly maintains a balance between spatial constraints via VAE-encoded latents and semantic phenotype guidance via UNI embeddings; b) the multi-objective loss function that contributes to sharper images with clear morphological structure; and c) the use of the Structural Correlation Metric (SCM) to focus on the core morphological structure for precise assessment of sample quality. Consequently, our model outperforms existing baselines, as demonstrated through rigorous quantitative and qualitative evaluations.

156. 【2604.07780】MonoUNet: A Robust Tiny Neural Network for Automated Knee Cartilage Segmentation on Point-of-Care Ultrasound Devices

链接https://arxiv.org/abs/2604.07780

作者:Alvin Kimbowa,Arjun Parmar,Ibrahim Mujtaba,Will Wei,Maziar Badii,Matthew Harkey,David Liu,Ilker Hacihaliloglu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:deep learning model, compact deep learning, POCUS devices, automated knee cartilage, handheld POCUS devices

备注: Accepted to Ultrasound in Medicine Biology

点击查看摘要

Abstract:Objective: To develop a robust and compact deep learning model for automated knee cartilage segmentation on point-of-care ultrasound (POCUS) devices. Methods: We propose MonoUNet, an ultra-compact U-Net consisting of (i) an aggressively reduced backbone with an asymmetric decoder, (ii) a trainable monogenic block that extracts multi-scale local phase features, and (iii) a gated feature injection mechanism that integrates these features into the encoder stages to reduce sensitivity to variations in ultrasound image appearance and improve robustness across devices. MonoUNet was evaluated on a multi-site, multi-device knee cartilage ultrasound dataset acquired using cart-based, portable, and handheld POCUS devices. Results: Overall, MonoUNet outperformed existing lightweight segmentation models, with average Dice scores ranging from 92.62% to 94.82% and mean average surface distance (MASD) values between 0.133 mm and 0.254 mm. MonoUNet reduces the number of parameters by 10x--700x and computational cost by 14x--2000x relative to existing lightweight models. MonoUNet cartilage outcomes showed excellent reliability and agreement with the manual outcomes: intraclass correlation coefficients (ICC$_{2,k})$=0.96 and bias=2.00% (0.047 mm) for average thickness, and ICC$_{2,k}$=0.99 and bias=0.80% (0.328 a.u.) for echo intensity. Conclusion: Incorporating trainable local phase features improves the robustness of highly compact neural networks for knee cartilage segmentation across varying acquisition settings and could support scalable ultrasound-based assessment and monitoring of knee osteoarthritis using POCUS devices. The code is publicly available at this https URL.

Comments:
Accepted to Ultrasound in Medicine Biology

Subjects:

Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.07780 [eess.IV]

(or
arXiv:2604.07780v1 [eess.IV] for this version)

https://doi.org/10.48550/arXiv.2604.07780

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Alvin Kimbowa [view email] [v1]
Thu, 9 Apr 2026 04:14:16 UTC (8,411 KB)