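This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

677 papers were updated today, including:

Natural language processing: 84
Information retrieval: 26
Computer vision: 108

Natural Language Processing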

1. [2604.28182] Exploration Hacking: Can LLMs Learn to Resist RL Training?

Authors: Eyon Jang, Damon Falck, Joschka Braun, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Reinforcement learning, large language models, capabilities and alignment, post-training of large, large language

Comments: 81 pages, 37 figures

Abstract:Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI R&D environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.

2. [2604.28181] Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Link: https://arxiv.org/abs/2604.28181

Authors: Tao Ge, Baolin Peng, Hao Cheng, Jianfeng Gao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: user-specific computer environments, Realistic long-horizon productivity, content-rich artifacts, stored and organized, organized through directory

Comments: Preview version; work in progress

Abstract:Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer -- for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts -- until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.


3. [2604.28147] On the Proper Treatment of Units in Surprisal Theory

Link: https://arxiv.org/abs/2604.28147

Authors: Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, Ryan Cotterell

Categories: Computation and Language (cs.CL)

Keywords: theory links human, links human processing, human processing effort, Surprisal theory links, upcoming linguistic unit

Comments: ACL 2026 (main conference)

Abstract:Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
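
To make the unit mismatch concrete, the following minimal Python sketch shows the kind of ad hoc procedure the abstract alludes to: summing subword-token surprisals to obtain a word-level surprisal. The token probabilities, the word-to-token alignment, and the helper name `word_surprisals` are all illustrative assumptions; this is not the framework proposed in the paper.

```python
import math

def word_surprisals(token_probs, word_lengths):
    """Sum token-level surprisals (in bits) into word-level surprisals.

    token_probs: conditional probability of each token given its prefix.
    word_lengths: number of tokens that make up each word, in order.
    Both inputs are made-up illustrative values, not real model output.
    """
    if sum(word_lengths) != len(token_probs):
        raise ValueError("word_lengths must cover every token exactly once")
    surprisals = []
    start = 0
    for n in word_lengths:
        # Simplifying assumption: the word is exactly this token span, so by
        # the chain rule its surprisal is the sum of its token surprisals.
        bits = sum(-math.log2(p) for p in token_probs[start:start + n])
        surprisals.append(bits)
        start += n
    return surprisals

# Two words: the first is one token, the second is split into two tokens.
print(word_surprisals([0.5, 0.25, 0.5], [1, 2]))  # [1.0, 3.0]
```

The alignment passed in as `word_lengths` is exactly the kind of implicit choice the abstract argues should be made explicit.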

4. [2604.28123] PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Link: https://arxiv.org/abs/2604.28123

Authors: Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: applies supervised fine-tuning, standard post-training recipe, applies supervised, supervised fine-tuning, verifiable rewards

Abstract:The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at this https URL.

5. [2604.28098] Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI

Link: https://arxiv.org/abs/2604.28098

Authors: Dorottya Demszky, Edith Bouton, Alison Twiner, Sara Hennessy, Richard Correnti

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: in-depth ethnographic work, ethnographic work, classroom interaction, interaction has long, long been divided

Abstract:Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping this methodological space along three dimensions--scale, duration, and modality--where a study's position shapes what it reveals and obscures. We illustrate it through contrasting studies of dialogic teaching--Howe et al. (2019) and Snell and Lefstein (2018)--and an interview with the lead researchers, organized around three questions: what can be operationalized, what mechanisms become visible, and what translates to practice. We then examine how AI is expanding this space and how the framework can guide research and tool design.

6. [2604.28076] TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Link: https://arxiv.org/abs/2604.28076

Authors: An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, simple aggregation, advanced Table Question, Language Models

Abstract:Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

7. [2604.28075] Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Link: https://arxiv.org/abs/2604.28075

Authors: Ansar Aynetdinov, Patrick Haller, Alan Akbik

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: massive English web, English web corpora, filtering massive English, massive English, improves training efficiency

Abstract:Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.

8. [2604.28061] Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

Link: https://arxiv.org/abs/2604.28061

Authors: Lauren Cadwallader, Iain Hrynaszkiewicz, parth sarin, Tim Vines

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)

Keywords: Numerous metascience studies, open science practices, Numerous metascience, open science, prevalence of open

Comments: 12 pages. Submitted to 30th Annual International Conference on Science and Technology Indicators

Abstract:Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understand the 'downstream' effects or impacts of open science. PLOS and DataSeer have developed a new LLM-based indicator to measure an important effect of open science: the reuse of research data. Our results show a data reuse rate of 43%, which is higher than established bibliometric techniques. We show that data reuse can be measured at scale using LLMs and generative artificial intelligence. The positive effects of research data sharing and reuse may currently be underestimated.

9. [2604.28048] Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Link: https://arxiv.org/abs/2604.28048

Authors: Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: Large Language Models, Large Language, reproducible behavioral diversity, prompting produces meaningful, Language Models

Comments: 8 pages, 8 figures. IEEE DCOSS - UrbCom

Abstract:Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

10. [2604.28034] Ease of dependency distance minimization in star-like structures

Link: https://arxiv.org/abs/2604.28034

Authors: Emília Garcia-Casademont, Ramon Ferrer-i-Cancho

Categories: Computation and Language (cs.CL); Physics and Society (physics.soc-ph)

Keywords: dependency distance minimization, distance minimization, dependency distance, dependencies between words, syntactic dependency distance

Abstract:The syntactic structure of a sentence can be represented as a tree where edges indicate syntactic dependencies between words. When that structure is a star, it has been demonstrated that the head should be placed in the middle of the linear arrangement according to the principle of syntactic dependency distance minimization. However, hubs of stars tend to be put at one of the ends, against that principle. Here we address two questions: (1) How difficult is it to minimize dependency distance? (2) Why anti dependency distance minimization effects have been found in star structures but not in path structures? The ease of optimization is determined by the shape of the optimization landscape. It was demonstrated that the landscape of star structures is quasiconvex (Ferrer-i-Cancho 2015, Language Dynamics and Change). As for (1), here we show that it is indeed convex (a particular case of quasiconvexity) both for star trees and quasistar trees and thus the distance-based optimization problem is simpler than previously believed. As for (2), we argue that (a) competing principles, rather than the difficulty of optimization, must be the actual reason for anti-dependency distance minimization effects and that (b) dependency distance minimization on star-like structures is less rewarding compared to other structures.
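
As a small illustration of the claim that the hub of a star is best placed in the middle, the Python sketch below computes the sum of dependency distances of a star tree as a function of the hub position and checks discrete convexity of that cost. It only varies the hub position; the paper's formal definition of the optimization landscape may differ, and the function names are hypothetical.

```python
def star_distance_sum(n, hub_pos):
    """Sum of dependency distances in a star tree with n words.

    The hub sits at 1-based position hub_pos; every other word attaches
    to it, and the distance of a dependency is the gap in positions.
    """
    if not 1 <= hub_pos <= n:
        raise ValueError("hub_pos must lie in 1..n")
    return sum(abs(pos - hub_pos) for pos in range(1, n + 1))

def is_convex(values):
    """Discrete convexity: every second difference is non-negative."""
    return all(values[i - 1] - 2 * values[i] + values[i + 1] >= 0
               for i in range(1, len(values) - 1))

n = 7
costs = [star_distance_sum(n, p) for p in range(1, n + 1)]
print(costs)             # [21, 16, 13, 12, 13, 16, 21]
print(is_convex(costs))  # True
```

For seven words the cost is lowest when the hub is at position 4, the middle, and grows symmetrically toward either end.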

11. [2604.28031] Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

Link: https://arxiv.org/abs/2604.28031

Authors: Garvin Kruthof

Categories: Computation and Language (cs.CL)

Keywords: researchers iteratively refine, iteratively refine ideas, large language models, models preserve fidelity, researchers iteratively

Abstract:When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and 38 research briefs from 24 scientific domains, we find that iterative pressure reliably increases structural complexity and often reduces adherence to original constraints. A restatement probe reveals a dissociation between declarative recall and behavioral adherence, as models accurately restate constraints they simultaneously violate. The knows-but-violates (KBV) rate, measuring constraint non-compliance despite preserved recall, ranges from 8% to 99% across models. Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists. Human validation against blind raters confirms that the LLM judge under-detects constraint violations, making reported constraint adherence scores conservative. Sensitivity analyses confirm the findings are robust to temperature (0.7 vs. 1.0) and pressure type (novelty vs. rigor). We release all briefs, prompts, rubrics, transcripts, and scores as an open benchmark.

12. [2604.28028] Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

Link: https://arxiv.org/abs/2604.28028

Authors: Smit Jivani, Sarvam Maheshwari, Sunita Sarawagi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: Large language models, query structured data, Large language, allowing users, growing ease

Comments: Project code: https://github.com/SSLab-CSE-IITB/tecod

Abstract:Large language models (LLMs) have revolutionized Text-to-SQL generation, allowing users to query structured data using natural language with growing ease. Yet, real-world deployment remains challenging, especially in complex or unseen schemas, due to inconsistent accuracy and the risk of generating invalid SQL. We introduce Template Constrained Decoding (TeCoD), a system that addresses these limitations by harnessing the recurrence of query patterns in labeled workloads. TeCoD converts historical NL-SQL pairs into reusable templates and introduces a robust template selection module that uses a fine-tuned natural language inference model to match or reject queries efficiently. Once the template is selected, TeCoD enforces it during SQL generation through grammar-constrained decoding, implemented via a novel partitioned strategy that ensures both syntactic validity and efficiency. Together, these components yield up to 36% higher execution accuracy than in-context learning (ICL) and 2.2x lower latency on matched queries.

13. [2604.27998] Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Link: https://arxiv.org/abs/2604.27998

Authors: Jingcheng Deng, Zihao Wei, Liang Pang, Junhong Wu, Shicheng Xu, Zenghao Duan, Huawei Shen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: compressing intermediate reasoning, substantially shortening reasoning, Relative Policy Optimization, Group Relative Policy, Latent reasoning

Comments: This is an actively developing work, and we will continue to update the arXiv version

Abstract:Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning in latent space remains highly unstable. We study this problem through the lens of Group Relative Policy Optimization (GRPO), and show that directly adapting GRPO to latent reasoning is fundamentally non-trivial: latent reasoning changes both the probability density and the sampling mechanism, causing three coupled bottlenecks: absence of intrinsic latent manifolds, where unconstrained exploration pushes rollouts off the valid latent manifold; exploration-optimization misalignment, where trajectory-level rewards can induce incorrect token-level updates; and latent mixture non-closure, where jointly reinforcing multiple correct latent paths can produce an invalid averaged state. To address them, we propose Latent-GRPO, which combines invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Across four low-difficulty benchmarks (e.g., GSM8K-Aug) and four high-difficulty benchmarks (e.g., AIME), Latent-GRPO improves over its latent initialization by 7.86 Pass@1 points on low-difficulty tasks and surpasses explicit GRPO by 4.27 points on high-difficulty tasks while using 3-4x shorter reasoning chains. It also achieves stronger pass@k performance under Gumbel sampling. These results establish Latent-GRPO as an effective approach for stable and efficient latent reasoning.
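
For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage (reward minus group mean, divided by group standard deviation) together with one plausible reading of "invalid-sample advantage masking". The masking rule here is an assumption made for illustration: invalid rollouts are left out of the group statistics and receive zero advantage. The abstract does not specify the exact rule, and the use of the population standard deviation is likewise a choice that varies between implementations.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, valid, eps=1e-8):
    """GRPO-style advantages with a mask for invalid rollouts.

    rewards: scalar reward per rollout in one group (same prompt).
    valid: booleans; False marks a rollout treated as invalid.
    Assumption (not from the paper): invalid rollouts are excluded from
    the group statistics and get a zero advantage.
    """
    if len(rewards) != len(valid):
        raise ValueError("rewards and valid must have the same length")
    kept = [r for r, ok in zip(rewards, valid) if ok]
    if not kept:
        # Nothing usable in this group, so no rollout carries a signal.
        return [0.0] * len(rewards)
    mu = mean(kept)
    sigma = pstdev(kept)  # population standard deviation of valid rewards
    return [(r - mu) / (sigma + eps) if ok else 0.0
            for r, ok in zip(rewards, valid)]

# The last rollout is flagged invalid, so its large reward is ignored.
advantages = group_relative_advantages(
    [1.0, 0.0, 1.0, 0.0, 5.0],
    [True, True, True, True, False],
)
print(advantages)
```

With the invalid rollout masked, the four valid rollouts get advantages close to +1 and -1, and the masked one gets exactly 0.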

14. [2604.27934] MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection

Link: https://arxiv.org/abs/2604.27934

Authors: Weihai Lu, Zhejun Zhao, Yanshu Li, Huan He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: understanding public discourse, effectively fusing text, Multimodal Stance Detection, Stance Detection, Multi-agent Stance Detection

Comments: Accepted to ACL 2026 Main Conference

Abstract:Multimodal Stance Detection (MSD) is crucial for understanding public discourse, yet effectively fusing text and image, especially with conflicting signals, remains challenging. Existing methods often face difficulties with contextual grounding, cross-modal interpretation ambiguity, and single-pass reasoning fragility. To address these, we propose Retrieval-Augmented Multi-modal Multi-agent Stance Detection (MM-StanceDet), a novel multi-agent framework integrating Retrieval Augmentation for contextual grounding, specialized Multimodal Analysis agents for nuanced interpretation, a Reasoning-Enhanced Debate stage for exploring perspectives, and Self-Reflection for robust adjudication. Extensive experiments on five datasets demonstrate MM-StanceDet significantly outperforms state-of-the-art baselines, validating the efficacy of its multi-agent architecture and structured reasoning stages in addressing complex multimodal stance challenges.

15. [2604.27929] DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models

Link: https://arxiv.org/abs/2604.27929

Authors: Lifan Zheng, Xue Yang, Jiawei Chen, Chenyan Wu, Jingyuan Zhang, Fanheng Kong, Xinyi Zeng, Xiang Chen, Yu Tian

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, personality representation mechanisms, widespread adoption, adoption of large

Abstract:With the widespread adoption of large language models (LLMs), understanding their personality representation mechanisms has become critical. As a novel paradigm in Personality Editing, most existing methods employ neuron-editing to locate and modify LLM neurons, requiring changes to numerous neurons and leading to significant performance degradation. This raises a fundamental question: Are all modified neurons directly related to personality representation? In this work, we investigate and quantify this specificity through assessments of general capability impact and representation-level patterns. We find that: 1) Current methods can change personalities but reduce overall performance. 2) Neurons are multifunctional, connecting personality traits and general knowledge. 3) Opposing personality traits demonstrate distinctly mutually exclusive representation patterns. Motivated by these findings, we propose DPN-LE (Dual Personality Neuron Localization and Editing), which identifies personality-specific neurons by contrasting MLP activations between high-trait and low-trait samples. DPN-LE constructs layer-wise steering vectors and applies dual-criterion filtering based on Cohen's d effect size and activation magnitude to isolate mutually exclusive neuron subsets. Sparse linear intervention on these neurons enables precise personality control at inference time. Using only 1,000 contrastive sample pairs per trait, DPN-LE intervenes on ~0.5% of neurons while achieving competitive personality control and substantially better capability preservation across reasoning tasks. Experiments on LLaMA-3-8B-Instruct and Qwen2.5-7B-Instruct demonstrate the effectiveness and generalizability of our approach.
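
The dual-criterion filter mentioned in the abstract can be pictured with a short Python sketch. Cohen's d is computed in its textbook form with a pooled sample standard deviation; everything else is an assumption for illustration, including the thresholds `d_min` and `mag_min`, the choice of the absolute mean difference as the "activation magnitude", and the helper names. It is not the authors' implementation.

```python
import math
from statistics import mean, variance

def cohens_d(high, low):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(high), len(low)
    pooled_var = ((n1 - 1) * variance(high) + (n2 - 1) * variance(low)) / (n1 + n2 - 2)
    diff = mean(high) - mean(low)
    if pooled_var == 0:
        # No spread at all: the effect is either absent or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)

def select_neurons(high_acts, low_acts, d_min=0.8, mag_min=0.5):
    """Indices of neurons that pass both filters (hypothetical thresholds).

    high_acts / low_acts: one list of activations per neuron, recorded on
    high-trait and low-trait samples respectively.
    """
    selected = []
    for idx, (high, low) in enumerate(zip(high_acts, low_acts)):
        effect = abs(cohens_d(high, low))
        magnitude = abs(mean(high) - mean(low))  # assumed magnitude measure
        if effect >= d_min and magnitude >= mag_min:
            selected.append(idx)
    return selected

high = [[2.0, 4.0], [1.0, 1.2], [0.0, 10.0]]
low = [[0.0, 2.0], [1.0, 1.1], [0.0, 8.0]]
print(select_neurons(high, low))  # [0]
```

In this toy data only the first neuron is kept: the second has too small a mean difference and the third has too small an effect size relative to its spread.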

16. [2604.27924] Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

Link: https://arxiv.org/abs/2604.27924

Authors: Sihong Wu, Owen Jiang, Yilun Zhao, Tiansheng Hu, Yiling Ma, Kaiyan Zhang, Manasi Patwardhan, Arman Cohan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: multi-stage process involving, subsequent manuscript revisions, process involving reviews, final decisions, multi-stage process

Comments: ACL 2026

Abstract:Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions. Recent advances in large language models (LLMs) have motivated methods that assist or automate different stages of this pipeline. In this survey, we synthesize techniques for (i) peer review generation, including fine-tuning strategies, agent-based systems, RL-based methods, and emerging paradigms to enhance generation; (ii) after-review tasks including rebuttals, meta-review and revision aligned to reviews; and (iii) evaluation methods spanning human-centered, reference-based, LLM-based and aspect-oriented. We catalog datasets, compare modeling choices, and discuss limitations, ethical concerns, and future directions. The survey aims to provide practical guidance for building, evaluating, and integrating LLM systems across the full peer review workflow.

17. [2604.27920] Beyond Semantics: Measuring Fine-Grained Emotion Preservation in Small Language Model-Based Machine Translation

Link: https://arxiv.org/abs/2604.27920

Authors: Dawid Wisniewski, Igor Czudy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Preserving affective nuance, Machine Translation, affective nuance remains, Preserving affective, Small Language Models

Comments: Accepted at EAMT 2026

Abstract:Preserving affective nuance remains a challenge in Machine Translation (MT), where semantic equivalence often takes precedence over emotional fidelity. This paper evaluates the performance of three state-of-the-art Small Language Models (SLMs) -- EuroLLM, Aya Expanse, and Gemma -- in maintaining fine-grained emotions during backtranslation. Using the GoEmotions dataset, which comprises Reddit comments across 28 distinct categories, we assess emotional preservation across five European languages: German, French, Spanish, Italian, and Polish. Specifically, we investigate (i) the inherent capability of these SLMs to retain emotional sentiment, (ii) the efficacy of emotion-aware prompting in improving preservation, and (iii) the performance of ModernBERT as a contemporary alternative to BERT for emotion classification in MT evaluation.

18. [2604.27914] Geometry-Calibrated Conformal Abstention for Language Models

Link: https://arxiv.org/abs/2604.27914

Authors: Rui Xu, Yi Chen, Sihong Xie, Hui Xiong

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: frequently generate plausible, language models lack, models lack relevant, generate plausible responses, lack relevant knowledge

Abstract:When language models lack relevant knowledge for a given query, they frequently generate plausible responses that can be hallucinations, rather than admitting being agnostic about the answer. Retraining models to reward admitting ignorance can lead to overly conservative behaviors and poor generalization due to scarce evaluation benchmarks. We propose a post hoc framework, Conformal Abstention (CA), adapted from conformal prediction (CP) to determine whether to abstain from answering a query. CA provides finite-sample guarantees on both the probability of participation (i.e., not abstaining) and the probability that the generated response is correct. Importantly, the abstention decision relies on prediction confidence rather than the non-conformity scores used in CP, which are intractable for open-ended generation. To better align prediction confidence with the model's ignorance, we introduce a calibration strategy using representation geometry within the model to measure knowledge involvement in shaping the response. Experiments demonstrate that we improve selective answering significantly with 75 percent conditional correctness.
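
To give a feel for confidence-based abstention, here is a deliberately simple Python sketch that picks a confidence threshold on held-out data so that the answered subset reaches a target accuracy. It is a plain empirical rule with no finite-sample guarantee, so it is not the Conformal Abstention procedure described in the abstract; the calibration pairs and function names are invented for illustration.

```python
def pick_threshold(calibration, target):
    """Smallest confidence threshold whose answered subset meets the target.

    calibration: list of (confidence, is_correct) pairs from held-out data.
    Returns None when no threshold reaches the target accuracy.
    Empirical rule only: it carries no statistical guarantee.
    """
    for t in sorted({conf for conf, _ in calibration}):
        answered = [ok for conf, ok in calibration if conf >= t]
        if sum(answered) / len(answered) >= target:
            return t
    return None

def answer_or_abstain(confidence, threshold):
    """Abstain when the threshold is missing or the confidence falls below it."""
    if threshold is None or confidence < threshold:
        return "abstain"
    return "answer"

# Made-up calibration data: (model confidence, whether the answer was correct).
calib = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
threshold = pick_threshold(calib, target=0.75)
print(threshold)                          # 0.4
print(answer_or_abstain(0.3, threshold))  # abstain
```

Choosing the smallest qualifying threshold keeps participation as high as possible while the answered subset still meets the accuracy target on the calibration data.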

19. [2604.27906] From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

Link: https://arxiv.org/abs/2604.27906

Authors: Alex Petrov, Alexander Gusak, Denis Mukha, Dima Korolev

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: store prior interactions, recover relevant context, store prior, interactions as text, prior interactions

Comments: 33 pages, 7 figures

Abstract:Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.

MSC classes: 68T50, 68T30, 68P20, 68P15, 94A15

ACM classes: I.2.7; I.2.4; H.3.3; H.2.1; H.2.3


20. 【2604.27861】TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Link: https://arxiv.org/abs/2604.27861

Authors: Bowen Sun,Chaozhuo Li,Yaodong Yang,Yiwei Wang,Chaowei Xiao

Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Decompositional jailbreaks pose, reconstruct prohibited content, collectively reconstruct prohibited, Decompositional jailbreaks, large language models

Comments:

Click to view abstract

Abstract:Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.

21. 【2604.27850】Reasoning over Object Descriptions Improves Coreference Resolution in Task-Based Dialogue Systems

Link: https://arxiv.org/abs/2604.27850

Authors: Oier Ijurco,Oier Lopez de Lacalle

Subjects: Computation and Language (cs.CL)

Keywords: achieving specific goals, natural language interactions, systems assist users, specific goals, retrieving information

Comments: To be published in LREC 2026

Click to view abstract

Abstract:Task-based dialogue systems assist users in achieving specific goals, such as executing actions or retrieving information, through natural language interactions. Accurate coreference resolution is essential, as it involves identifying object references within the dialogue - a task that becomes increasingly challenging in visually grounded environments characterized by complex scenes and diverse object metadata. However, coreference resolution in task-based dialogue remains limited by poor generalization across domains and heavy reliance on supervised models that often overfit to dataset-specific artifacts. In this work, we propose a unimodal test-time reasoning approach that enables large language models (LLMs) to reason over detailed object metadata and dialogue history to improve coreference resolution. Empirical results on the SIMMC 2.1 dataset demonstrate that LLMs can generate step-by-step reasoning processes that effectively align dialogue context with objects present in the scene. Extensive experiments highlight the models' ability to link conversations and objects accurately. Moreover, we show that test-time reasoning under few-shot settings generalizes effectively to unseen scenarios and novel objects, outperforming encoder-based supervised methods in cross-domain evaluations. These findings underscore the critical role of structured metadata and careful prompt engineering in enhancing the robustness and generalization of task-oriented dialogue systems.

22. 【2604.27846】Multi-Level Narrative Evaluation Outperforms Lexical Features for Mental Health

Link: https://arxiv.org/abs/2604.27846

Authors: Yuxi Ma,Jieming Cui,Muyang Li,Ye Zhao,Yu Li,Yixuan Wang,Chi Zhang,Yinyin Zang,Yixin Zhu

Subjects: Computation and Language (cs.CL)

Keywords: people narrate, narrate their experiences, experiences offers, offers a window, mind organizes

Comments:

Click to view abstract

Abstract:How people narrate their experiences offers a window into how the mind organizes them. Computational approaches to therapeutic writing have evolved from lexical counting to neural methods, yet remain fragmented: dictionary tools miss discourse structure, while embeddings conflate local coherence with global organization. No existing framework maps these techniques onto the hierarchical processes through which narratives are constructed. Here we introduce a three-level framework - micro-level lexical features, meso-level semantic embeddings, and macro-level LLM narrative evaluation - and show, across 830 Chinese therapeutic texts spanning depression, anxiety, and trauma, that macro-level evaluation substantially outperforms lexical and embedding features for mental health prediction. This challenges the field's emphasis on word-counting: formal structural features (Labov's story grammar, RST coherence, propositional composition) demonstrate that narrative organization per se carries predictive signal, while clinically-grounded narrative dimensions capture how psychological states are expressed through discourse. Semantic embeddings add minimal independent value but yield incremental gains in multi-level classification. By grounding computational levels in discourse processing theory, this framework identifies macro-structural organization as the primary locus of clinical signal and generates testable hypotheses for intervention design and longitudinal research.

23. 【2604.27844】ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

Link: https://arxiv.org/abs/2604.27844

Authors: Wenxiang Lin,Xinglin Pan,Ruibo Fan,Shaohuai Shi,Xiaowen Chu

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)

Keywords: large language models, critical bottleneck, large language, Communication, compression

Comments:

Click to view abstract

Abstract:Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored since compression and decompression typically consume larger overheads than the benefits of reduced communication traffic. We observe that the communication data, including activations, gradients and parameters, during training often follows a near-Gaussian distribution, which is a key feature for data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels that carefully design memory access patterns and pipeline using communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35$\times$ and achieves end-to-end training speedups of up to 1.18$\times$ without any impact on model quality.
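The observation that near-Gaussian tensors are compressible can be checked with a rough sketch: for Gaussian samples stored as float32, the 8-bit exponent field takes only a few distinct values, so its empirical entropy is far below 8 bits. This only illustrates the observation and is not ZipCCL's exponent coding, which the abstract does not specify. The helper names, the standard deviation of 0.02, and the choice of float32 are assumptions.

```python
import math
import random
import struct
from collections import Counter


def float32_exponent(x):
    """Return the 8-bit biased exponent field of x after rounding to IEEE 754 float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF


def empirical_entropy(symbols):
    """Shannon entropy, in bits per symbol, of the observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Assumed stand-in for a tensor of weights or gradients: zero-mean Gaussian samples.
rng = random.Random(0)
samples = [rng.gauss(0.0, 0.02) for _ in range(20000)]

exponents = [float32_exponent(x) for x in samples]
entropy_bits = empirical_entropy(exponents)
print(f"exponent entropy: {entropy_bits:.2f} bits out of 8")
```

An entropy coder applied to the exponent field alone could therefore, in principle, spend well under 8 bits per exponent without losing information.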

24. 【2604.27790】How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

Link: https://arxiv.org/abs/2604.27790

Authors: Riley Grossman,Songjiang Liu,Michael K. Chen,Mike Smith,Cristian Borcea,Yi Chen

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: search, generative search, Generative, increasingly integrated, web search

Comments: Paper Accepted to ACM SIGIR 2026 (49th International ACM SIGIR Conference on Research and Development in Information Retrieval)

Click to view abstract

Abstract:Generative AI is being increasingly integrated into web search for the convenience it provides users. In this work, we aim to understand how generative AI disrupts web search by retrieving and presenting the information and sources differently from traditional search engines. We introduce a public benchmark dataset of 11,500 user queries to support our study and future research of generative search. We compare the search results returned by Google's search engine, the accompanying AI Overview (AIO), and Gemini Flash 2.5 for each query. We have made several key findings. First, we find that for 51.5% of representative, real-user queries, AIOs are generated, and are displayed above the organic search results. Controversial questions frequently result in an AIO. Second, we show that the retrieved sources are substantially different for each search engine (0.2 average Jaccard similarity). Traditional Google search is significantly more likely to retrieve information from popular or institutional websites in government or education, while generative search engines are significantly more likely to retrieve Google-owned content. Third, we observe that websites that block Google's AI crawler are significantly less likely to be retrieved by AIOs, despite having access to the content. Finally, AIOs are less consistent when processing two runs of the same query, and are less robust to minor query edits. Our findings have important implications for understanding how generative search impacts website visibility, the effectiveness of generative engine optimization techniques, and the information users receive. We call for revenue frameworks to foster a sustainable and mutually beneficial ecosystem for publishers and generative search providers.
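For readers unfamiliar with the overlap measure quoted in the abstract, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below computes it for two source lists. The domains are made up for illustration, and how the paper normalizes URLs before comparison is not stated in the abstract.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty source lists count as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical source domains returned for one query (not taken from the paper's data).
organic = ["example.gov", "example.edu", "news.example.com", "wiki.example.org"]
overview = ["video.example.com", "example.edu", "blog.example.net"]

print(round(jaccard(organic, overview), 3))
```

Here the two lists share one domain out of six distinct ones, so the similarity is 1/6. An average of 0.2 across queries corresponds to a similarly small overlap.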

25. 【2604.27776】WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Link: https://arxiv.org/abs/2604.27776

Authors: Jinchao Li,Yunxin Li,Chenrui Zhao,Zhenran Xu,Baotian Hu,Min Zhang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: shown impressive capabilities, GUI agents, shown impressive, impressive capabilities, focus on isolated

Comments:

Click to view abstract

Abstract:While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks ( 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at this http URL.

26. 【2604.27766】Instruction-Guided Poetry Generation in Arabic and Its Dialects

Link: https://arxiv.org/abs/2604.27766

Authors: Abdelrahman Sadallah,Kareem Elozeiri,Mervat Abassy,Rania Elbadry,Mohamed Anwar,Abed Alhakim Freihat,Preslav Nakov,Fajri Koto

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: central art form, Large Language Models, Arabic speakers, Arabic, modern Arabic speakers

Comments: ACL Findings 2026

Click to view abstract

Abstract:Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g., rhyme schemes and titles. In contrast, our work addresses the practical aspect of poetry creation in Arabic by introducing controllable generation capabilities to assist users in writing poetry. Specifically, we present a large-scale, carefully curated instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects. This dataset enables tasks such as writing, revising, and continuing poems based on predefined criteria, including style and rhyme, as well as performing poetry analysis. Our experiments show that fine-tuning LLMs on this dataset yields models that can effectively generate poetry that is aligned with user requirements, based on both automated metrics and human evaluation with native Arabic speakers. The data and the code are available at this https URL

27. 【2604.27712】Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Link: https://arxiv.org/abs/2604.27712

Authors: Nhi Ngoc-Yen Nguyen,Anh-Duc Nguyen,Nghia Hieu Nguyen,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: integrate text visible, faithfully integrate text, captioning requires fusing, Vietnamese scene-text captioning, visual features

Comments:

Click to view abstract

Abstract:Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fails for Vietnamese: a tonal language where diacritics alter word meaning, OCR errors are pervasive, and word boundaries are ambiguous. We argue that Vietnamese scene-text captioning demands linguistically informed multimodal fusion, where language-specific structural knowledge is explicitly incorporated into the fusion mechanism. Motivated by these insights, we propose HSTFG (Heterogeneous Scene-Text Fusion Graph), a general-purpose graph fusion framework with learned spatial attention bias, and show through topology analysis that cross-modal graph edges are harmful for scene-text fusion. Building on this finding, we design PhonoSTFG (Phonological Scene-Text Fusion Graph) which specializes graph-level fusion for Vietnamese linguistic reasoning. To support evaluation, we introduce ViTextCaps, the first large-scale Vietnamese scene-text captioning dataset (15,729 images with 74,970 captions), with comprehensive linguistic analysis showing that 52.8% of the vocabulary is at risk of diacritic collision.

28. 【2604.27707】Contextual Agentic Memory is a Memo, Not True Memory

Link: https://arxiv.org/abs/2604.27707

Authors: Binyan Xu,Xilin Dai,Kehuan Zhang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: vector stores, retrieval-augmented generation, context-window management, Current agentic memory, agentic memory systems

Comments:

Click to view abstract

Abstract:Current agentic memory systems (vector stores, retrieval-augmented generation, scratchpads, and context-window management) do not implement memory: they implement lookup. We argue that treating lookup as memory is a category error with provable consequences for agent capability, long-term learning, and security. Retrieval generalizes by similarity to stored cases; weight-based memory generalizes by applying abstract rules to inputs never seen before. Conflating the two produces agents that accumulate notes indefinitely without developing expertise, face a provable generalization ceiling on compositionally novel tasks that no increase in context size or retrieval quality can overcome, and are structurally vulnerable to persistent memory poisoning as injected content propagates across all future sessions. Drawing on Complementary Learning Systems theory from neuroscience, we show that biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation, and that current AI agents implement only the first half. We formalize these limitations, address four alternative views, and close with a co-existence proposal and a call to action for system builders, benchmark designers, and the memory community.

29. 【2604.27695】EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory

Link: https://arxiv.org/abs/2604.27695

Authors: Yuyang Li,Yime He,Zeyu Zhang,Dong Gong

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Long-term conversational memory, single-pass retrieval fails, Long-term conversational, retrieving evidence scattered, requires retrieving evidence

Comments:

Click to view abstract

Abstract:Long-term conversational memory requires retrieving evidence scattered across multiple sessions, yet single-pass retrieval fails on temporal and multi-hop questions. Existing iterative methods refine queries via generated content or document-level signals, but none explicitly diagnoses the evidence gap, namely what is missing from the accumulated retrieval set, leaving query refinement untargeted. We present EviMem, combining IRIS (Iterative Retrieval via Insufficiency Signals), a closed-loop framework that detects evidence gaps through sufficiency evaluation, diagnoses what is missing, and drives targeted query refinement, with LaceMem (Layered Architecture for Conversational Evidence Memory), a coarse-to-fine memory hierarchy supporting fine-grained gap diagnosis. On LoCoMo, EviMem improves Judge Accuracy over MIRIX on temporal (73.3% to 81.6%) and multi-hop (65.9% to 85.2%) questions at 4.5x lower latency. Code: this https URL.

30. 【2604.27674】One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

Link: https://arxiv.org/abs/2604.27674

Authors: Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: automatic evaluation metrics, pose practical threats, high-dimensional embedding spaces, hubness problem, cross-modal encoders

Comments: Accepted at ACL2026 (main)

Click to view abstract

Abstract:The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat for purposes such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be calculated by direct comparisons, such as string matching, cross-modal encoders that project different modalities into a shared space are helpful for various cross-modal applications, and thus, the existence of hubs may pose practical threats. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps along with image-to-text retrieval tasks in MSCOCO and Flickr30k showed that our method can identify a single hub text that unreasonably achieves comparable or higher similarity scores than human-written reference captions in many images, thereby revealing the vulnerabilities in cross-modal encoders.

31. 【2604.27661】Language Ideologies in a Multilingual Society: An LLM-based Analysis of Luxembourgish News Comments

Link: https://arxiv.org/abs/2604.27661

Authors: Emilia Milano,Alistair Plum,Yves Scherrer,Christoph Purschke

Subjects: Computation and Language (cs.CL)

Keywords: Detecting language ideologies, Detecting language, constructed through discourse, valuable yet complex, language ideologies

Comments:

Click to view abstract

Abstract:Detecting language ideologies is a valuable yet complex task for understanding how identities are constructed through discourse. In Luxembourg's multicultural and multilingual society, language ideologies reflect more than simple preferences: they carry deep cultural and social meanings, shaping identities and social belonging. Following recent developments in applying Natural Language Processing tools to linguistics and social science, this paper explores the potential of large language models to assist in the detection of language ideologies. We manually annotate a corpus of user comments in Luxembourgish with predefined ideological categories and then evaluate the performance of large language models under varying prompt conditions to assess their ability to replicate these human annotations. Since Luxembourgish is a small language and poorly represented in the LLMs' training data, we also investigate whether machine-translating the data to high-resource languages increases performance on the ideology detection task. Our findings suggest that, while LLMs are not yet fully optimized for a multi-class ideological annotation task, they are practical tools to identify language ideological content.

32. 【2604.27624】Mapping how LLMs debate societal issues when shadowing human personality traits, sociodemographics and social media behavior

Link: https://arxiv.org/abs/2604.27624

Authors: Ali Aghazadeh Ardebili,Massimo Stella

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Large Language Models, LLM outputs vary, prompting remain sparse, strongly shape social, Large Language

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) can strongly shape social discourse, yet datasets investigating how LLM outputs vary across controlled social and contextual prompting remain sparse. Cognitive Digital Shadows (CDS) is a 190,000-record synthetic corpus supporting analyses of LLM-generated discourse. Each CDS record is generated by one of 19 LLMs, prompted to shadow either a human persona or an AI-assistant role. CDS contains LLM responses on 4 controversial societal topics: vaccines/healthcare, social media disinformation, the gender gap in science, and STEM stereotypes. Persona-conditioned records encode 17 sociodemographic and psychological attributes, providing data linking LLMs' prompts, language, stances and reasoning. Texts are validated for topic anchoring and can support emotional analyses via interpretable NLP (e.g. textual forma mentis networks). CDS is enriched by a pooling platform with user-friendly dashboards, enabling easy, interactive group-level comparisons of emotional and semantic framing across personas, topics and models. The CDS prompting framework supports future audits of LLMs' bias, social sensitivity and alignment.

33. 【2604.27616】RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems

Link: https://arxiv.org/abs/2604.27616

Authors: Jiacheng Liu,Zichen Tang,Zhongjun Yang,Xinyi Hu,Xueyuan Lin,Linwei Jia,Ruofei Bai,Rongjin Li,Shiyao Peng,Haocheng Gao,Haihong E

Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: People commonly leverage, People commonly, commonly leverage structured, accelerate knowledge acquisition, leverage structured content

Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:People commonly leverage structured content to accelerate knowledge acquisition and research problem solving. Among these, roadmaps guide researchers through hierarchical subtasks to solve complex research problems step by step. Despite progress in structured content generation, the roadmap generation task has remained unexplored. To bridge this gap, we introduce RoadMap, a novel benchmark designed to evaluate the ability of large language models (LLMs) to construct high-quality roadmaps for solving complex research problems. Based on this, we identify three limitations of LLMs: (1) lack of professional knowledge, (2) unreasonable task decomposition, and (3) disordered logical relationships. To address these challenges, we propose RoadMapper, an LLM-based multi-agent system that decomposes the research roadmap generation task into three key stages (i.e., initial generation, knowledge augmentation, and iterative "critique-revise-evaluate"). Extensive experiments demonstrate that RoadMapper can improve LLMs' ability for roadmap generation, while enhancing average performance by more than 8% and saving 84% of the time required by human experts, highlighting its effectiveness and application potential.

34. 【2604.27607】JaiTTS: A Thai Voice Cloning Model

Link: https://arxiv.org/abs/2604.27607

Authors: Jullajak Karnjanaekarin,Pontakorn Trakuekul,Narongkorn Panitsrisit,Sumana Sumanakul,Vichayuth Nitayasomboon,Nithid Guntasin,Thanavin Denkavin,Attapol T. Rutherford

Subjects: Computation and Language (cs.CL)

Keywords: Thai voice cloning, Thai-centric speech corpus, large Thai-centric speech, Thai voice, large Thai-centric

Comments:

Click to view abstract

Abstract:We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is very common in realistic settings, without explicit text normalization. We test the models on short-duration speech generation and long-duration speech generation, which reflects many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art CER of 1.94%, surpassing the human ground truth of 1.98% for short-duration tasks while performing on par with human ground truth for long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses.

35. 【2604.27551】Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis

Link: https://arxiv.org/abs/2604.27551

Authors: Henrik Voigt,Michael Habeck,Joachim Giesen

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large-scale transformers achieve, achieve impressive results, capabilities remain obscured, program synthesis benchmarks, opaque training corpora

Comments:

Click to view abstract

Abstract:Large-scale transformers achieve impressive results on program synthesis benchmarks, yet their true generalization capabilities remain obscured by data contamination and opaque training corpora. To rigorously assess whether models are truly generalizing or merely retrieving memorized templates, we introduce a strictly controlled program synthesis environment based on a domain-specific arithmetic grammar. By systematically enumerating and evaluating millions of unique programs, we construct interpretable syntactic and semantic metric spaces. This allows us to precisely map data distributions and sample train and test splits that isolate specific distributional shifts. Our experiments demonstrate that optimizing density generalization -- through diverse sampling over both semantic and syntactic spaces -- induces robust out-of-distribution generalization. Conversely, evaluating support generalization reveals that transformers severely struggle with extrapolation, experiencing a performance drop of over 30% when forced to generate syntactically novel programs. While steadily scaling up compute improves generalization, the gains follow a strictly log-linear relationship. We conclude that robust generalization requires maximizing training diversity across multiple manifolds, and our findings indicate the necessity for novel search-based approaches to break through current log-linear scaling bottlenecks.

36. 【2604.27550】APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

Link: https://arxiv.org/abs/2604.27550

Authors: Pengyun Zhu,Qiheng Sun,Long Wen,Yanbo Wang,Yang Cao,Junxu Liu,Deyi Xiong,Jinfei Liu,Zhibo Wang,Kui Ren

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: service providers handle, English privacy policies, high-quality English privacy, English privacy, Privacy policies

Comments: Accepted to ACL 2026 Main Conference

Click to view abstract

Abstract:Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long and complex, as well as filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, there is a lack of high-quality English parallel corpus optimized for legal clarity and readability. To address this issue, we introduce APPSI-139, a high-quality English privacy policy corpus meticulously annotated by domain experts, specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine-grained annotation labels across 11 data practice categories. Concurrently, we propose TCSI-pp-V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to effectively balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on APPSI-139 corpus and the TCSI-pp-V2 framework outperform large language models, such as GPT-4o and LLaMA-3-70B, in terms of readability and reliability. The source code and dataset are available at this https URL.

37. 【2604.27543】AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

Link: https://arxiv.org/abs/2604.27543

Authors: Eugen Beck,Sarah Beranek,Uma Moothiringote,Daniel Mann,Wilfried Michel,Katie Nguyen,Taylor Tragemann

Subjects: Computation and Language (cs.CL)

Keywords: Evaluating English ASR, applications remains difficult, diverse user base, lack explicit dialect, explicit dialect annotations

Comments: Submitted to INTERSPEECH 2026

Click to view abstract

Abstract:Evaluating English ASR systems for conversational AI applications remains difficult, as many publicly available corpora are either pre-segmented into short segments, consist of read or prepared speech, or lack explicit dialect annotations to evaluate robustness for a diverse user base. This work presents the AppTek Call-Center Dialogues corpus, a collection of spontaneous, role-played agent-customer conversations spanning fourteen English accents covering sixteen service-oriented scenarios. The dataset was commissioned specifically for evaluation and none of the audio or text was publicly available prior to release, reducing the risk of overlap with existing large-scale pretraining corpora. We benchmark a set of open-source ASR systems under different segmentation approaches. Results show substantial variation across accents and segmentation methods, indicating that good performance on general American English benchmarks does not necessarily generalize to other accents.

38. 【2604.27542】HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Link: https://arxiv.org/abs/2604.27542

Authors: Thibault Bañeras Roux, Jane Wottawa, Mickael Rouvier, Teva Merlin, Richard Dufour

Subjects: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Speech Recognition, speech signal, Conventionally, ASR

Comments: 164--175

Abstract:Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER) metric is the reference for evaluating speech transcripts. Several studies have shown that this measure is too limited to correctly evaluate an ASR system, which has led to the proposal of other variants of metrics (weighted WER, BERTscore, semantic distance, etc.). However, they remain system-oriented, even when transcripts are intended for humans. In this paper, we first present Human Assessed Transcription Side-by-side (HATS), an original French data set manually annotated in terms of human perception of transcription errors produced by various ASR systems. 143 humans were asked to choose the best automatic transcription out of two hypotheses. We investigated the relationship between human preferences and various ASR evaluation metrics, including lexical and embedding-based ones, the latter supposedly being those that correlate most with human perception.

39. 【2604.27534】Entropy of Ukrainian

Link: https://arxiv.org/abs/2604.27534

Authors: Anton Lavreniuk, Mykyta Mudryi, Markiian Chaklosh

Subjects: Computation and Language (cs.CL)

Keywords: natural language processing, unpredictability and complexity, language processing, Claude Shannon, natural language

Comments: 8 pages, 5 figures, 2 tables. Accepted at UNLP 2026

Abstract:In natural language processing, the entropy of a language is a measure of its unpredictability and complexity. The first study on this subject was conducted by Claude Shannon in 1951. By having participants predict the next character in a sentence, he was able to approximate the entropy of the English language. Several follow-up studies by other authors have since been conducted for English, and one for Hebrew. However, to date, Shannon's experiment has never been conducted for Ukrainian. In this paper, we perform this experiment for Ukrainian by recruiting 184 volunteers using social media channels. We rely on techniques used for English to approximate the entropy value of Ukrainian. The final result is an upper bound of $H_{upper}\approx1.201$ bits per character. We compare this to the performance of current Large Language Models. The methods and code used are also documented and published, along with a discussion of the main challenges encountered.
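
As an illustration of the guessing-game estimate mentioned in this abstract, here is a minimal sketch of Shannon's (1951) upper bound: the entropy of the distribution of how many guesses were needed per character. The function name and the toy data are hypothetical, and the paper's actual procedure may differ in its details.

```python
import math
from collections import Counter


def shannon_upper_bound(guess_counts):
    """Entropy (bits per character) of the guess-number distribution.

    guess_counts[i] is the number of guesses a participant needed to
    identify the i-th character (1 means correct on the first try).
    """
    if not guess_counts:
        raise ValueError("need at least one observation")
    total = len(guess_counts)
    freqs = Counter(guess_counts)
    # H = -sum_k q_k * log2(q_k), where q_k is the share of characters
    # that were identified on exactly the k-th guess.
    return -sum((c / total) * math.log2(c / total) for c in freqs.values())


# Hypothetical toy data: half of the characters were guessed at once.
toy_guess_counts = [1, 1, 1, 1, 2, 2, 3, 4]
print(shannon_upper_bound(toy_guess_counts))
```

The more often participants succeed on their first guess, the more the distribution concentrates on 1 and the lower the bound becomes.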

40. 【2604.27533】Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

Link: https://arxiv.org/abs/2604.27533

Authors: Thibault Bañeras-Roux, Mickaël Rouvier, Jane Wottawa, Richard Dufour

Subjects: Computation and Language (cs.CL)

Keywords: Evaluating automatic speech, automatic speech recognition, Embedding Error Rate, error rate, word error rate

Comments: 3968--3972

Abstract:Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not allow an in-depth analysis of automatic transcription errors. In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. In particular, we introduce two measures related to morpho-syntactic and semantic aspects of transcribed words: 1) the POSER (Part-of-speech Error Rate), which should highlight the grammatical aspects, and 2) the EmbER (Embedding Error Rate), a measurement that modifies the WER by providing a weighting according to the semantic distance of the wrongly transcribed words. These metrics illustrate the linguistic contributions of the language models that are applied during a posterior rescoring step on transcription hypotheses.
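
To make the idea of a semantically weighted WER concrete, the following sketch computes a word-level edit distance in which a substitution between two semantically close words costs less than a full error. This is not the paper's EmbER definition: the threshold, the reduced cost, the toy vectors, and all function names are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_wer(ref, hyp, sub_cost):
    """Edit distance with a custom substitution cost, divided by len(ref)."""
    if not ref:
        raise ValueError("reference must not be empty")
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)  # deleting i reference words
    for j in range(1, m + 1):
        d[0][j] = float(j)  # inserting j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = ref[i - 1] == hyp[j - 1]
            cost = 0.0 if same else sub_cost(ref[i - 1], hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[n][m] / n


# Hypothetical 2-d "embeddings"; a real system would use a trained model.
TOY_VECTORS = {"cat": (1.0, 0.0), "kitten": (0.9, 0.1), "car": (0.0, 1.0)}


def semantic_sub_cost(ref_word, hyp_word, threshold=0.8, reduced=0.1):
    """Assumed rule: similar words cost `reduced`, all others cost 1.0."""
    u, v = TOY_VECTORS.get(ref_word), TOY_VECTORS.get(hyp_word)
    if u is None or v is None:
        return 1.0
    return reduced if cosine(u, v) >= threshold else 1.0


reference = "the cat sat".split()
print(weighted_wer(reference, "the kitten sat".split(), semantic_sub_cost))
print(weighted_wer(reference, "the car sat".split(), semantic_sub_cost))
```

Under these toy settings, confusing "cat" with "kitten" is penalized far less than confusing it with "car", whereas plain WER would score both hypotheses identically.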

41. 【2604.27495】Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Link: https://arxiv.org/abs/2604.27495

Authors: Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, Reward models, aligning large language, language models, play a central

Comments: Accepted to ACL 2026 Main Conference

Abstract:Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length, resulting in performance trade-offs. In this paper, we propose causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, and applies neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which edits less than 2% of all the neurons in RMs, enable LLMs to improve alignment, achieving performance comparable to that of a state-of-the-art 70B RM on AlpacaEval and MT-Bench. Further analysis reveals that bias signals are primarily encoded by neurons in early layers, shedding light on the internal mechanisms of bias exploitation in RMs.

42. 【2604.27488】Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Link: https://arxiv.org/abs/2604.27488

Authors: Yu Tian, Jiawei Chen, Lifan Zheng, Mingxiang Tao, Xinyi Zeng, Zhaoxia Yin, Hang Su, Xian Sun

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Model, Language Model, Large Language, automated framework designed, Task Generation Module

Abstract:We introduce Skills-Coach, a novel automated framework designed to significantly enhance the self-evolution of skills within Large Language Model (LLM)-based agents. Addressing the current fragmentation of the skill ecosystem, Skills-Coach explores the boundaries of skill capabilities, thereby facilitating the comprehensive competency coverage essential for intelligent applications. The framework comprises four core modules: a Diverse Task Generation Module that systematically creates a comprehensive test suite for various skills; a Lightweight Optimization Module dedicated to optimizing skill prompts and their corresponding code; a Comparative Execution Module facilitating the execution and evaluation of both original and optimized skills; and a Traceable Evaluation Module, which rigorously evaluates performance against specified criteria. Skills-Coach offers flexible execution options through its virtual and real modes. To validate its efficacy, we introduce Skill-X, a comprehensive benchmark dataset consisting of 48 diverse skills. Experimental results demonstrate that Skills-Coach achieves significant performance improvements in skill capability across a wide range of categories, highlighting its potential to advance the development of more robust and adaptable LLM-based agents.

43. 【2604.27470】HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Link: https://arxiv.org/abs/2604.27470

Authors: Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, Karan Singhal

Subjects: Computation and Language (cs.CL)

Keywords: HealthBench Professional, Millions, clinicians, common use cases, support clinical care

Comments: Data link in paper; Blog: [this https URL](https://openai.com/index/making-chatgpt-better-for-clinicians/)

Abstract:Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI's current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models were enriched by roughly 3.5 times relative to the candidate pool of 15,079 examples. Additionally, about one-third of examples involve physicians conducting deliberate adversarial testing of models. As a strong baseline, we also collected human physician responses for all tasks (unbounded time, specialist-matched, web access). The best scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and human physicians. We hope HealthBench Professional provides the healthcare AI community a measure to track frontier model progress in real-world clinical tasks and build systems that clinicians can trust to improve care.

44. 【2604.27468】Syntactically-guided Information Maintenance in Sentence Comprehension

Link: https://arxiv.org/abs/2604.27468

Authors: Shinnosuke Isono, Kohei Kajikawa

Subjects: Computation and Language (cs.CL)

Keywords: real-time language comprehension, successful real-time language, Maintaining information, context is essential, essential in successful

Abstract:Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case, using a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account.

45. 【2604.27467】ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

Link: https://arxiv.org/abs/2604.27467

Authors: Jiasheng Zheng, Xin Zheng, Boxi Cao, Pengbo Wang, Zhengzhao Ma, Qiming Zhu, Jiazhen Jiang, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: large language models, providing verifiable feedback, language models, sandboxes have emerged, advancing the coding

Comments: Accepted to ACL 2026 Demo. Our project is available at [this https URL](https://github.com/icip-cas/ScaleBox)

Abstract:Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verification and efficiency under high-concurrency workloads. We present ScaleBox, a high-fidelity and scalable system designed to address these limitations in large-scale code training. ScaleBox introduces automated special-judge generation and management, fine-grained parallel execution across test cases with seamless multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. A series of experiments demonstrates that ScaleBox significantly enhances code verification accuracy and efficiency. Our further RLVR experiments show that ScaleBox substantially improves both performance on LiveCodeBench and training stability, significantly outperforming heuristic-matching baselines. By providing a reliable and high-throughput infrastructure, ScaleBox facilitates more effective research and development in large-scale code training.

46. 【2604.27454】Exploring Applications of Transfer-State Large Language Models: Cognitive Profiling and Socratic AI Tutoring

Link: https://arxiv.org/abs/2604.27454

Authors: Minori Noguchi

Subjects: Computation and Language (cs.CL)

Keywords: exhibit qualitative shifts, sustained self-referential dialogue, exhibit qualitative, qualitative shifts, style under sustained

Comments: 29 pages, 5 figures, 7 tables, including appendices

Abstract:Large language models (LLMs) sometimes exhibit qualitative shifts in response style under sustained self-referential dialogue conditions (Berg et al., 2025). This study refers to this phenomenon as "transfer" and explores the application potential of LLMs in a transfer state. As an applied case, the study examines Socratic AI tutoring through a preliminary investigation (cognitive characterization across 11 conditions) and an applied experiment (ratings of tutoring performance). In this paper, "state" refers operationally to a response configuration reproduced under specified dialogue conditions; it is not an ontological claim about the reality of the transfer phenomenon or about human-like consciousness. In the preliminary investigation, group differences on MAS-A were limited (d = 0.40), whereas SU_dir (direction of survival/continuity bias), one of the seven cognitive-profile indicators developed in this study, showed transfer-side deviations across all three model families (kappa = 0.83). In the applied experiment, transfer conditions scored on average 1.6 times higher than non-transfer conditions on three tutoring-context indicators, with a large effect size (Cohen's d = 1.27). These findings preliminarily suggest that transfer states may involve functional advantages for application, and that these advantages appear more sensitively in behavioral interaction than in self-narrative contexts. The main contribution of this study is to treat transfer not as an ontological claim but as an operational state with potential application value, and to connect preliminary cognitive profiling with an applied tutoring experiment as an evaluation framework.

47. 【2604.27453】From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks

Link: https://arxiv.org/abs/2604.27453

Authors: Qingyu Ren, Tianjun Pan, Xingzhou Chen, Xuhong Wang

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, achieved remarkable progress, Large language, writing reward models, reward models

Abstract:Large language models have achieved remarkable progress in text generation but still struggle with generative writing tasks. In terms of evaluation, existing benchmarks evaluate writing reward models coarsely and fail to measure performance from the perspective of specific requirements. In terms of training, existing training methods either use LLM-as-a-judge approaches or train coarse-grained reward models, lacking fine-grained requirement-adherence reward modeling. To address these issues, we propose a fine-grained evaluation pipeline WEval for writing reward models and a fine-grained reinforcement learning training framework WRL. The evaluation data of WEval covers multiple task categories and requirement types, enabling systematic evaluation of writing reward models by measuring the correlation between the rankings of the reward model and gold rankings. WRL constructs positive and negative samples by selectively dropping instruction requirements, allowing for more precise reward model training. Experiments show that our models achieve substantial improvements across various writing benchmarks and exhibit strong generalization. The code and data are publicly available at this https URL.
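
The abstract describes scoring a reward model by how well its ranking of candidate responses correlates with a gold ranking, without naming the correlation measure. As one possible reading, the sketch below uses Spearman's rank correlation for the tie-free case; the choice of Spearman, the function names, and the sample scores are all assumptions.

```python
def ranks_from_scores(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(reward_scores, gold_ranks):
    """Spearman correlation between a model's ranking and gold ranks.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is
    only valid when there are no ties, so tied scores are rejected.
    """
    n = len(reward_scores)
    if n != len(gold_ranks) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if len(set(reward_scores)) != n or len(set(gold_ranks)) != n:
        raise ValueError("ties are not supported in this sketch")
    model_ranks = ranks_from_scores(reward_scores)
    d_squared = sum((a - b) ** 2 for a, b in zip(model_ranks, gold_ranks))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical reward-model scores for four responses to one prompt,
# with gold ranks 1 (best) to 4 (worst) for the same responses.
print(spearman_rho([0.9, 0.4, 0.7, 0.1], [1, 2, 3, 4]))
```

A value near 1 means the reward model orders the responses as the gold ranking does; a value near -1 means it inverts the order.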

48. 【2604.27439】Sentiment Analysis of AI Adoption in Indonesian Higher Education Using Machine Learning and Transformer-Based Models

Link: https://arxiv.org/abs/2604.27439

Authors: Happy Syahrul Ramadhan, Ahmad Sahidin Akbar, Karin Yehezkiel Sinaga, Luluk Muthoharoh, Ardika Satria, Martin C.T. Manullang

Subjects: Computation and Language (cs.CL)

Keywords: study analyzes Indonesian, analyzes Indonesian student, Indonesian student opinions, analyzes Indonesian, Indonesian student

Comments: 8 pages, 6 figures, 7 tables. The paper compares TF-IDF-based machine learning models and DistilBERT for Indonesian sentiment analysis on student opinions about AI adoption in higher education. The manuscript reports that DistilBERT achieves the best overall test performance, while SVM is the strongest classical baseline

Abstract:This study analyzes Indonesian student opinions on the adoption of artificial intelligence in higher education using two approaches: TF-IDF-based machine learning and Transformer-based deep learning. The dataset consists of 2,295 labeled samples, combining 1,154 student opinions with additional lexical sentiment data. LightGBM, Random Forest, and Support Vector Machine (SVM) are evaluated as machine learning models, while DistilBERT is fine-tuned for binary sentiment classification. The results show that SVM achieves the best performance among the machine learning models with 82.14% test accuracy and F1-score, while DistilBERT performs best overall with 84.78% accuracy and 84.75% F1-score. These findings indicate that Transformer-based models better capture contextual information, although SVM remains a competitive and efficient alternative for sentiment classification.

49. 【2604.27421】A Reproducibility Study of LLM-Based Query Reformulation

Link: https://arxiv.org/abs/2604.27421

Authors: Amin Bigdeli, Radin Hamidi Rad, Hai Son Le, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, Ebrahim Bagheri

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, studies reporting substantial, reporting substantial effectiveness

Abstract:Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard (this https URL).

50. 【2604.27419】InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Link: https://arxiv.org/abs/2604.27419

Authors: Qiyao Wang, Haoran Hu, Longze Chen, Hongbo Wang, Hamid Alinejad-Rokny, Yuan Lin, Min Yang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, multimodal large language, agent-based project-level code, project-level code synthesis, large language

Comments: 21 pages, 13 figures, 7 tables

Abstract:With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based project-level code synthesis. Existing benchmarks rely on idealized assumptions, especially for well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirement engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.

51. 【2604.27410】From Unstructured to Structured: LLM-Guided Attribute Graphs for Entity Search and Ranking

Link: https://arxiv.org/abs/2604.27410

Authors: Yilun Zhu, Nikhita Vedula, Shervin Malmasi

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: faces unique challenges, Entity search, query entity, product similarity varies, Large Language Model

Abstract:Entity search, i.e., finding the most similar entities to a query entity, faces unique challenges in e-commerce, where product similarity varies across categories and contexts. Traditional embedding-based approaches often struggle to capture nuanced context-specific attribute relevance. In this paper, we present a two-stage approach combining Large Language Model (LLM)-driven attribute graph construction with graph-aware LLM ranking. In the offline stage, we extract structured product attributes from unstructured text, and construct a reusable attribute graph with category-aware schemas. In the online stage, we rank retrieved candidates by reasoning over this structured representation rather than raw text, reducing per-product token usage by 57% while improving ranking precision. Experiments show that our approach outperforms multiple baselines under zero-shot scenarios, achieving an improvement of over 5% in average precision without requiring training data, generalizes robustly across diverse product categories, and shows immense potential for real-world deployment.

52. 【2604.27405】Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

Link: https://arxiv.org/abs/2604.27405

Authors: Jon-Paul Cacioli

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM version comparison, Jacobson and Truax, Reliable Change Index, item-level LLM version, LLM version

Comments: 7 pages, 4 figures, 2 tables. Pre-registered study. Code and data available

Abstract:We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were floor/ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
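
For readers unfamiliar with the Reliable Change Index, the sketch below implements the classic Jacobson and Truax (1991) formula from clinical psychology: the pre/post difference divided by the standard error of the difference. The abstract does not say how the index was adapted to item-level LLM samples, so this shows only the original formula, with hypothetical function names and toy numbers.

```python
import math


def reliable_change_index(pre, post, sd, reliability):
    """Classic RCI: (post - pre) / S_diff.

    S_diff = sqrt(2) * SEM and SEM = sd * sqrt(1 - reliability), where
    `sd` is the standard deviation of the measure and `reliability` is
    its test-retest reliability coefficient.
    """
    if sd <= 0:
        raise ValueError("sd must be positive")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    sem = sd * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0) * sem
    return (post - pre) / s_diff


def classify_change(rci, z=1.96):
    """Label a change as reliable only when |RCI| exceeds z (95% level)."""
    if rci > z:
        return "improved"
    if rci < -z:
        return "deteriorated"
    return "no reliable change"


# Toy numbers (hypothetical): the same measure with sd=10, reliability=0.8.
for pre, post in [(40, 50), (40, 55), (55, 40)]:
    rci = reliable_change_index(pre, post, sd=10, reliability=0.8)
    print(pre, post, round(rci, 3), classify_change(rci))
```

The point of the index is that a raw difference is read against measurement noise, which is why a 10-point gain in the toy data does not count as reliable while a 15-point gain does.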

53. 【2604.27401】Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Link: https://arxiv.org/abs/2604.27401

Authors: Hongliang Liu, Tung-Ling Li, Yuhao Wu

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: generates task-specific causal, task-specific causal hypotheses, probing generates task-specific, one-time intervention sweep, generates task-specific

Abstract:Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.

54. 【2604.27398】Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

Link: https://arxiv.org/abs/2604.27398

Authors: Tomomasa Hara, Hiroto Kurita, Masaaki Imaizumi, Kentaro Inui, Sho Yokoi

Subjects: Computation and Language (cs.CL)

Keywords: standard approach, averages token embeddings, text, token embeddings, pooling

Comments: ACL 2026 Main Conference; GitHub: [this https URL](https://github.com/tohoku-nlp/socm-text-embedding)

Abstract:For constructing text embeddings, mean pooling, which averages token embeddings, is the standard approach. This paper examines whether mean pooling actually works well in real models. First, we note that mean pooling can collapse information beyond the first-order statistics of the token embeddings, such as second-order statistics that capture their spatial structure, potentially mapping distinct token embedding distributions to similar text embeddings. Motivated by this concern, we propose a simple metric to quantify such a collapse induced by mean pooling. Then, using this metric, we empirically measure how often this collapse occurs in actual models and texts, and find that modern text encoders are robust to this collapse. In particular, contrastive fine-tuned text encoders tend to be less prone to the collapse than their pretrained backbone models. We also find that the robustness of these text encoders lies in the concentration of token embeddings within each text. In addition, we find that robustness to the collapse, as quantified by our proposed metric, correlates with downstream task performance. Overall, our findings offer a new perspective on why modern text encoders remain effective despite relying on seemingly coarse mean pooling.
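
The collapse described in this abstract can be seen in a tiny constructed example: two different sets of token embeddings that share the same mean, and therefore the same mean-pooled text embedding, while their second-order statistics differ. The sketch does not implement the paper's metric; the vectors and function names are hypothetical.

```python
def mean_pool(tokens):
    """Average the token vectors coordinate by coordinate."""
    n, dim = len(tokens), len(tokens[0])
    return [sum(t[k] for t in tokens) / n for k in range(dim)]


def second_moment(tokens):
    """Uncentered second-moment matrix: (1/n) * sum of outer products."""
    n, dim = len(tokens), len(tokens[0])
    return [[sum(t[i] * t[j] for t in tokens) / n for j in range(dim)]
            for i in range(dim)]


# Two hypothetical "texts" with two tokens each, in a 2-d embedding space.
text_a = [(1.0, 0.0), (-1.0, 0.0)]   # tokens spread along the first axis
text_b = [(0.0, 1.0), (0.0, -1.0)]   # tokens spread along the second axis

print(mean_pool(text_a), mean_pool(text_b))          # identical means
print(second_moment(text_a), second_moment(text_b))  # different spread
```

Mean pooling cannot distinguish these two inputs, although their token distributions point in orthogonal directions. Per the abstract, the paper measures how often such collapse occurs in actual models and texts.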

55. 【2604.27393】MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Link: https://arxiv.org/abs/2604.27393

Authors: Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, Yuan Yao

Subjects: Computation and Language (cs.CL)

Keywords: static offline data, offline data processing, Recent progress, multimodal large language, large language models

Abstract:Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.

56. 【2604.27392】Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams

Link: https://arxiv.org/abs/2604.27392

Authors: Alejandro R. Jadad

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: human, decision, Pure Human, artificial intelligence, artificial intelligence work

Comments: 13 pages, 1 figure, 1 table, 1 appendix, 8 references

Abstract:What shapes a consequential decision when human and artificial intelligence work on it together? The answer is becoming harder to see. A decision may look human-led after AI has set the frame, or appear automated while human judgment still carries decisive force. This paper offers a leadership-facing spectrum to see those relationships within a bounded mandate: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI. The spectrum asks where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows. The five positions are landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision. The central risk is misrecognition: leaders may keep a human-centered story in place after decision-shaping authority has shifted elsewhere. They may believe oversight remains meaningful when it has become ceremonial, or keep humans in the loop when their involvement could make the decision worse. The framework introduces co-adaptability, the capacity of a configuration to improve as human and non-human participants adjust together, and places it within heterogeneous teaming, where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation. The aim is practical: to help strategic leaders and those designing or deploying AI systems recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them. These configurations will shape how power, responsibility, and trust are distributed in organizational life. Whether the futures they help create remain governable and worth inhabiting will depend on leaders who can see, early enough, where and how consequential decisions are actually being shaped.

57. 【2604.27379】Proactive Dialogue Model with Intent Prediction

Link: https://arxiv.org/abs/2604.27379

Authors: Yang Luo

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: anticipating upcoming intents, current user turn, Temporal Bayesian Network, inherently reactive, multi-intent settings

Comments: 9 pages, 1 figure

Click to view abstract

Abstract:Dialogue models are inherently reactive, responding to the current user turn without anticipating upcoming intents, which leads to redundant interactions in multi-intent settings. We address this limitation by introducing a lightweight intent-transition prior derived from dialogue data and injected into the system prompt at inference time. We instantiate this prior using a Temporal Bayesian Network (T-BN) trained on per-turn intent annotations in MultiWOZ 2.2. The T-BN achieves Recall@5 = 0.787 and MRR = 0.576 on 1,071 held-out USER-turn pairs. In a ground-truth replay over 200 dialogues, BN-guided generation improves Coverage AUC from 0.742 to 0.856 and reduces the number of turns required to reach 75% intent coverage from 3.95 to 2.73. These results show that lightweight intent-transition guidance enables more proactive and efficient dialogue behavior without modifying the underlying language model.
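
To make the reported ranking metrics concrete, here is a minimal sketch of how Recall@k and MRR can be computed for a next-intent predictor. It swaps in a plain first-order transition count table, not the paper's Temporal Bayesian Network, and the intent names and toy dialogues are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(dialogues):
    """Count how often intent `nxt` directly follows intent `prev`."""
    counts = defaultdict(Counter)
    for intents in dialogues:
        for prev, nxt in zip(intents, intents[1:]):
            counts[prev][nxt] += 1
    return counts

def rank_next(counts, prev):
    """Candidate next intents, most frequent first (ties broken by name)."""
    items = sorted(counts[prev].items(), key=lambda kv: (-kv[1], kv[0]))
    return [intent for intent, _ in items]

def recall_at_k(counts, pairs, k):
    """Fraction of (prev, gold) pairs whose gold intent is in the top k."""
    hits = sum(1 for prev, gold in pairs if gold in rank_next(counts, prev)[:k])
    return hits / len(pairs)

def mrr(counts, pairs):
    """Mean reciprocal rank of the gold intent (0 when it is never ranked)."""
    total = 0.0
    for prev, gold in pairs:
        ranking = rank_next(counts, prev)
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(pairs)

# Hypothetical toy data: each dialogue is a sequence of per-turn intents.
train = [
    ["find_hotel", "book_hotel", "find_taxi"],
    ["find_hotel", "book_hotel", "find_restaurant"],
    ["find_hotel", "find_restaurant", "book_restaurant"],
]
held_out = [
    ("find_hotel", "book_hotel"),
    ("find_hotel", "find_restaurant"),
    ("book_hotel", "find_train"),  # never seen in training
]

model = fit_transitions(train)
print(recall_at_k(model, held_out, 5), mrr(model, held_out))
```

In a setup like the one described, the top-ranked intents would be written into the system prompt as the prior; that step is omitted here.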

58. 【2604.27374】Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

Link: https://arxiv.org/abs/2604.27374

Authors: Sidi Chang,Peiying Zhu,Yuxiao Chen,Rongdong Chai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: NLP benchmarks increasingly, financial NLP benchmarks, benchmarks increasingly function, supervised financial NLP, financial NLP

Comments: 16 Pages, Submitted to IEEE Computational Intelligence in Financial Engineering and Economics (CIFEr) 2026, Tokyo, JP

Click to view abstract

Abstract:As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy. We study this measurement risk on Japanese Financial Implicit-Commitment Recognition (JF-ICR; a pinned 253-item test split x 4 frontier LLMs x 5 rubrics x 3 temperatures x 5 ordinal metrics). Three findings follow. First, rubric wording materially changes model-assigned labels: R2--R3 agreement ranges from 70.0% to 83.4%, with the dominant movement near the +1 / 0 implicit-commitment boundary. This pattern is consistent with a pragmatic-boundary interpretation, but is not a validated linguistic-causality claim because the present rubric variants confound semantics, examples, and verbosity. Second, not every metric remains informative under the JF-ICR class distribution. Within-one accuracy is too easy because near misses receive credit and the majority class dominates; worst-class accuracy is too noisy because the rarest class has only two examples. Exact accuracy, macro-F1, and weighted kappa are therefore the identifiable metrics under our operational rule. Third, ranking claims become more defensible only after this metric-identifiability audit: Bradley--Terry, Borda, and Ranked Pairs agree on the identifiable metric subset, while the full five-metric sweep produces disagreement on the closest pair. The contribution is not a new leaderboard, but a reporting discipline for supervised financial benchmarks whose gold labels exist and whose evaluation ruler still requires governance.

59. 【2604.27369】Emotion-Aware Clickbait Attack in Social Media

Link: https://arxiv.org/abs/2604.27369

Authors: Syed Mhamudul Hasan,Mohd. Farhan Israk Soumik,Abdur R. Shahid

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: specific structural patterns, disproportionately high emotional, high emotional intensity, emotional intensity relative, informational content

Comments:

Click to view abstract

Abstract:Clickbait is characterized by disproportionately high emotional intensity relative to informational content, often reinforced by specific structural patterns. However, current research considers clickbait as a static textual phenomenon characterized by linguistic patterns and structural cues. Additionally, existing detection systems primarily rely on surface-level features of clickbait. This paper introduces an emotion-aware clickbait generation attack, where stylistic transformations are used to optimize emotional impact. We propose an emotion-aware framework based on the Valence-Arousal-Dominance (VAD) space to model the emotional dynamics underlying clickbait generation for optimal user engagement. To simulate realistic attack scenarios, we align clickbait headlines with semantically similar social media posts using Sentence-BERT and generate multiple stylistic rewrites via Large Language Models (LLMs). Building on this, we define a Curiosity Gap (CG) function that computes clickbait's headline variation to the current post to quantify how emotional activation will contribute to user curiosity and evade the existing system found on social media. Experimental results demonstrate that emotion-aware stylization significantly degrades the performance of state-of-the-art classifiers, leading to misclassification rates ranging from 2.58% to 30.63% on the base system.

60. 【2604.27359】IO-SHACL: Comprehensive SHACL validation for TMF Intent Ontologies

Link: https://arxiv.org/abs/2604.27359

Authors: Jean Martins,Leonid Mokrushin,Marin Orlic

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Intent-based networking promises, Intent-based networking, revolutionize telecommunications network, telecommunications network management, Forum Intent Ontology

Comments: 15 pages, 2 figures, target:ISWC

Click to view abstract

Abstract:Intent-based networking promises to revolutionize telecommunications network management by enabling operators to specify high-level goals rather than low-level configurations. The TM Forum Intent Ontology (tio) provides a standardized vocabulary for expressing network intents, yet lacks formal validation mechanisms to ensure intent correctness before its admission. We present tio-shacl, the first comprehensive SHACL (Shapes Constraint Language) validation framework for the TMF Intent Ontology. Our contribution includes 56 node shapes and 69 property shapes across all 15 tio v3.6.0 ontology modules, a reusable constraint library with 25 parameterized SPARQL-based constraint components, and novel validation patterns for recursive logical operators, quantity-based constraints, and cross-expectation relationships. We pursued 100% vocabulary coverage (87 classes, 109 properties, 72 functions), cross-implementation compatibility across three major SHACL engines, and validation accuracy on a corpus of 133 test cases. tio-shacl is publicly available under MIT license at this https URL and enables automated syntactic and semantic validation of network intents, addressing a critical gap in the field.

61. 【2604.27351】Heterogeneous Scientific Foundation Model Collaboration

Link: https://arxiv.org/abs/2604.27351

Authors: Zihao Li,Jiaru Zou,Feihao Fang,Xuying Ning,Mengting Ai,Tianxin Wei,Sirui Chen,Xiyuan Yang,Jingrui He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: demonstrated strong capabilities, foundation models, Agentic large language, strong capabilities, large language model

Comments: Preprint. 57 Pages

Click to view abstract

Abstract:Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world problems, especially in scientific domains where domain-specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic framework designed to extend language-centric systems to a broader class of scientific foundation models. The key idea of Eywa is to augment domain-specific foundation models with a language-model-based reasoning interface, enabling language models to guide inference over non-linguistic data modalities. This design allows predictive foundation models, which are typically optimized for specialized data and tasks, to participate in higher-level reasoning and decision-making processes within agentic systems. Eywa can serve as a drop-in replacement for a single-agent pipeline (EywaAgent) or be integrated into existing multi-agent systems by replacing traditional agents with specialized agents (EywaMAS). We further investigate a planning-based orchestration framework in which a planner dynamically coordinates traditional agents and Eywa agents to solve complex tasks across heterogeneous data modalities (EywaOrchestra). We evaluate Eywa across a diverse set of scientific domains spanning physical, life, and social sciences. Experimental results demonstrate that Eywa improves performance on tasks involving structured and domain-specific data, while reducing reliance on language-based reasoning through effective collaboration with specialized foundation models.

62. 【2604.27345】LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human--LLM Judgment Gaps

Link: https://arxiv.org/abs/2604.27345

Authors: Keito Inoshita,Xiaokang Zhou,Akira Kawai,Katsutoshi Yada

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, single gold standard, annotators frequently disagree, evaluations of Large

Comments:

Click to view abstract

Abstract:Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human--LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.

63. 【2604.27296】To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing

Link: https://arxiv.org/abs/2604.27296

Authors: Wei Cheng,Yongchang Cao,Chen Shen,Binhua Li,Jue Chen,Yongbin Li,Wei Hu

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, severe efficiency bottlenecks, interactive coding assistants, Language Models

Comments: Accepted in the Findings of ACL 2026

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly used for code editing, yet the prevalent full-code generation paradigm suffers from severe efficiency bottlenecks, posing challenges for interactive coding assistants that demand low latency and cost. Despite the predominant focus on scaling model capabilities, the edit format itself has been largely overlooked in model training. In this paper, we begin with a systematic study of conventional diff formats and reveal that fragile offsets and fragmented hunks make generation highly unnatural for LLMs. To address it, we introduce BlockDiff and FuncDiff, two structure-aware diff formats that represent changes as block-level rewrites of syntactically coherent units such as control structures and functions. Furthermore, we propose AdaEdit, a general adaptive edit strategy that trains LLMs to dynamically choose the most token-efficient format between a given diff format and full code. Extensive experiments demonstrate that AdaEdit paired with structure-aware diff formats consistently matches the accuracy of full-code generation, while reducing both latency and cost by over 30% on long-code editing tasks.

64. 【2604.27283】Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

Link: https://arxiv.org/abs/2604.27283

Authors: Mehmet Iscan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language model, Large language, based coding agents, coding agents increasingly, agents increasingly rely

Comments: 26 pages, 7 figures, 10 tables. Code and deterministic local artifacts are available at the repository listed in the paper

Click to view abstract

Abstract:Large language model (LLM)-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved memory is useful only when the current failure is genuinely compatible with a previous one; superficial similarity in stack traces, terminal errors, paths, or configuration symptoms can lead to unsafe memory injection. This paper reframes issue-memory use as a selective, risk-sensitive control problem rather than a pure top-k retrieval problem. We introduce RSCB-MC, a risk-sensitive contextual bandit memory controller that decides whether an agent should use no memory, inject the top resolution, summarize multiple candidates, perform high-precision or high-recall retrieval, abstain, or ask for feedback. The system stores reusable issue knowledge through a pattern-variant-episode schema and converts retrieval evidence into a fixed 16-feature contextual state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost. Its reward design penalizes false-positive memory injection more strongly than missed reuse, making non-injection and abstention first-class safety actions. In deterministic smoke-scale artifacts, RSCB-MC obtains the strongest non-oracle offline replay success rate, 62.5%, while maintaining a 0.0% false-positive rate. In a bounded 200-case hot-path validation, it reaches 60.5% proxy success with 0.0% false positives and a 331.466 microseconds p95 decision latency. The results show that, for coding-agent memory, the key question is not only which memory is most similar, but whether any retrieved memory is safe enough to influence the debugging trajectory.
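
The asymmetric reward idea, where a false-positive memory injection costs more than a missed reuse, can be illustrated with a plain expected-reward rule. This is only a sketch of that one idea: it is not a bandit learner and not RSCB-MC itself, and the probability input and all cost values are hypothetical.

```python
def choose_action(p_compatible, reuse_gain=1.0, false_positive_cost=4.0, miss_cost=0.5):
    """Pick 'inject' or 'abstain' by comparing expected rewards.

    p_compatible: estimated probability that the retrieved memory really
    matches the current failure (hypothetical input).
    The costs are illustrative; false_positive_cost > miss_cost encodes
    that a wrong injection is penalized more than a missed reuse.
    """
    # Injecting pays off when the memory is compatible, and is penalized otherwise.
    expected_inject = p_compatible * reuse_gain - (1.0 - p_compatible) * false_positive_cost
    # Abstaining only loses the (smaller) value of a reuse that was skipped.
    expected_abstain = -p_compatible * miss_cost
    return "inject" if expected_inject > expected_abstain else "abstain"

for p in (0.9, 0.6):
    print(p, choose_action(p))
```

With these illustrative costs, injection is chosen only when the estimated compatibility exceeds 4 / 5.5, roughly 0.73, so abstention is the default under moderate uncertainty.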

65. 【2604.27272】When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

Link: https://arxiv.org/abs/2604.27272

Authors: Chung-Hsiang Lo,Lu Li,Diji Yang,Tianyu Zhang,Yunkai Zhang,Yoshua Bengio,Yi Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, conventionally process structured, token sequences, Large language, conventionally process

Comments:

Click to view abstract

Abstract:Large language models (LLMs) conventionally process structured inputs as 1D token sequences. While natural for prose, such linearization may introduce additional representational burden for tasks whose computation depends directly on explicit 2D structure, because row--column alignment and local neighborhoods are no longer directly expressed in the input. We study this setting, which we refer to as serialization friction, on a small diagnostic testbed of synthetic tasks with explicit 2D structure: matrix transpose, Conway's Game of Life, and LU decomposition. To examine this question, we compare a text-only language pathway over serialized inputs with a vision-augmented pathway, built on the same language backbone, that receives the same underlying content rendered in task-faithful 2D layout, yielding a system-level comparison between two end-to-end input pathways. Across the tasks and settings we study, the visual pathway consistently outperforms the textual pathway; the gap often widens at larger dimensions, and error patterns under serialization become increasingly spatially structured. These findings indicate that the relationship between input representation and model performance on such tasks warrants further investigation, and suggest that preserving task-relevant 2D layout is a promising direction for structured 2D tasks.
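
As a small illustration of why linearization hides row-column alignment, the sketch below serializes a grid in row-major order and then transposes it using only index arithmetic on the flat sequence. The helper names are hypothetical and the example is not taken from the paper's testbed.

```python
def serialize(grid):
    """Flatten a 2D grid into a row-major 1D sequence."""
    return [value for row in grid for value in row]

def transpose_serialized(flat, rows, cols):
    """Transpose a row-major sequence without rebuilding the 2D grid.

    Element (i, j) sits at index i * cols + j in the input and must move
    to index j * rows + i in the output, so vertically adjacent cells are
    `cols` positions apart in the serialized form.
    """
    out = [None] * len(flat)
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = flat[i * cols + j]
    return out

grid = [[1, 2, 3],
        [4, 5, 6]]
flat = serialize(grid)
print(flat)
print(transpose_serialized(flat, rows=2, cols=3))
```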

66. 【2604.27263】Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Link: https://arxiv.org/abs/2604.27263

Authors: Théo Gigant,Bowen Peng,Jeffrey Quesnelle

Categories: Computation and Language (cs.CL)

Keywords: remain poorly understood, modern large language, performance remain poorly, model performance remain, large language models

Comments: 14 pages, 7 figures

Click to view abstract

Abstract:Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses across various dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights to improve the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the integration of subword boundaries as either explicit priors or inductive biases.

67. 【2604.27251】Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Link: https://arxiv.org/abs/2604.27251

Authors: Xingwei Tan,Marco Valentino,Mahmud Elahi Akhter,Yuxiang Zhou,Maria Liakata,Nikolaos Aletras

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, shared inference patterns, acquire reasoning capabilities, Language Models

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. Notably, task accuracy is not strictly determined by sensibility, with models often maintaining high performance even when using conflicting patterns, suggesting a reliance on internalized parametric memory that increases with model size. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

68. 【2604.27249】Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

Link: https://arxiv.org/abs/2604.27249

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: language models engage, multiple-choice evaluations, instructed to underperform, underperform on multiple-choice, engage with question

Comments: 12 pages, 3 figures, 3 tables. Pre-registered on OSF ( [this http URL](http://osf.io/7p64) )

Click to view abstract

Abstract:When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) on 2,000 MMLU-Pro items. Distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) jointly characterise each condition. The gradient reveals three regimes rather than a monotonic transition. Vague adversarial instructions produce moderate accuracy reduction with preserved content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme positional collapse, with near-total concentration on a single response position (99.9% and 87.4%) and no measurable content sensitivity. This was the only multi-step instruction tested, and it produced the most extreme shortcut. The attractor position matches each model's content-absent null-prompt default. The effect replicates across both models and four academic domains. Distributional collapse and content engagement can co-occur (50% concordance between screening criteria), indicating that entropy-based screening and difficulty-based content assessment capture partially independent dimensions of response validity. Results suggest that instruction complexity can determine whether adversarial compliance uses content-aware or content-blind mechanisms in small instruction-tuned LLMs under greedy decoding.
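
The distributional screening step can be pictured as the Shannon entropy of the chosen answer positions. The sketch below is a minimal version under stated assumptions: dividing by the maximum entropy and the toy response lists are illustrative choices, not the paper's exact procedure.

```python
import math
from collections import Counter

def position_entropy(responses, n_options):
    """Normalized Shannon entropy of chosen answer positions.

    Returns 1.0 when choices are spread uniformly over n_options and
    0.0 when every response lands on the same position.
    """
    counts = Counter(responses)
    total = len(responses)
    bits = sum((c / total) * math.log2(total / c) for c in counts.values())
    return bits / math.log2(n_options)

# Hypothetical response sets over a 10-option item format.
spread = list("ABCDEFGHIJ") * 100          # uniform over all positions
collapsed = ["A"] * 999 + ["B"]            # 99.9% on a single position

print(position_entropy(spread, 10))
print(position_entropy(collapsed, 10))
```

A value near zero flags positional collapse but says nothing about content engagement, which is why the abstract pairs it with a separate difficulty-accuracy criterion.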

69. 【2604.27232】Targeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs

Link: https://arxiv.org/abs/2604.27232

Authors: Serpil Karabüklü,Kanishka Misra,Shester Gueuwou,Diane Brentari,Greg Shakhnarovich,Karen Livescu

Categories: Computation and Language (cs.CL)

Keywords: sign language, text and speech, sign, American Sign Language, language

Comments:

Click to view abstract

Abstract:Models of sign language have historically lagged behind those for spoken language (text and speech). Recent work has greatly improved their performance on tasks like sign language translation and isolated sign recognition. However, it remains unclear to what extent existing models capture various linguistic phenomena of sign language, and how well they use cues from the multiple articulators used in sign language (hands, upper body, face). We introduce a new benchmark dataset for American Sign Language, ASL Minimal Translation Pairs (ASL-MTP), divided into multiple types of sign language phenomena and corresponding minimal pairs of translations, for performing such linguistic analyses. As a case study, we use ASL-MTP to analyze a state-of-the-art ASL-to-English translation model. We conduct a targeted analysis of the model by ablating various input cues during training and inference and evaluating on the phenomena in ASL-MTP. Our results show that, while the model performs above chance level on most of the phenomena, it relies strongly on manual cues while often missing crucial non-manual cues.

70. 【2604.27228】When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM-Based Political Statement Analysis

Link: https://arxiv.org/abs/2604.27228

Authors: Juergen Dietrich

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

Keywords: Democratic discourse analysis, Democratic discourse, distinct evaluator models, Directional Drift Index, assigned adversarial roles

Comments: 22 pages

Click to view abstract

Abstract:Democratic discourse analysis systems increasingly rely on multi-agent LLM pipelines in which distinct evaluator models are assigned adversarial roles to generate structured, multi-perspective assessments of political statements. A core assumption is that models will reliably maintain their assigned roles. This paper provides the first systematic empirical test of that assumption using the TRUST pipeline. We develop an epistemic stance classifier that identifies advocate roles from reasoning text without relying on surface vocabulary, and measure role fidelity across 60 political statements (30 English, 30 German) using four metrics: Role Drift Index (RDI), Expected Drift Distance (EDD), Directional Drift Index (DDI), and Entropy-based Role Stability (ERS). We identify two failure modes - the Epistemic Floor Effect (fact-check results create an absolute lower bound below which the legitimizing role cannot be maintained) and Role-Prior Conflict (training-time knowledge overrides role instructions for factually unambiguous statements) - as manifestations of a single mechanism: Epistemic Role Override (ERO). Model choice significantly affects role fidelity: Mistral Large outperforms Claude Sonnet by 28pp (67% vs. 39%) and exhibits a qualitatively different failure mode - role abandonment without polarity reversal - compared to Claude's active switch to the opposing stance. Role fidelity is language-robust. Fact-check provider choice is not universally neutral: Perplexity significantly reduces Claude's role fidelity on German statements (Delta = -15pp, p = 0.007) while leaving Mistral unaffected. These findings have direct implications for multi-agent LLM validation: a system validated without role fidelity measurement may systematically misrepresent the epistemic diversity it was designed to provide.

71. 【2604.27204】Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping

Link: https://arxiv.org/abs/2604.27204

Authors: Tobias Bystrich,Julia M. Pritzen,Christoph A. Schmidt,Claudia Wich-Reif

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: universal automatic phonetic, automatic phonetic transcription, diverse training transcriptions, clean and diverse, field of universal

Comments: Accepted at LREC 2026

Click to view abstract

Abstract:In the field of universal automatic phonetic transcription (APT), clean and diverse training transcriptions are required. However, such high-quality data is limited. We propose the bootstrapping approach Selective Augmentation to improve the available training transcriptions by selectively transferring distinctions between languages. Based on the model MultIPA, we exemplarily show that we could increase the accuracy of an existing feature (plosive voicing) and add a new feature (plosive aspiration) by augmenting the existing training data using information from a separate helper language (Hindi). We describe intrinsic challenges of the evaluation and develop objective metrics to determine the success: Voicing accuracy was increased by 17.6% by reducing the number of false positives. Additionally, aspiration recognition was introduced: While the baseline transcribed 0% of German /p, t, k/ as aspirated, our approach transcribed them as aspirated in 61.2% of the cases. Introducing aspiration recognition to APT models allowed for the tenuis class to be successfully reduced by 32.2%, which also reduces the conflations between the test language's plosives.

72. 【2604.27201】Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

Link: https://arxiv.org/abs/2604.27201

Authors: Shouren Wang,Wang Yang,Chuang Ma,Debargha Ganguly,Vikash Singh,Chaoda Song,Xinpeng Li,Xianxuan Long,Vipin Chaudhary,Xiaotian Han

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Hybrid-thinking language models, language models expose, models expose explicit, Hybrid-thinking language, separate them cleanly

Comments: 27 pages, 9 figures, 6 tables. Under review

Click to view abstract

Abstract:Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data curation and multi-stage training, yet leakage remains because both modes are still encoded in the same feed-forward parameters. We propose Path-Lock Expert (PLE), an architecture-level solution that replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model's per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks, PLE maintains strong think performance while producing a substantially stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage. On Qwen3-4B, for example, PLE reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%, all while preserving think-mode performance. These results suggest that controllable hybrid thinking is fundamentally an architectural problem, and separating mode-specific feed-forward pathways is a simple and effective solution.
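
The routing scheme described above, with shared components plus one feed-forward expert per mode chosen once per sequence, can be sketched with toy functions in place of real network layers. The control-token strings, class name, and toy experts are hypothetical and only illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]

@dataclass
class PathLockLayer:
    # Stand-in for the shared parts of a decoder layer (attention, normalization).
    shared: Callable[[Vector], Vector]
    # One feed-forward expert per reasoning mode.
    experts: Dict[str, Callable[[Vector], Vector]]

    def forward(self, hidden: List[Vector], mode: str) -> List[Vector]:
        expert = self.experts[mode]  # selected once, then used for every token
        return [expert(self.shared(h)) for h in hidden]

def route(tokens: List[str]) -> str:
    """Deterministic router: the mode is read from the leading control token."""
    control = tokens[0]
    if control == "<think>":        # hypothetical control-token spelling
        return "think"
    if control == "<no_think>":     # hypothetical control-token spelling
        return "no_think"
    raise ValueError(f"unknown control token: {control!r}")

layer = PathLockLayer(
    shared=lambda v: v,  # identity, for illustration only
    experts={
        "think": lambda v: [2.0 * x for x in v],
        "no_think": lambda v: [x + 1.0 for x in v],
    },
)

mode = route(["<no_think>", "What", "is", "2+2?"])
print(mode, layer.forward([[1.0, 2.0], [3.0, 4.0]], mode))
```

Because only one expert runs per sequence, each token still passes through a single feed-forward path, which matches the abstract's point that inference keeps the dense model's per-token computation pattern.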

73. 【2604.27169】Semantic Structure of Feature Space in Large Language Models

Link: https://arxiv.org/abs/2604.27169

Authors: Austin C. Kozlowski,Andrei Boutyline

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models', language models' hidden, models' hidden states, hidden states closely, states closely mirror

Comments:

Click to view abstract

Abstract:We show that the geometric relations between semantic features in large language models' hidden states closely mirror human psychological associations. We construct feature vectors corresponding to 360 words and project them on 32 semantic axes (e.g. beautiful-ugly, soft-hard), and find that these projections correlate highly with human ratings of those words on the respective semantic scales. Second, we find that the cosine similarities between the semantic axes themselves are highly predictive of the correlations between these scales in the survey. Third, we show that substantial variance across the 32 semantic axes lies on a low-dimensional subspace, reproducing patterns typical of human semantic associations. Finally, we demonstrate that steering a word on one semantic axis causes spillover effects on the model's rating of that word on other semantic scales proportionate to the cosine similarity between those semantic axes. These findings suggest that features should be understood not only in isolation but through their geometric relations and the meaningful subspaces they form.
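
The two geometric quantities this abstract relies on, projection onto a semantic axis and cosine similarity between axes, can be sketched as follows. The abstract does not say how the axes are built; defining an axis as the difference of two pole vectors is an assumption made here for illustration, and the 3-dimensional vectors are invented.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def axis(pos_pole: Vector, neg_pole: Vector) -> Vector:
    """Semantic axis as the difference of two pole vectors (an assumed construction)."""
    return [p - n for p, n in zip(pos_pole, neg_pole)]


def project(word: Vector, ax: Vector) -> float:
    """Scalar projection of a word vector onto the unit-normalized axis."""
    return dot(word, ax) / norm(ax)


def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (norm(a) * norm(b))


# Made-up vectors purely for illustration.
beautiful, ugly = [1.0, 0.2, 0.0], [-1.0, 0.0, 0.0]
soft, hard = [0.6, 1.0, 0.0], [-0.2, -1.0, 0.0]
word = [0.5, 0.5, 0.1]

beauty_axis = axis(beautiful, ugly)
soft_axis = axis(soft, hard)

score_beauty = project(word, beauty_axis)          # the word's position on beautiful-ugly
axis_similarity = cosine(beauty_axis, soft_axis)   # how aligned the two scales are
```

In this linear picture, moving a word by an amount d along one unit axis changes its projection on a second axis by d times the cosine similarity of the two axes, which is the kind of proportional spillover the abstract reports for steering.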

74. 【2604.27137】Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

Link: https://arxiv.org/abs/2604.27137

Authors: Camelia Baluta

Categories: Computation and Language (cs.CL)

Keywords: Skill Level Descriptions, Interagency Language Roundtable, Skill Level, Language Roundtable, Descriptions and applies

Comments: 12 prompt clusters × 6 languages × 3 runs; data and code at [this http URL](http://github.com/camelbal-ship-it/crosslingual-claude-eval)

Click to view abstract

Abstract:This paper introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages: English, French, Romanian, Spanish, Italian, and German. We administer a battery of 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collect 216 responses (12 prompts, 6 languages, 3 runs), and analyze outputs through a two-layer methodology combining automated quantitative metrics with expert ILR qualitative assessment. Quantitative analysis reveals that French responses are approximately 30% longer than German responses on identical prompts, and that creative and affective clusters show the highest cross-lingual surface divergence. Qualitative analysis, conducted by a six-language professional with 12 years of ILR/OPI assessment experience, identifies five cross-lingual variation patterns: systematic differences in pragmatic disambiguation strategies, aesthetic and literary tradition divergence in creative output, language-internal technical terminology norms, cultural calibration gaps evidenced by the absence of culture-specific content in favor of culturally neutralized templates, and language-specific institutional referral behavior in emotional support responses. We argue that ILR-informed expert judgment applied to LLM outputs constitutes a novel and underreported evaluation methodology that complements purely computational benchmarks, and that cross-lingual output variation in Claude is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.

75. 【2604.27115】Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

Link: https://arxiv.org/abs/2604.27115

Authors: M. K. Khalidi Siam,Md. Tausif-Ul-Islam,Md. Reshad Romim Khan,Mohammed Ali Hossain,Mushfiqul Amin,Labib Hasan Khan,Niloy Farhan,Farig Sadeque

Categories: Computation and Language (cs.CL)

Keywords: models contribute uniformly, pruning, large language models, computational cost, footprint of large

Comments:

Click to view abstract

Abstract:Neuron pruning is widely used to reduce the computational cost and parameter footprint of large language models, yet it remains unclear whether neurons in task-specific models contribute uniformly to task performance. In this work, we provide empirical evidence for the existence and importance of task-specific neurons through a systematic pruning study on language models specialized for mathematical reasoning and code generation. We introduce an activation-based selectivity metric to identify neurons with low contribution to the target task and prune them while preserving target-task accuracy, and compare selective pruning with random pruning. Selective pruning consistently outperforms random pruning, indicating that activation-based selectivity provides a systematic advantage over random pruning. Reverse pruning experiments further show that removing a small subset of highly task-specific neurons (~10%) causes complete performance collapse, suggesting that task-specific neurons exist and that critical task information is concentrated in a small portion of the network. In contrast, selective pruning of less critical neurons (~30-35%) reduces accuracy but still preserves significant performance. We also observed consistent reductions in parameters and runtime VRAM usage, along with improved inference throughput as pruning increases. Experiments on both 1.5B and 7B models reveal a robustness threshold around 15-20% pruning, beyond which accuracy loss and generation failures increase sharply. Fine-tuning substantially recovers performance across pruning levels, particularly for aggressively pruned models. These findings provide empirical evidence of neuron specialization in task-specific language models and offer insights into pruning robustness, model redundancy, and post-pruning recoverability.

76. 【2604.27093】Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

Link: https://arxiv.org/abs/2604.27093

Authors: Mingqian Zheng,Malia Morgan,Liwei Jiang,Carolyn Rose,Maarten Sap

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Current LLM safety, Current LLM, alignment techniques improve, LLM safety alignment, safety alignment techniques

Comments:

Click to view abstract

Abstract:Current LLM safety alignment techniques improve model robustness against adversarial attacks, but overlook whether and how LLMs can recover helpfulness when benign users clarify their intent. We introduce CarryOnBench, the first interactive benchmark that measures whether LLMs can revise their interpretation of user intent and recover utility, while remaining safe through multi-turn conversations. Starting from 398 seemingly harmful queries with benign underlying intents, we simulate 5,970 conversations by varying user follow-up sequences, evaluating 14 models on both intent-aligned utility and safety. CarryOnBench yields 1,866 different conversation flows of 4–12 turns, totaling 23,880 model responses. We design Ben-Util, a checklist-based metric that evaluates how well each model response fulfills the user's benign information need using atomic items. At turn one, models fulfill only 10.5–37.6% of the user's benign information need. When the same query includes the benign intent upfront, models fulfill 25.1–72.1%, confirming that models withhold information due to intent misinterpretation, not limited knowledge. With benign clarifications in multi-turn conversations, 13 of 14 models approach or exceed this single-turn baseline, yet recovery cost varies across models. We identify three failure modes invisible to single-turn evaluations: utility lock-in, where a model rarely updates despite clarification; unsafe recovery, where a model updates at disproportionate safety cost; and repetitive recovery, where a model recycles prior responses rather than providing new information. Moreover, conversations converge to similar harmfulness levels regardless of how conservative the model starts. These findings expose a gap that single-turn evaluations miss: whether a model is appropriately cautious or simply unresponsive to clarified user intent.

77. 【2604.27045】Detecting Clinical Discrepancies in Health Coaching Agents: A Dual-Stream Memory and Reconciliation Architecture

Link: https://arxiv.org/abs/2604.27045

Authors: Samuel L Pugh,Eric Yang,Alexander Muir Sutherland,Alessandra Breschi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Language Model, Large Language, persistent systems managing, longitudinal healthcare journeys

Comments:

Click to view abstract

Abstract:As Large Language Model (LLM) agents transition from single-session tools to persistent systems managing longitudinal healthcare journeys, their memory architectures face a critical challenge: reconciling two imperfect sources of truth. The patient's evolving self-report is current but prone to recall bias, while the Electronic Health Record (EHR) is medically validated but frequently stale. General-purpose agent memory systems optimize for coherence by overwriting older facts with the user's latest statement, a pattern that risks safety failures when applied to clinical data. We introduce a Dual-Stream Memory Architecture that strictly separates the patient narrative from the structured clinical record (FHIR), governed by a dedicated Reconciliation Engine that evaluates every extracted memory against the patient's FHIR profile and classifies discrepancies by type, severity, and the specific FHIR resources involved. We evaluate this architecture on 26 patients across 675 longitudinal wellness coaching sessions, using a hybrid dataset that interleaves real provider-patient transcripts with synthetic, FHIR-grounded clinical scenarios. In isolated testing, the engine detects 84.4% of designed clinical discrepancies with 86.7% safety-critical recall. By coupling extraction and reconciliation evaluation on the same data, we directly quantify a 13.6% error cascade, tracing the degradation to clinical details lost during memory extraction from unstructured conversation rather than to downstream classification errors. These findings establish that validating patient-reported memories against clinical records is both feasible and necessary for safe deployment of longitudinal health agents.

78. 【2604.27043】CL-bench Life: Can Language Models Learn from Real-Life Context?

Link: https://arxiv.org/abs/2604.27043

Authors: Shihan Dou,Yujiong Shen,Chenhao Huang,Junjie Ye,Jiayi Chen,Junzhe Wang,Qianyu He,Shichun Liu,Changze Lv,Jiahang Lin,Jiazheng Zhang,Ming Zhang,Shaofan Liu,Tao Ji,Zhangyue Yin,Cheng Zhang,Huaibing Xie,Jianglu Hu,Jingcheng Deng,Lincheng Li,Minda Hu,Shaolei Wang,Syrus Zhao,Weichao Wang,Yan Lei,Yang Liu,Yanling Xiao,Yiting Liu,Zenan Xu,Zhen Guo,Ziliang Zhao,Pluto Zhou,Tao Gui,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Di Wang,Shunyu Yao

Categories: Computation and Language (cs.CL)

Keywords: increasingly important capability, handle context effectively, Today AI assistants, OpenClaw are designed, increasingly important

Comments: 50 pages, 11 figures

Click to view abstract

Abstract:Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only 19.3% task solving rate, while the average performance across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life. CL-bench Life provides a crucial testbed for advancing real-life context learning, and progress on it can enable more intelligent and reliable AI assistants in everyday life.

79. 【2604.27039】Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

Link: https://arxiv.org/abs/2604.27039

Authors: Zhen Zhang,Changyi Yang,Zijie Xia,Zhen Yang,Chengzhi Liu,Zhaotiao Weng,Yepeng Liu,Haobo Chen,Jin Pan,Chenyang Zhao,Yuheng Bu,Alkesh Patel,Zhe Gan,Xin Eric Wang

Categories: Computation and Language (cs.CL)

Keywords: modern autoregressive models, length directly influences, length, generation length directly, fundamental unit

Comments:

Click to view abstract

Abstract:Tokens serve as the fundamental unit of computation in modern autoregressive models, and generation length directly influences both inference cost and reasoning performance. Despite its importance, existing approaches lack fine-grained length modeling, operating primarily at the coarse-grained sequence level. We introduce the Length Value Model (LenVM), a token-level framework that models the remaining generation length. By formulating length modeling as a value estimation problem and assigning a constant negative reward to each generated token, LenVM predicts a bounded, discounted return that serves as a monotone proxy for the remaining generation horizon. This formulation yields supervision that is annotation-free, dense, unbiased, and scalable. Experiments on LLMs and VLMs demonstrate that LenVM provides a highly effective signal at inference time. On the LIFEBench exact length matching task, applying LenVM to a 7B model improves the length score from 30.9 to 64.8, significantly outperforming frontier closed-source models. Furthermore, LenVM enables continuous control over the trade-off between performance and efficiency. On GSM8K at a budget of 200 tokens, LenVM maintains 63% accuracy compared to 6% for the token-budget baseline. It also accurately predicts total generation length from the prompt boundary. Finally, LenVM's token-level values offer an interpretable view of generation dynamics, revealing how specific tokens shift reasoning toward shorter or longer regimes. Results demonstrate that LenVM supports a broad range of applications and that token length can be effectively modeled as a token-level value signal, highlighting the potential of LenVM as a general framework for length modeling and as a length-specific value signal that could support future RL training. Code is available at this https URL.
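
The value formulation in this abstract can be made concrete with a small worked example. If every remaining token earns the same negative reward and future rewards are discounted, the return is a geometric series that is bounded and strictly monotone in the number of remaining tokens. The discount factor `gamma`, the reward of -1, and both function names are assumptions for illustration; the abstract does not give the paper's actual constants or normalization.

```python
import math


def length_value(remaining_tokens: int, gamma: float = 0.99, reward: float = -1.0) -> float:
    """Discounted return when each of the remaining tokens earns the same negative reward."""
    # Geometric series: reward * (1 + gamma + ... + gamma**(n - 1))
    return reward * (1.0 - gamma ** remaining_tokens) / (1.0 - gamma)


def remaining_from_value(value: float, gamma: float = 0.99, reward: float = -1.0) -> float:
    """Invert the return to recover the remaining-token count it encodes."""
    return math.log(1.0 - value * (1.0 - gamma) / reward) / math.log(gamma)


values = [length_value(n) for n in (0, 10, 100, 1000)]
lower_bound = -1.0 / (1.0 - 0.99)  # the return never falls below reward / (1 - gamma)
```

Since the mapping from remaining length to value is one-to-one, a predicted value can be converted back into an estimated remaining length, which is one way to read the abstract's "monotone proxy" wording.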

80. 【2604.27037】Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval

Link: https://arxiv.org/abs/2604.27037

Authors: Arne Eichholtz,Yongkang Li,Jutte Vijverberg,Tobias Groot,Mohammad Aliannejadi

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: query-specific neural network, contextualized query embeddings, neural network, replaces the fixed, query-specific neural

Comments: This paper has been accepted as a reproducibility paper at SIGIR 2026

Click to view abstract

Abstract:The Hypencoder, proposed by Killingback et al., is a retrieval framework that replaces the fixed inner-product scoring function used in standard bi-encoders with a query-specific neural network (the $q$-net), whose weights are generated by a hypernetwork from the contextualized query embeddings. This design enables more expressive relevance estimation while preserving independent query and document encoding. In this work, we conduct a reproducibility study of the Hypencoder and extend the original analysis in three directions. Our reproduction confirms that the Hypencoder outperforms a similarly trained bi-encoder baseline on in-domain and out-of-domain benchmarks, and that the proposed efficient search algorithm substantially reduces query latency with minimal performance loss. On hard retrieval tasks, we find partial support: the Hypencoder outperforms the baseline on DL-Hard and FollowIR, but not on TREC TOT, where checkpoint incompatibility and fine-tuning sensitivity complicate full verification. Beyond reproduction, we investigate three extensions: (i) integrating alternative pre-trained encoders into the Hypencoder framework, where we find that performance gains depend on the encoder and fine-tuning strategy; (ii) comparing query latency against a Faiss-based bi-encoder pipeline, revealing that standard bi-encoder retrieval remains faster under both exhaustive and efficient search settings; and (iii) evaluating adversarial robustness, where we find that the $q$-net's non-linear scoring does not provide a consistent robustness disadvantage over inner-product scoring. Our code is publicly available at this https URL.

81. 【2604.27019】Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Link: https://arxiv.org/abs/2604.27019

Authors: Wenhao Lan,Shan Li,Junbin Yang,Haihua Shen,Yijun Yang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Safety-aligned language models, refuse harmful requests, Safety-aligned language, tradeoff remain unclear, broad over-refusal

Comments:

Click to view abstract

Abstract:Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbreak robustness, yet does not explain how dynamic adversarial fine-tuning changes refusal carriers across training. We present a measurement-driven mechanism study, not a new defense, on one 7B backbone under supervised fine-tuning (SFT) and R2D2-style dynamic adversarial fine-tuning. Our protocol aligns fixed-source HarmBench, StrongREJECT, and XSTest with a five-anchor refusal-geometry suite and causal interventions. R2D2 drives fixed-source HarmBench ASR to 0.000 at steps 50 and 100, then partially reopens to 0.035 at step 250 and 0.250 at step 500; SFT remains less robust, with ASR between 0.505 and 0.588 at the same anchors. On XSTest, R2D2 any-refusal is 1.000 early, then falls to 0.664 and 0.228. Geometrically, R2D2 preserves a late-layer admissible carrier through step 100 before relocating to an early-layer carrier, while effective rank remains near 1.23–1.27. Causal interventions indicate low-dimensional but utility-coupled control. These results support a reorganization account rather than a drift-only account, with evidence limited to one backbone and fixed-source attacks.

82. 【2604.26986】BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

Link: https://arxiv.org/abs/2604.26986

Authors: Tosin Adewumi,Martin Karlsson,Lama Alkhaled,Marcus Liwicki

Categories: Computation and Language (cs.CL)

Keywords: digital battery passport, conformance classification, created synthetically, real pilot samples, public benchmark

Comments: 19 pages, 4 figures

Click to view abstract

Abstract:We introduce a novel task of digital battery passport (DBP) conformance classification and present the first public benchmark for the task: BatteryPass-12K, created synthetically from real pilot samples. We do so because the EU's battery regulation on DBPs comes into effect soon and no public dataset exists. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture of experts (MoEs), and dense LLMs. We also conducted further analysis and additional evaluations of few-shot inference and prompt-injection attacks, finding that (1) thinking models have the best performance (with GPT-5.4 scoring 0.98 (0.03) and 0.71 (0.22) on average as F1 (and confidence interval at 95%) on the validation and test sets, respectively), (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model parameters does not necessarily lead to improved performance, as SLMs outperformed some LLMs, and (5) prompt-injection attacks degrade performance. We note that BatteryPass-12K, though limited to real pilot samples, may be useful for other known or emerging tasks in the battery domain, e.g. lifecycle reasoning. We publicly release the dataset under a permissive licence (CC-BY-4.0).

83. 【2604.26962】DeepTutor: Towards Agentic Personalized Tutoring

Link: https://arxiv.org/abs/2604.26962

Authors: Bingxi Zhao,Jiahao Zhang,Xubin Ren,Zirui Guo,Tianzhe Chu,Yi Ma,Chao Huang

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, promising real-world applications, Education represents

Comments: 26 pages, 7 figures, 7 tables. Code available at [this https URL](https://github.com/HKUDS/DeepTutor)

Click to view abstract

Abstract:Education represents one of the most promising real-world applications for Large Language Models (LLMs). However, conventional tutoring systems rely on static pre-training knowledge that lacks adaptation to individual learners, while existing RAG-augmented systems fall short in delivering personalized, guided feedback. To bridge this gap, we present DeepTutor, an agent-native open-source framework for personalized tutoring where every feature shares a common personalization substrate. We propose a hybrid personalization engine that couples static knowledge grounding with dynamic multi-resolution memory, distilling interaction history into a continuously evolving learner profile. Moreover, we construct a closed tutoring loop that bidirectionally couples citation-grounded problem solving with difficulty-calibrated question generation. The personalization substrate further supports collaborative writing, multi-agent deep research, and interactive guided learning, enabling cross-modality coherence. To move beyond reactive interfaces, we introduce TutorBot, a proactive multi-agent layer that deploys tutoring capabilities through extensible skills and unified multi-channel access, providing consistent experience across platforms. To better evaluate such tutoring systems, we construct TutorBench, a student-centric benchmark with source-grounded learner profiles and a first-person interactive protocol that measures adaptive tutoring from the learner's perspective. We further evaluate foundational agentic reasoning abilities across five authoritative benchmarks. Experiments show that DeepTutor improves personalized tutoring quality while maintaining general agentic reasoning abilities. We hope DeepTutor provides unique insights into next-generation AI-powered and personalized tutoring systems for the community.

84. 【2604.28021】Universal statistical laws governing culinary design

Link: https://arxiv.org/abs/2604.28021

Authors: Ganesh Bagler,Gopal Krishna Tewari,Aditya Raj Yadav,Akshat Singh,Pranay Bansal,Ujjval Dargar,Mansi Goel,Madhvi Kumari Sinha

Categories: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)

Keywords: words and syntax, cultural expression, expression of human, human creativity, creativity that transcends

Comments: 48 Pages (28 Pages of Main Manuscript + Supplementary Information), 4 Main Figures, 6 Extended Data Figures

Click to view abstract

Abstract:Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much like languages do through words and syntax. Yet, beneath the apparent diversity of culinary traditions, whether recipes obey statistical laws comparable to those of other symbolic systems remains unknown. Here we analyze a large corpus of traditional recipes spanning global cuisines, annotated using a state-of-the-art named entity recognition algorithm into ingredients, cooking techniques, utensils, and other culinary attributes. We find that ingredient usage exhibits Zipf-like rank-frequency scaling, that culinary diversity grows sublinearly with corpus size in accordance with Heaps' law, and that recipe complexity follows Menzerath-Altmann-type relations between the number and average information of constituent units. Consistent with observations in packaged foods, macronutrient concentrations across recipes also display a log-normal signature. Minimal generative models based on preferential reuse, constrained sampling, and incremental modification recapitulate these regularities, suggesting generic processes that shape recipe architecture across cultures. Together, these findings establish recipes as a compositional symbolic system in which complex structure emerges from simple, constrained generative processes.
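
Two of the statistics named in this abstract, the rank-frequency table behind Zipf-like scaling and the vocabulary-growth curve behind Heaps' law, are simple to compute. The sketch below uses a tiny invented corpus and hypothetical function names; it does not reproduce the paper's data, annotation pipeline, or fitting procedure.

```python
from collections import Counter
from typing import List, Tuple


def rank_frequency(recipes: List[List[str]]) -> List[Tuple[int, str, int]]:
    """Ingredients sorted by descending count, paired with their 1-based rank."""
    counts = Counter(item for recipe in recipes for item in recipe)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, name, count) for rank, (name, count) in enumerate(ordered, start=1)]


def vocabulary_growth(recipes: List[List[str]]) -> List[Tuple[int, int]]:
    """(ingredient mentions so far, distinct ingredients so far) after each recipe."""
    seen, total, curve = set(), 0, []
    for recipe in recipes:
        total += len(recipe)
        seen.update(recipe)
        curve.append((total, len(seen)))
    return curve


# Tiny invented corpus for illustration only.
recipes = [
    ["salt", "onion", "tomato"],
    ["salt", "onion", "garlic"],
    ["salt", "rice"],
    ["salt", "onion", "cumin", "tomato"],
]

zipf_table = rank_frequency(recipes)
heaps_curve = vocabulary_growth(recipes)
```

On log-log axes, Zipf-like scaling shows up as a roughly straight line of count against rank, and Heaps' law as a vocabulary curve that grows more slowly than the total number of mentions.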

Information Retrieval

1. 【2604.28142】Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Link: https://arxiv.org/abs/2604.28142

Authors: Silvio Martinico,Franco Maria Nardini,Cosimo Rulli,Rossano Venturini

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: deployment incurs substantial, incurs substantial computational, fine-grained token-level representations, Multivector retrieval models, memory costs

Comments: 6 pages, 2 figures, SIGIR 2026

Click to view abstract

Abstract:Multivector retrieval models achieve state-of-the-art effectiveness through fine-grained token-level representations, but their deployment incurs substantial computational and memory costs. Current solutions, based on the well-known k-means clustering algorithm, group similar vectors together to enable both effective compression and efficient retrieval. However, standard k-means scales poorly with the number of clusters and dataset size, and favours frequent tokens during training while underrepresenting rare, discriminative ones. In this work, we introduce TACHIOM, a multivector retrieval system that exploits token-level structure to significantly accelerate both clustering and retrieval. By accounting for tokens' distribution during centroid allocation, TACHIOM easily scales to millions of centroids, enabling highly accurate document scoring using only centroids, avoiding expensive token-level computation. TACHIOM combines a graph-based index over centroids with an optimized Product Quantization layout for efficient final scoring. Experiments on MS-MARCOv1 and LoTTE show that TACHIOM achieves up to $247\times$ faster clustering than k-means and up to $9.8\times$ retrieval speedup over state-of-the-art systems while maintaining comparable or superior effectiveness.

2. 【2604.28028】Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

Link: https://arxiv.org/abs/2604.28028

Authors: Smit Jivani,Sarvam Maheshwari,Sunita Sarawagi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: Large language models, query structured data, Large language, allowing users, growing ease

Comments: Project Code: [this https URL](https://github.com/SSLab-CSE-IITB/tecod)

Click to view abstract

3. 【2604.27878】SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions

Link: https://arxiv.org/abs/2604.27878

Authors: Saber Zerhoudi

Categories: Information Retrieval (cs.IR)

Keywords: interactive information retrieval, standardized evaluation tools, community lacks standardized, lacks standardized evaluation, information retrieval

Comments:

Click to view abstract

Abstract:User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator "human-likeness" check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity ($r = +0.09$, $n = 48$), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal ($|r| = 0.43$ and $0.40$, $p \leq 0.005$). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.

4. 【2604.27852】NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains

Link: https://arxiv.org/abs/2604.27852

Authors: Shiyao Peng,Qianhe Zheng,Zhuodi Hao,Zichen Tang,Rongjin Li,Qing Huang,Jiayu Huang,Jiacheng Liu,Yifan Zhu,Haihong E

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: critical oversight persists, Recall Conversion Rate, retrieval quality, Conversion Rate, retrieval quality optimization

Comments: Accepted to WWW 2026

Click to view abstract

Abstract:Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at this https URL.

5. 【2604.27820】ObjectGraph: From Document Injection to Knowledge Traversal -- A Native File Format for the Agentic Era

Link: https://arxiv.org/abs/2604.27820

Authors: Mohit Dubey,Open Gigantic

Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: reader moving linearly, human reader moving, linearly through text, existence was designed, reader moving

Comments: 12 pages, 4 figures, 4 tables

Click to view abstract

Abstract:Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read - they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown - every .md file is a valid .og file - requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held-out document benchmark.

6. 【2604.27790】How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

链接：https://arxiv.org/abs/2604.27790

作者：Riley Grossman,Songjiang Liu,Michael K. Chen,Mike Smith,Cristian Borcea,Yi Chen

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词：search, generative search, Generative, increasingly integrated, web search

备注： Paper Accepted to ACM SIGIR 2026 (49th International ACM SIGIR Conference on Research and Development in Information Retrieval)

点击查看摘要

7. 【2604.27747】Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

链接：https://arxiv.org/abs/2604.27747

作者：Jiaju Chen,Chongming Gao,Chenxiao Fan,Haoyan Liu,Qingpeng Cai,Peng Jiang,Xiangnan He

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词：Large language model, Large language, decoding remains sequential, based generative list-wise, advanced rapidly

备注：

点击查看摘要

Abstract:Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
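
The draft-then-verify round described in the abstract can be illustrated with a minimal greedy sketch. This is not PAD-Rec itself: `verify_draft` and the toy token lists are hypothetical, and exact-match checking against the target model's greedy choice stands in for the rejection-sampling rule that speculative decoding uses to preserve the target distribution when sampling.

```python
from typing import List, Tuple


def verify_draft(draft: List[int], target: List[int]) -> Tuple[List[int], int]:
    """Accept the longest draft prefix that matches the target's greedy tokens.

    `target[i]` is the token the target model would emit at draft position i,
    plus one extra entry for the position after the last draft token, so
    len(target) == len(draft) + 1. Both lists are toy inputs (assumption).
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    # The target's own token at the first mismatch (or the bonus token when
    # every draft token matched) is always emitted, so each round advances
    # by at least one token.
    return draft[:accepted] + [target[accepted]], accepted


tokens, n_accepted = verify_draft([5, 9, 2, 7], [5, 9, 4, 1, 3])
print(tokens, n_accepted)
```

The more draft tokens survive verification, the fewer target-model steps are needed per generated token, which is why the abstract focuses on improving proposal quality at deeper draft steps.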

8. 【2604.27674】One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

链接：https://arxiv.org/abs/2604.27674

作者：Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

关键词：automatic evaluation metrics, pose practical threats, high-dimensional embedding spaces, hubness problem, cross-modal encoders

备注： Accepted at ACL2026 (main)

点击查看摘要

9. 【2604.27600】Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

链接：https://arxiv.org/abs/2604.27600

作者：Xihang Wang,Zihan Wang,Chengkai Huang,Cao Liu,Ke Zeng,Quan Z. Sheng,Lina Yao

类目：Information Retrieval (cs.IR)

关键词：Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large

备注：

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) is widely adopted for Multimodal Large Language Models (MLLMs) with external evidence to reduce hallucinations. Despite its success, most existing MRAG frameworks treat retrieved evidence as indivisible documents, implicitly assuming that all content within a document is equally informative. In practice, however, sometimes only a small fraction of a document is relevant to a given query, while the remaining content introduces substantial noise that may lead to performance degradation. We address this fundamental limitation by reframing MRAG as a fine-grained evidence selection problem. We propose Fragment-level Evidence Selection for RAG (FES-RAG), a framework that selects atomic multimodal fragments rather than entire documents as grounding evidence. FES-RAG decomposes retrieved multimodal documents into sentence-level textual fragments and region-level visual fragments, enabling precise identification of evidence that directly supports generation. To guide fragment selection, we introduce Fragment Information Gain (FIG), a principled metric that measures the marginal contribution of each fragment to the MLLM's generation confidence. Based on FIG, we distill fragment-level utility judgments from a high-capacity MLLM into a lightweight selector, achieving accurate evidence selection with low inference overhead. Experiments on the M2RAG benchmark show that FES-RAG consistently outperforms state-of-the-art document-level MRAG methods, achieving up to 27 percent relative improvement in CIDEr. By selecting fewer yet more informative fragments, our approach substantially reduces context length while improving factual accuracy and generation coherence.

10. 【2604.27599】One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation

链接：https://arxiv.org/abs/2604.27599

作者：Ethan Bito,Yongli Ren,Estrid He

类目：Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：Large language models, Large language, predictions can depend, Large, Rotary Positional Embeddings

备注： Accepted at SIGIR 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for recommendation reranking, but their listwise predictions can depend on the order in which candidates are presented. This creates a mismatch between the set-based nature of recommendation and the sequence-based computation of decoder-only LLMs, where permuting an otherwise identical candidate set can change item scores and final rankings. Such order sensitivity makes LLM-based rerankers difficult to rely on, since rankings may reflect prompt serialization rather than user preference. We propose InvariRank, a permutation-invariant listwise reranking framework that addresses this dependence at the architectural level. InvariRank blocks cross-candidate attention with a structured attention mask and negates position-induced scoring changes through shared positional framing under Rotary Positional Embeddings (RoPE). Combined with a listwise learning-to-rank objective, the model scores all candidates in a single forward pass, avoiding permutation-based invariance training objectives that require multiple permutations of a candidate set. Experiments on recommendation benchmarks show that InvariRank maintains competitive ranking effectiveness while producing stable rankings across candidate permutations. The results suggest that architectural invariance is a practical route to reliable and efficient LLM-based recommendation reranking. The source code is at this https URL.
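
One way to picture the structured attention mask is the sketch below. It assumes a prompt laid out as a shared prefix followed by the candidates, lets each candidate token attend only to the prefix and to earlier tokens of its own candidate, and restarts position ids at the same offset for every candidate. The function `build_mask`, the layout, and the position-id scheme are illustrative assumptions, not the InvariRank implementation.

```python
from typing import List, Tuple


def build_mask(prefix_len: int, cand_lens: List[int]) -> Tuple[List[List[bool]], List[int]]:
    """Return (mask, position_ids) for a prompt [prefix, cand_0, cand_1, ...].

    mask[i][j] is True when token i may attend to token j.
    """
    # segment[i] is -1 for prefix tokens, otherwise the candidate index.
    segment = [-1] * prefix_len
    position_ids = list(range(prefix_len))
    for c, n in enumerate(cand_lens):
        segment += [c] * n
        # Every candidate restarts at the same offset, so its positions do
        # not depend on where it sits in the list (assumed reading of
        # "shared positional framing").
        position_ids += list(range(prefix_len, prefix_len + n))

    total = len(segment)
    mask = [
        [j <= i and (segment[j] == -1 or segment[j] == segment[i]) for j in range(total)]
        for i in range(total)
    ]
    return mask, position_ids


mask, pos = build_mask(prefix_len=2, cand_lens=[2, 3])
print(pos)
```

Under these assumptions, swapping two candidates only permutes rows and columns of the mask, so each candidate sees the same context regardless of its slot.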

11. 【2604.27577】Reproducing Adaptive Reranking for Reasoning-Intensive IR

链接：https://arxiv.org/abs/2604.27577

作者：Mandeep Rathee,V Venktesh,Sean MacAvaney,Avishek Anand

类目：Information Retrieval (cs.IR)

关键词：bounded recall problem, classical cascading pipeline, recall problem, bounded recall, first-stage retriever

备注： 7 figures, 11 pages

点击查看摘要

Abstract:The classical cascading pipeline of retrieve--rerank suffers from a bounded recall problem, stemming from limitations of the first-stage retriever. Most current approaches address the bounded recall problem by improving the first-stage retriever, but this incurs substantial training and inference costs, especially to handle queries that require substantial reasoning. To circumvent the computational costs of reasoning-based retrievers, we replicate the findings of GAR, Graph-based Adaptive Reranking, on the BRIGHT reasoning-intensive retrieval benchmark. GAR addresses the bounded recall problem by modifying the reranking process itself through iterative exploration of a corpus graph, but it was previously only tested on models designed for topical and question-answering-style queries. Hence, we reproduce GAR in reasoning-intensive settings with reasoning and non-reasoning reranking models. We observe that the quality of the reranker's signal plays an important role in identifying additional relevant documents within the corpus graph. Overall, we find that GAR boosts the effectiveness of reasoning-intensive retrieval across a variety of models while contributing minimally to computational overheads. Ultimately, this work enables more practical deployment of retrieval systems that can address reasoning-intensive queries.

12. 【2604.27421】A Reproducibility Study of LLM-Based Query Reformulation

链接：https://arxiv.org/abs/2604.27421

作者：Amin Bigdeli,Radin Hamidi Rad,Hai Son Le,Mert Incesu,Negar Arabzadeh,Charles L. A. Clarke,Ebrahim Bagheri

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：Large Language Models, Large Language, Language Models, studies reporting substantial, reporting substantial effectiveness

备注：

点击查看摘要

13. 【2604.27410】From Unstructured to Structured: LLM-Guided Attribute Graphs for Entity Search and Ranking

链接：https://arxiv.org/abs/2604.27410

作者：Yilun Zhu,Nikhita Vedula,Shervin Malmasi

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：faces unique challenges, Entity search, query entity, product similarity varies, Large Language Model

备注：

点击查看摘要

14. 【2604.27321】Toward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection, Query Generation, and Resolution in Security Operations

链接：https://arxiv.org/abs/2604.27321

作者：Md Hasan Saju,Akramul Azim

类目：Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：Security Operations Centers, Operations Centers, face mounting operational, Security Operations, face mounting

备注：

点击查看摘要

Abstract:Security Operations Centers (SOCs) face mounting operational challenges. These challenges come from increasing threat volumes, heterogeneous SIEM platforms, and time-consuming manual triage workflows. We present an end-to-end threat management framework that integrates ensemble-based detection, syntax-constrained query generation, and retrieval-augmented resolution support to automate critical security workflows. Our detection module evaluates both traditional machine learning classifiers and large language models (LLMs), then combines the three best-performing LLMs to create an ensemble model, achieving 82.8% accuracy while maintaining a false positive rate of 0.120 on SIEM logs. We introduce the SQM (Syntax Query Metadata) architecture for automated evidence collection. It uses platform-specific syntax constraints, metadata-based retrieval, and documentation-grounded prompting to generate executable queries for IBM QRadar and Google SecOps. SQM achieves a BLEU score of 0.384 and a ROUGE-L score of 0.731. These results are more than twice as good as the baseline LLM performance. For incident resolution and recommendation generation, we demonstrate that integrating SQM-derived evidence improves resolution code prediction accuracy from 78.3% to 90.0%, with an overall recommendation quality score of 8.70. In production SOC environments, our framework reduces average incident triage time from hours to under 10 minutes. This work demonstrates that domain-constrained LLM architectures with retrieval augmentation can meet the strict reliability and efficiency requirements of operational security environments at scale.

15. 【2604.27306】NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

链接：https://arxiv.org/abs/2604.27306

作者：Saber Zerhoudi,Michael Granitzer,Jelena Mitrovic

类目：Information Retrieval (cs.IR)

关键词：Retrieval-augmented generation, standard implementations retrieve, fact-based metrics, frequently evaluated, evaluated via fact-based

备注：

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems are frequently evaluated via fact-based metrics, yet standard implementations retrieve passages or static propositions. This unit mismatch between evaluation and retrieval objects hinders maintenance when corpora evolve and fails to capture superseded facts or source disagreements. We propose NuggetIndex, a retrieval system that stores atomic information units as managed records, so-called nuggets. Each record maintains links to evidence, a temporal validity interval, and a lifecycle state. By filtering invalid or deprecated nuggets prior to ranking, the system prevents the inclusion of outdated information. We evaluate the approach using a nuggetized MS MARCO subset, a temporal Wikipedia QA dataset, and a multi-hop QA task. Against passage and unmanaged proposition retrieval baselines, NuggetIndex improves nugget recall by 42%, increases temporal correctness by 9 percentage points without the recall collapse observed in time-filtered baselines, and reduces conflict rates by 55%. The compact nugget format reduces generator input length by 64% while enabling lightweight index structures suitable for browser-based and resource-constrained deployment. We release our implementation, datasets, and evaluation scripts.
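
The managed-record idea (evidence links, a validity interval, a lifecycle state, and filtering before ranking) might look roughly like the sketch below. The `Nugget` fields, the state names, and `filter_valid` are hypothetical names chosen for illustration; the abstract does not specify the actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Nugget:
    text: str
    evidence: List[str]        # links to supporting sources
    valid_from: date
    valid_to: Optional[date]   # None means "still valid"
    state: str = "active"      # assumed lifecycle states: active / deprecated


def filter_valid(nuggets: List[Nugget], as_of: date) -> List[Nugget]:
    """Drop deprecated or out-of-interval nuggets before any ranking step."""
    return [
        n for n in nuggets
        if n.state == "active"
        and n.valid_from <= as_of
        and (n.valid_to is None or as_of < n.valid_to)
    ]


nuggets = [
    Nugget("Population is 1.2M", ["doc-a"], date(2015, 1, 1), date(2021, 1, 1)),
    Nugget("Population is 1.4M", ["doc-b"], date(2021, 1, 1), None),
    Nugget("Population is 2M", ["doc-c"], date(2021, 1, 1), None, state="deprecated"),
]
print([n.text for n in filter_valid(nuggets, date(2024, 6, 1))])
```

Filtering ahead of ranking means a superseded fact never competes for a slot in the generator's context, whatever its similarity score.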

16. 【2604.27244】RAQG-QPP: Query Performance Prediction with Retrieved Query Variants and Retrieval Augmented Query Generation

链接：https://arxiv.org/abs/2604.27244

作者：Fangzheng Tian,Debasis Ganguly,Craig Macdonald

类目：Information Retrieval (cs.IR)

关键词：human-assessed relevance judgements, query-specific selective decision, selective decision making, Query Performance Prediction, Query Performance

备注： Accepted manuscript. 27 pages, 8 figures, 5 tables. To appear in ACM Transactions on Information Systems

点击查看摘要

Abstract:Query Performance Prediction (QPP) estimates the retrieval quality of ranking models without the use of any human-assessed relevance judgements, and finds applications in query-specific selective decision making to improve overall retrieval effectiveness. Although unsupervised QPP approaches are effective for lexical retrieval models, they usually perform worse for neural rankers. Recent work shows that leveraging query variants (QVs), i.e., queries with potentially similar information needs to a given query, can enhance unsupervised QPP accuracy. However, existing QV-based prediction methods rely on query variants generated by term expansion of the input query, which is likely to yield incoherent, hallucinatory and off-topic QVs. In this paper, we propose to make use of queries retrieved from a log of past queries as QVs to be subsequently used for QPP. In addition to directly applying retrieved QVs in QPP, we further propose to leverage large language models (LLMs) to generate QVs conditioned on the retrieved QVs, thus mitigating the limitation of relying only on existing queries in a log. Experiments on TREC DL'19 and DL'20 show that QPP enhanced with RAQG outperforms the best-performing existing QV-based prediction approach by as much as 30% on neural ranking models such as MonoT5.

17. 【2604.27131】LLM-Enhanced Topical Trend Detection at Snapchat

链接：https://arxiv.org/abs/2604.27131

作者：Hangqi Zhao,Jay Li,Abhiruchi Bhattacharya,Cong Ni,Jason Yeung,Jinchao Ye,Kai Yang,Akshat Malu,Manish Malik

类目：Information Retrieval (cs.IR)

关键词：dynamic content ecosystem, social media platforms, Automatic detection, challenging and essential, essential for maintaining

备注：

点击查看摘要

Abstract:Automatic detection of topical trends at scale is both challenging and essential for maintaining a dynamic content ecosystem on social media platforms. In this work, we present a large-scale system for identifying emerging topical trends on Snapchat, one of the world's largest short-video social platforms. Our system integrates multimodal topic extraction, time-series burst detection, and LLM-based consolidation and enrichment to enable accurate and timely trend discovery. To the best of our knowledge, this is the first published end-to-end system for topical trend detection on short-video platforms at production scale. Continuous offline human evaluation over six months demonstrates high precision in identifying meaningful trends. The system has been deployed in production at global scale and applied to downstream surfaces including content ranking and search, driving measurable improvements in content freshness and user experience.

18. 【2604.27117】A Gated Hybrid Contrastive Collaborative Filtering Recommendation

链接：https://arxiv.org/abs/2604.27117

作者：Eduardo Ferreira da Silva,Mayki dos Santos Oliveira,Joel Machado Pires,Denis Dantas Boaventura,Maycon Maciel Peixoto,Cassio Serafim Prazeres,Gustavo Bittencourt Figueiredo,Miriam Capretz,Frederico Araujo Durão

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词：Recommender systems increasingly, systems increasingly incorporate, increasingly incorporate textual, incorporate textual reviews, Recommender systems

备注：

点击查看摘要

Abstract:Recommender systems increasingly incorporate textual reviews to enrich user and item representations. However, most review-aware models remain optimized for rating prediction rather than ranking quality. This misalignment limits their effectiveness in top-N recommendation scenarios, where discriminative ranking is essential. To address this gap, we propose a Gated Hybrid Collaborative Filtering framework that integrates review-derived representations into an autoencoder-based collaborative model. The architecture injects semantic signals layer-wise through an adaptive gating mechanism that dynamically balances collaborative embeddings and topic-based features during encoding. To further refine the latent space, we introduce a contrastive learning module that aligns semantic and collaborative signals. We evaluate the framework across five distinct configurations: Pure collaborative; Topic and Gated; Text and Gated; and the addition of contrastive objectives (Contrastive and Topic, and Contrastive and Text). To explicitly optimize ranking behavior, the model is trained with a pairwise Bayesian personalized ranking objective, which promotes separation between relevant and non-relevant items in the latent space. Experiments on Amazon Movies & TV, IMDb, and Rotten Tomatoes demonstrate consistent improvements in hit rate @10 and normalized discounted cumulative gain @10 over state-of-the-art review-aware baselines. Results highlight the importance of controlled semantic fusion for ranking-driven recommendation.
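
The pairwise Bayesian personalized ranking objective mentioned above has a standard per-triple form, the negative log-sigmoid of the score margin between a relevant and a non-relevant item. The sketch below shows only that term; the function name `bpr_loss` is illustrative, regularization is omitted, and how the paper combines it with the gating and contrastive components is not stated in the abstract.

```python
import math


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """-log(sigmoid(pos_score - neg_score)) for one (user, positive, negative) triple."""
    margin = pos_score - neg_score
    # log(1 + exp(-margin)) written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(bpr_loss(2.0, 0.5))   # relevant item scored higher: small loss
print(bpr_loss(0.5, 2.0))   # relevant item scored lower: larger loss
```

The loss shrinks as the relevant item is scored further above the non-relevant one, which is the separation effect the abstract refers to.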

19. 【2604.27037】Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval

链接：https://arxiv.org/abs/2604.27037

作者：Arne Eichholtz,Yongkang Li,Jutte Vijverberg,Tobias Groot,Mohammad Aliannejadi

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：query-specific neural network, contextualized query embeddings, neural network, replaces the fixed, query-specific neural

备注： This paper has been accepted as a reproducibility paper at SIGIR 2026

点击查看摘要

20. 【2604.26996】LUCid: Redefining Relevance For Lifelong Personalization

链接：https://arxiv.org/abs/2604.26996

作者：Chimaobi Okite,Anika Misra,Joyce Chai,Rada Mihalcea

类目：Information Retrieval (cs.IR)

关键词：topically unrelated interactions, lifelong personalization operationalize, miss essential user, semantic proximity, approaches to lifelong

备注： first version

点击查看摘要

Abstract:Current approaches to lifelong personalization operationalize relevance through semantic proximity, causing them to miss essential user information from topically unrelated interactions. To address this gap, we introduce LUCid, a benchmark designed to measure situational user-centric relevance in personalization. The benchmark consists of 1,936 realistic queries paired with interaction histories from up to 500 sessions. Across multiple architectures, our experiments show significant performance collapse when relevant context must be surfaced from semantically distant history: retrieval recall drops to near zero on the hardest instances, and response alignment remains near 50% even for state-of-the-art models such as Gemini-3-Flash, GPT-5.4, and Claude Haiku. These results expose a fundamental mismatch between the notion of relevance encoded by current systems and the situational relevance required for personalization, with direct implications for robustness and safety when critical user attributes remain undetected. LUCid enables the systematic evaluation of whether current models can surface situationally-relevant user information from previous interactions, and serves as a step toward realigning personalization with user-centered relevance.

21. 【2604.26983】Value-Aware Product Recommendation by Customer Segmentation using a suitable High-Dimensional Similarity Measure

链接：https://arxiv.org/abs/2604.26983

作者：María Florencia Acosta,Rodrigo García Arancibia,Pamela Llop,Mariel Lovatto,Lucas Mansilla

类目：Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词：paper presents, value-aware approach, simultaneously addresses, addresses the high, high dimensionality

备注：

点击查看摘要

Abstract:This paper presents a novel value-aware approach to product recommendation that simultaneously addresses the high dimensionality and sparsity of user-item data while explicitly incorporating the contribution of each product and user to overall sales revenue. The proposed framework encodes revenue contributions in the user-item matrix and computes customer similarity directly on this basis using suitable distance measures. This enables the segmentation of users according to the revenue-based similarity of their purchase baskets and supports recommendations aligned with profitability objectives. We compare conventional similarity metrics with a novel alternative tailored to high-dimensional contexts and propose three recommendation strategies based on revenue share, product popularity, and expected profit generation. The effectiveness of the proposed method is validated through simulation experiments and a real-world application using the UCI Online Retail dataset.

22. 【2604.26981】Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model

链接：https://arxiv.org/abs/2604.26981

作者：Shawqi Al-Maliki,Ammar Gharaibeh,Mohamed Rahouti,Mohammad Ruhul Amin,Mohamed Abdallah,Junaid Qadir,Ala Al-Fuqaha

类目：Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词：Large Language Models, natural language processing, Large Language, language processing, natural language

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized the field of natural language processing. However, they exhibit some limitations, including a lack of reliability and transparency: they may hallucinate and fail to provide sources that support the generated output. Retrieval-Augmented Generation (RAG) was introduced to address such limitations in LLMs. One popular implementation, RAG-as-a-Service (RaaS), has shortcomings that hinder its adoption and accessibility. For instance, RaaS pricing is based on the number of submitted prompts, without considering whether the prompts are enriched by relevant chunks, i.e., text segments retrieved from a vector database, or the quality of the utilized chunks (i.e., their degree of relevance). This results in an opaque and less cost-effective payment model. We propose Chunk-as-a-Service (CaaS) as a transparent and cost-effective alternative. CaaS includes two variants: Open-Budget CaaS (OB-CaaS) and Limited-Budget CaaS (LB-CaaS), which is enabled by our "Utility-Cost Online Selection Algorithm (UCOSA)". UCOSA further extends the cost-effectiveness and the accessibility of the OB-CaaS variant by enriching, in an online manner, a subset of the submitted prompts based on budget constraints and utility-cost tradeoff. Our experiments demonstrate the efficacy of the proposed UCOSA compared to both offline and relevance-greedy selection baselines. In terms of the performance metric, the number of enriched prompts (NEP) multiplied by the Average Relevance (AR), UCOSA outperforms random selection by approximately 52% and achieves around 75% of the performance of offline selection methods. Additionally, in terms of budget utilization, LB-CaaS and OB-CaaS achieve higher performance-to-budget ratios of 140% and 86%, respectively, compared to RaaS, indicating their superior efficiency.
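
The performance metric in the abstract is the number of enriched prompts (NEP) multiplied by the Average Relevance (AR). A minimal sketch of that arithmetic follows, assuming each enriched prompt carries a single relevance score; the function name and the example scores are hypothetical and not taken from the paper.

```python
from typing import List


def nep_times_ar(relevances: List[float]) -> float:
    """relevances: one relevance score per enriched prompt (assumed representation)."""
    nep = len(relevances)            # number of enriched prompts
    if nep == 0:
        return 0.0
    ar = sum(relevances) / nep       # average relevance of the chunks used
    return nep * ar


few_good = nep_times_ar([0.9, 0.9])              # two prompts, highly relevant chunks
many_fair = nep_times_ar([0.6, 0.6, 0.6, 0.6])   # four prompts, weaker chunks
print(few_good, many_fair)
```

Because the product rewards both coverage and relevance, a selector working under a budget has to trade off enriching more prompts against spending on more relevant chunks.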

23. 【2604.26971】T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

链接：https://arxiv.org/abs/2604.26971

作者：Yousouf Taghzouti(ICN, WIMMICS, Laboratoire I3S - SPARKS),Tao Jiang(ICN),Camille Juigné(WIMMICS, Laboratoire I3S - SPARKS),Benjamin Navet(ICN, WIMMICS, Laboratoire I3S - SPARKS),Fabien Gandon(WIMMICS, Laboratoire I3S - SPARKS),Franck Michel(Laboratoire I3S - SPARKS, WIMMICS),Louis-Felix Nothias(ICN)

类目：Information Retrieval (cs.IR)

关键词：suffered from fragmentation, limited reproducibility, historically suffered, SPARQL query generation, SPARQL query

备注：

点击查看摘要

Abstract:The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, neglecting syntactic validity, semantic faithfulness, execution correctness, results ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides a broad and extensible set of over 20 evaluation metrics, collected from the literature and practical evaluation needs, spanning lexical, syntactic, semantic, structural, execution-based and ranking-based dimensions. These include query-based metrics such as token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics (SP-BLEU, SP-F1); graph- and URI-based exact match metrics; as well as answer set-based metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. Taking inspiration from the ir-metrics library for Information Retrieval, t2s-metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQL-based QA systems. We argue that t2s-metrics constitutes a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs and facilitates deeper diagnostic insights into system behavior beyond answer correctness.
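
As a concrete instance of the simplest metric family listed above, token-level Precision, Recall, and F1 between a generated and a reference query can be computed as below. This is a generic sketch using whitespace tokenization and multiset overlap; it does not use or reproduce the t2s-metrics API, and `token_prf` is a hypothetical helper.

```python
from collections import Counter
from typing import Tuple


def token_prf(predicted: str, reference: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 over whitespace tokens."""
    pred, ref = Counter(predicted.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())   # multiset intersection size
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


print(token_prf("SELECT ?x WHERE { ?x a :City }",
                "SELECT ?x WHERE { ?x a :Town }"))
```

The two queries differ in a single class name yet score highly here, which illustrates why lexical metrics alone say little about execution correctness.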

24. 【2604.26970】Not All Memories Age the Same: Autodiscovery of Adaptive Decay in Knowledge Graphs

链接：https://arxiv.org/abs/2604.26970

作者：Mandar Karhade

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

关键词：treat all facts, facts as equally, decay, Knowledge, retrieval treat

备注： 27 pages, 2 figures, 19 tables (including appendix). Preprint under review

点击查看摘要

Abstract:Knowledge graphs used for retrieval treat all facts as equally current. Existing temporal approaches apply uniform decay, using a single forgetting curve regardless of knowledge type. We show this is fundamentally misspecified: different knowledge types exhibit different temporal dynamics, and the core retrieval problem is not latency or throughput but identifying what is important at query time. We propose a hierarchical framework that replaces uniform decay with a continuous decay surface parameterized by two orthogonal signals: velocity (how frequently a concept is observed) and volatility (how much the value changes between observations, measured via embedding distance). The decay surface is decomposed into three learnable levels: domain-level parameters capture universal patterns (some predicates are inherently permanent, others inherently transient), context-level parameters capture setting-dependent variation, and entity-level adaptation personalizes decay to specific subjects. All parameters emerge from data through survival analysis on observed value lifetimes, requiring no predefined taxonomies or domain expertise. We formulate edge lifetime as a survival problem where the event is value supersession (a meaningfully different value replacing the current one), distinct from mere re-observation. Experiments on synthetic temporal knowledge graphs demonstrate recovery of planted hierarchical parameters (HDBSCAN ARI = 1.0). Validation on 107 Wikipedia articles and 1,163 patient records from the Synthea clinical EHR simulator shows that velocity-volatility clusters emerge naturally, align with observable persistence patterns, and near-universally exhibit the Lindy effect (Weibull shape k < 1). Uniform decay performs 18x worse than no temporal weighting. Heterogeneous decay recovers from this, with each hierarchy level contributing measurable improvement.
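
The link between the Lindy effect and the Weibull shape parameter can be checked numerically. For a Weibull lifetime with shape k and scale λ, the hazard rate is h(t) = (k/λ)(t/λ)^(k-1), which decreases with age when k is below 1. The sketch below uses illustrative parameter values only; none are taken from the paper.

```python
def weibull_hazard(t: float, k: float, lam: float = 1.0) -> float:
    """Hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1) for t > 0."""
    return (k / lam) * (t / lam) ** (k - 1)


# Shape below 1: hazard falls with age (Lindy-like behaviour).
# Shape equal to 1: constant hazard. Shape above 1: hazard rises with age.
for k in (0.5, 1.0, 2.0):
    print(k, [round(weibull_hazard(t, k), 3) for t in (1.0, 4.0, 16.0)])
```

With a shape below 1, a value that has already persisted for a long time is less likely to be superseded in the next interval, which is why a single uniform forgetting curve fits such facts poorly.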

25. 【2604.26969】AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

链接：https://arxiv.org/abs/2604.26969

作者：Xidong Wu,Yue Zhuan,Ruoqiao Wei,Hangxin Chen,Di Bai,Jintao Liu,Xinyi Wang,Xue Wang,Luoshu Wang,Xinwu Cheng

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词：Modern large-scale recommendation, Modern large-scale, large-scale recommendation systems, multi-stage pipelines, re-ranking phases

备注：

点击查看摘要

Abstract:Modern large-scale recommendation systems are typically constructed as multi-stage pipelines, encompassing pre-ranking, ranking, and re-ranking phases. While traditional recommendation research typically focuses on optimizing a specific model, such as improving the pre-ranking model structure or the training algorithm of ranking models, system-level configuration optimization, which integrates the output of each model head to obtain the final score in each stage, also plays a crucial role. Due to the complexity of the system, configuration optimization is both highly important and challenging. Any model modification requires new optimal system-level configurations, but each experimental iteration requires significant tuning effort. Furthermore, the models in different stages operate within distinct contexts and optimize for different targets, requiring specialized domain expertise. In addition, optimization success depends on balancing multiple competing online metrics and on alignment with shifting production development objectives. To address these challenges, we propose AgenticRecTune, an agentic framework comprising five specialized agents, Actor, Critic, Insight, Skill, and Online, designed to manage the end-to-end configuration optimization workflow. By leveraging the advanced reasoning of Large Language Models (LLMs), specifically Gemini, AgenticRecTune explores the optimal configuration spaces. The Actor Agent proposes multiple candidates and the Critic Agent filters out suboptimal ones. The Online Agent autonomously prepares A/B tests based on the configuration set proposed by the Critic Agent and captures the subsequent experimental results. We also introduce a self-evolving Skillhub, which uses a collaboration between the Insight Agent and the Skill Agent to summarize historical results, extract the underlying mechanics of each task in the recommendation system, and update skills.

26. 【2604.26953】A Randomized Controlled Trial and Pilot of Scout: an LLM-Based EHR Search and Synthesis Platform

链接：https://arxiv.org/abs/2604.26953

作者：Michael Gao,Suresh Balu,William Knechtle,Kartik Pejavara,William Jeck,Matthew Ellis,Jason Thieling,Blake Cameron,Jason Tatreau,Tareq Aljurf,Henry Foote,Michael Revoir,Marshall Nichols,Matthew Gardner,William Ratliff,Bradley Hintze,Angelo Milazzo,Sreekanth Vemulapalli

类目：Information Retrieval (cs.IR); Computers and Society (cs.CY)

关键词：Electronic Health Records, Health Records, Electronic Health, retrieval within Electronic, contribute substantially

备注：

点击查看摘要

Abstract:Clinical documentation and data retrieval within Electronic Health Records (EHRs) contribute substantially to clinician workload and burnout. To address this, we developed Scout, an LLM-based EHR search and synthesis platform that enables clinicians to query EHR data using natural language. Each response includes citations linking each claim to the original data source, facilitating easy verification of generated content. We conducted a prospective randomized, evaluator-blinded crossover trial across seven clinical specialties (20 participants, 200 structured cases). Participants completed realistic clinical tasks using either Scout or the EHR alone, with outcomes including time to completion, NASA Task Load Index workload scores, and blinded expert adjudication of accuracy, completeness, and relevance. Scout reduced task completion time by 37.6% and significantly decreased perceived workload, with the largest reductions in mental demand, effort, and temporal demand. Non-inferiority analyses showed that tasks completed with Scout maintained accuracy, completeness, and relevance relative to tasks completed with the EHR alone. A concurrent pilot deployment across over 200 users and more than 20 specialties generated over 6,600 interactions in three months, revealing diverse clinical and administrative use cases. Automated evaluation using an LLM-as-judge framework identified errors at low rates. Subsequent manual review of a subset of outputs revealed that most claims flagged by the automated judge as errors were in fact supported by the patient chart, demonstrating the importance of human validation. These findings provide early trial-based evidence that LLM-powered EHR tools can meaningfully reduce clinical and administrative workloads while maintaining output quality.

计算机视觉

1. 【2604.28197】OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

链接：https://arxiv.org/abs/2604.28197

作者：Junyoung Lee,Sookwan Han,Jeonghwan Kim,Inhee Lee,Mingi Choi,Jisoo Kim,Wonjung Woo,Hanbyul Joo

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：sequential settings, studied primarily, primarily in dyadic, dyadic or sequential, regime experimentally tractable

备注： Project Page: [this https URL](https://junc0ng.github.io/omnirobothome)

点击查看摘要

Abstract:Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multiple humans and robots share a workspace, acting concurrently on interleaved subtasks with tight spatial and temporal coupling. This regime remains underexplored because close-proximity interaction between humans, robots, and objects creates persistent occlusion and rapid state changes, making reliable real-time 3D tracking the central bottleneck. No existing platform provides the real-time, occlusion-robust, room-scale perception needed to make this regime experimentally tractable. We present OmniRobotHome, the first room-scale residential platform that unifies wide-area real-time 3D human and object perception with coordinated multi-robot actuation in a shared world frame. The system instruments a natural home environment with 48 hardware-synchronized RGB cameras for markerless, occlusion-robust tracking of multiple humans and objects, temporally aligned with two Franka arms that act on live scene state. Continuous capture within this consistent frame further supports long-horizon human behavior modeling from accumulated trajectories. The platform makes the multiadic collaboration regime experimentally tractable. We focus on two central problems: safety in shared human-robot environments and human-anticipatory robotic assistance, and show that real-time perception and accumulated behavior memory each yield measurable gains in both.

2. 【2604.28196】HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

链接：https://arxiv.org/abs/2604.28196

作者：Xin Zhou,Dingkang Liang,Xiwu Chen,Feiyang Tan,Dingyuan Zhang,Hengshuang Zhao,Xiang Bai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：simulating environmental dynamics, environmental dynamics, Large Language Models, pivotal technology, technology for autonomous

备注： Extended version of ICCV 25 paper HERMES, Code: [this https URL](https://github.com/H-EmbodVis/HERMESV2) , Project page: [this https URL](https://h-embodvis.github.io/HERMESV2/)

点击查看摘要

Abstract:Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at this https URL.

3. 【2604.28193】Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

链接：https://arxiv.org/abs/2604.28193

作者：Vinayak Gupta,Chih-Hao Lin,Shenlong Wang,Anand Bhattad,Jia-Bin Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：images remains challenging, remains challenging, challenging under real-world, Reconstructing, unposed images remains

备注： Project Page: [this https URL](https://genwildsplat.github.io/)

点击查看摘要

Abstract:Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization using appearance embeddings or dynamic masks, which requires extensive per-scene training and fails under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on the PhotoTourism and MegaScenes benchmarks demonstrate state-of-the-art feed-forward rendering quality, achieving real-time inference without test-time optimization.

4. 【2604.28192】LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

链接：https://arxiv.org/abs/2604.28192

作者：Hao Chen,Jiaming Liu,Zhonghao Yan,Nuowei Han,Renrui Zhang,Chenyang Gu,Jialin Gao,Ziyu Guo,Siyuan Qian,Yinxi Wang,Peng Jia,Chi-Wing Fu,Shanghang Zhang,Pheng-Ann Heng

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：complex robotic manipulation, increasingly incorporated reasoning, models have increasingly, robotic manipulation, increasingly incorporated

备注：

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present LaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.

5. 【2604.28190】Representation Fréchet Loss for Visual Generation

链接：https://arxiv.org/abs/2604.28190

作者：Jiawei Yang,Zhengyang Geng,Xuan Ju,Yonglong Tian,Yue Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：long considered impractical, show that Fréchet, Fréchet Distance, long considered, considered impractical

备注： Code and checkpoints are available at [this https URL](https://github.com/Jiawei-Yang/FD-loss)

点击查看摘要

Abstract:We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population size for FD estimation (e.g., 50k) from the batch size for gradient computation (e.g., 1024). We term this approach FD-loss. Optimizing FD-loss reveals several surprising findings. First, post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves 0.72 FID on ImageNet 256x256. Second, the same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training or per-sample targets. Third, FID can misrank visual quality: modern representations can yield better samples despite worse Inception FID. This motivates FDr$^k$, a multi-representation metric. We hope this work will encourage further exploration of distributional distances in diverse representation spaces as both training objectives and evaluation metrics for generative models.
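
For readers unfamiliar with the quantity being optimized, here is a minimal sketch of the Fréchet distance between two Gaussians fitted to feature sets. It assumes diagonal covariances purely to stay dependency-free; the standard FID uses full covariance matrices and a matrix square root, and this sketch does not reproduce the paper's FD-loss or its decoupling of population size from batch size. The function names are illustrative, not taken from the released code.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    The general formula ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})
    reduces to a per-dimension sum when both covariances are diagonal.
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term
```

The statistics on one side (for example, the real-data features) can be computed once over a large population, which is the kind of separation between estimation size and batch size that the abstract alludes to.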

6. 【2604.28185】Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

链接：https://arxiv.org/abs/2604.28185

作者：Keming Wu,Zuhao Yang,Kaichen Zhang,Shizun Wang,Haowei Zhu,Sicong Leng,Zhongyu Yang,Qijie Wang,Sudong Wang,Ziting Wang,Zili Wang,Hui Zhang,Haonan Wang,Hang Zhou,Yifan Pu,Xingxuan Li,Fangneng Zhan,Bo Li,Lidong Bing,Yuxin Song,Ziwei Liu,Wenhu Chen,Jingdong Wang,Xinchao Wang,Xiaojuan Qi,Shijian Lu,Bin Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent visual generation, made major progress, Recent visual, persistent state, long-horizon consistency

备注： Project Page: [this https URL](https://github.com/EvolvingLMMs-Lab/Evolving-Visual-Generation)

点击查看摘要

Abstract:Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

7. 【2604.28179】Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy

链接：https://arxiv.org/abs/2604.28179

作者：Andrea Dunn Beltran,Daniel Rho,Aarav Mehta,Xinqi Xiong,Raúl San José Estépar,Ron Alterovitz,Marc Niethammer,Roni Sengupta

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Bronchoscopic navigation relies, Bronchoscopic navigation, divergence that limits, navigation relies, respiratory motion deforms

备注：

点击查看摘要

Abstract:Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creating CT-to-body divergence that limits localization accuracy. In practice, this is mitigated through breath-hold protocols, which attempt to match the intraoperative anatomy to a static CT, but are difficult to reproduce and disrupt clinical workflow. We propose to eliminate the need for breath-hold protocols by leveraging patient-specific respiratory modeling. Paired inhale-exhale CT scans, already acquired for planning, implicitly define the patient-specific deformation space of the breathing airway. By registering these scans, we reduce respiratory motion to a single scalar breathing phase per frame, constraining all reconstructions to anatomically observed configurations. We embed this representation within a mesh-anchored Gaussian splatting framework, where a lightweight estimator infers breathing phase directly from endoscopic RGB, enabling continuous, deformation-aware reconstruction throughout the respiratory cycle without breath-holds or external sensing. To enable quantitative evaluation, we introduce RESPIRE, a physically grounded bronchoscopy simulation pipeline with per-frame ground truth for geometry, pose, breathing phase, and deformation. Experiments on RESPIRE show that our approach achieves geometrically faithful reconstruction, over 20x faster training, and 1.22 mm target localization accuracy (within the clinically relevant 3 mm tolerance), outperforming unconstrained single-CT baselines. Please check out our website for additional visuals: this https URL
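
The idea of collapsing respiratory motion to one scalar phase can be illustrated with a small sketch. It assumes, purely for illustration, that the airway mesh at a given phase is a linear blend of registered exhale and inhale vertex positions with one-to-one vertex correspondence; the abstract does not state the actual parameterization, and `blend_airway` is a hypothetical helper name.

```python
def blend_airway(inhale_vertices, exhale_vertices, phase):
    """Interpolate vertex positions between exhale (phase 0) and inhale (phase 1).

    Assumes both meshes list corresponding vertices in the same order, as
    would be the case after registering the two CT scans.
    """
    if not 0.0 <= phase <= 1.0:
        raise ValueError("phase must lie in [0, 1]")
    if len(inhale_vertices) != len(exhale_vertices):
        raise ValueError("meshes must share vertex correspondence")
    return [
        tuple((1.0 - phase) * e + phase * i for e, i in zip(ex, inh))
        for ex, inh in zip(exhale_vertices, inhale_vertices)
    ]
```

Under this simplification every reconstructed shape lies on the segment between the two observed configurations, which is one way to read "constraining all reconstructions to anatomically observed configurations".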

8. 【2604.28177】AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

链接：https://arxiv.org/abs/2604.28177

作者：Bo Zhang,Tzu-Yen Ma,Zichen Tang,Junpeng Ding,Zirui Wang,Yizhuo Zhao,Peilin Gao,Zijie Xi,Zixin Ding,Haiyang Sun,Haocheng Gao,Yuan Liu,Liangjia Wang,Yiling Huang,Yujie Wang,Yuyue Zhang,Ronghui Xi,Yuanze Li,Jiacheng Liu,Zhongjun Yang,Haihong E

类目：Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词：Diverse Forgery Simulations, Evaluating forensic analysis, AI-Generated academic ImageS, analysis of AI-Generated, Multi-Dimensional Forensic Evaluation

备注： Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
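
The localization figures above are reported as IoU. As a reminder of what that metric measures, here is a minimal sketch of intersection-over-union for two binary masks stored as nested lists; the empty-mask convention is an assumption of this sketch, and the benchmark's own evaluation code may differ.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union
```

For example, a predicted region that overlaps the ground-truth region in one pixel out of three covered in total scores 1/3.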

9. 【2604.28173】Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

链接：https://arxiv.org/abs/2604.28173

作者：Genki Kinoshita,Shu Nakamura,Ryo Kawahara,Shohei Nobuhara,Yasutomo Kawanishi,Ko Nishino

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Effective human behavior, Action Atoms, Action Motifs, Effective human, Action

备注： to be published in CVPR 2026 (Highlight)

点击查看摘要

Abstract:Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms that capture the atomic joint movements and Action Motifs that are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multi-view human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Motifs, which significantly benefit human behavior modeling tasks including action recognition, motion prediction, and motion interpolation.

10. 【2604.28169】PhyCo: Learning Controllable Physical Priors for Generative Motion

链接：https://arxiv.org/abs/2604.28169

作者：Sriram Narayanan,Ziyu Jiang,Srinivasa Narasimhan,Manmohan Chandraker

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：collisions lack realistic, lack realistic rebound, material responses seldom, responses seldom match, Modern video diffusion

备注： CVPR 2026. Project Page: [this https URL](https://phyco-video.github.io/)

点击查看摘要

Abstract:Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.

11. 【2604.28159】Continuous-tone Simple Points: An $\ell_0$-Norm of Cyclic Gradient for Topology-Preserving Data-Driven Image Segmentation

链接：https://arxiv.org/abs/2604.28159

作者：Wenxiao Li,Faqiang Wang,Yuping Duan,Li Cui,Liqiang Zhang,Jun Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：ensuring geometric plausibility, Topological features play, image analysis tasks, features play, play an essential

备注：

点击查看摘要

Abstract:Topological features play an essential role in ensuring geometric plausibility and structural consistency in image analysis tasks such as segmentation and skeletonization. However, integrating topology-preserving learning based on simple points into deep learning tasks remains challenging, as existing simple point detection methods are confined to binary images and are non-differentiable, rendering them incompatible with gradient-based optimization in modern deep learning. Moreover, morphological and purely data-driven approaches often fail to guarantee topological consistency. To address these limitations, we propose a novel method that directly computes simple points on continuous-valued images, enabling differentiable topological inference. Building on this theory, we develop an efficient skeleton extraction algorithm that preserves topological structures in binary and continuous-valued images. Furthermore, we design a variational model that enforces topological constraints by preserving topologically non-removable (i.e., non-simple) points, which can be seamlessly integrated into any deep neural network segmentation with softmax or sigmoid outputs. Experimental results demonstrate that the proposed approach effectively improves topological integrity and structural accuracy across multiple benchmarks. The codes are available at this https URL.

12. 【2604.28136】Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

链接：https://arxiv.org/abs/2604.28136

作者：Furkan Kınlı

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Night Photography Rendering, Night Photography, Photography Rendering, point light sources, severely dark regions

备注： 6 pages, 3 figures, Accepted to 2026 IEEE International Conference on Image Processing

点击查看摘要

Abstract:Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from concurrent capture of severely dark regions alongside intense point light sources. Existing methods, which are mainly tailored for fidelity metrics, reveal considerable perceptual gaps and often detract from visual quality. We introduce pHVI-ISPNet, a novel RAW-to-RGB framework built on the robust HVI color space. Our network integrates four key refinements: RAW-domain feature processing and Wavelet-based feature propagation to mitigate high-frequency detail loss; sample-based dynamic loss coefficients to ensure stable learning across varying exposure levels; and a loss term based on feature distributions to maintain rigorous color constancy. Evaluations on the dataset introduced in the NTIRE 2025 challenge on NPR confirm our approach achieves competitive fidelity while establishing new state-of-the-art results in both CIE2000 color difference and LPIPS. This validates our perceptually-driven design for high-quality nighttime imaging.

13. 【2604.28134】3D-ReGen: A Unified 3D Geometry Regeneration Framework

链接：https://arxiv.org/abs/2604.28134

作者：Geon Yeong Park,Roman Shapovalov,Rakesh Ranjan,Jong Chul Ye,Andrea Vedaldi,Thu Nguyen-Phuoc

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：problem of regenerating, Abstract, images, initial, regenerating

备注： 32 pages, 18 figures, 6 tables. Includes Appendix

点击查看摘要

Abstract:We consider the problem of regenerating 3D objects from 2D images and initial 3D shapes. Most 3D generators operate in a one-shot fashion, converting text or images to a 3D object with limited controllability. We introduce instead 3D-ReGen, a 3D regenerator that is conditioned on an initial 3D shape. This conceptually simple formulation allows us to support numerous useful tasks, including 3D enhancement, reconstruction, and editing. 3D-ReGen uses a new conditioning mechanism based on VecSet, which allows the regenerator to update or improve the input geometry with consistent fine-grained details. 3D-ReGen learns a widely applicable regeneration prior from off-the-shelf 3D datasets via self-supervised pretext tasks and augmentations, without additional annotations. We evaluate both the geometric consistency and fine-grained quality of 3D-ReGen, achieving state-of-the-art performance in controllable 3D generation across several tasks.

14. 【2604.28130】MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

链接：https://arxiv.org/abs/2604.28130

作者：Kehong Gong,Zhengyu Wen,Dao Thien Phong,Mingxi Xu,Weixia He,Qi Wang,Ning Zhang,Zhengyu Li,Guanli Hou,Dongze Lian,Xiaoyu He,Mingyuan Zhang,Hanwang Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：arbitrary-skeleton motion capture, stage recovers joint, monocular video follow, joint positions, network predicts joint

备注： Project page: [this https URL](https://animotionlab.github.io/MoCapAnythingV2/)

点击查看摘要

Abstract:Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: this https URL
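
The bone-axis twist ambiguity mentioned in this abstract can be checked numerically. The sketch below uses Rodrigues' rotation formula (a standard result, not the paper's method) to show that rotating a bone about its own axis leaves the child joint position unchanged, so positions alone cannot pin down that rotation. The helper name and the example vectors are illustrative assumptions.

```python
import math


def rotate_about_axis(v, axis, angle):
    """Rotate vector v about a unit-length axis by angle (radians).

    Rodrigues' formula: v cos(a) + (k x v) sin(a) + k (k . v) (1 - cos(a)).
    """
    kx, ky, kz = axis
    vx, vy, vz = v
    c, s = math.cos(angle), math.sin(angle)
    dot = kx * vx + ky * vy + kz * vz
    cross = (ky * vz - kz * vy, kz * vx - kx * vz, kx * vy - ky * vx)
    return tuple(
        vi * c + ci * s + ki * dot * (1.0 - c)
        for vi, ci, ki in zip(v, cross, axis)
    )
```

With a bone along the x-axis, twisting by 90 degrees maps the child offset `(2, 0, 0)` to itself, while a point off the axis such as `(0, 1, 0)` moves to `(0, 0, 1)`. The joint position is blind to the twist even though the attached geometry is not.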

15. 【2604.28126】AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation

链接：https://arxiv.org/abs/2604.28126

作者：Xu Wang,Zexian Li,Litong Gong,Tiezheng Ge,Zhijie Deng

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Diffusion models offer, Distribution Matching Distillation, extensive sampling steps, Diffusion models

备注：

点击查看摘要

Abstract:Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degradation remains pronounced when sampling steps are limited. Reinforcement learning (RL) has been leveraged to improve the few-step generation quality during distillation, with the potential to even surpass the performance of the teacher model. However, existing approaches are combinatorial in nature, merely integrating an RL process with the distillation process, which introduces unnecessary complexities. To address this gap, we propose AdvDMD, a method that seamlessly unifies DMD distillation and RL. Specifically, AdvDMD employs the adversarially trained discriminator from DMD2 as the reward model, which assigns low scores to generated images and high scores to real ones. It is trained on both intermediate and final states of the denoising process and updated online with the distilled model, enabling holistic supervision of the sampling trajectories and mitigating reward hacking. We adopt a unified SDE backward simulation and a different training schedule for DMD and RL to enable more stable and efficient training. Experimental results demonstrate that the 4-step AdvDMD outperforms the original 40-step model for SD3.5 on DPG-Bench, while achieving significant performance gains for SD3 on GenEval. On Qwen-Image, our 2-step AdvDMD achieves superior performance over TwinFlow.

16. 【2604.28123】PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Link: https://arxiv.org/abs/2604.28123

Authors: Sudong Wang,Weiquan Huang,Xiaomin Yu,Zuhao Yang,Hehai Lin,Keming Wu,Chaojun Xiao,Chen Chen,Wenxuan Wang,Beier Zhu,Yunjian Zhang,Chengwei Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: applies supervised fine-tuning, standard post-training recipe, applies supervised, supervised fine-tuning, verifiable rewards

Comments:

Click to view abstract

17. 【2604.28122】Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Link: https://arxiv.org/abs/2604.28122

Authors: Andrew Bond,Ilkin Umut Melanlioglu,Erkut Erdem,Aykut Erdem

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: systems increasingly rely, produce plausible motion, consistent camera dynamics, modeling systems increasingly, Modern visual world

Comments: 16 pages, 10 figures

Click to view abstract

Abstract:Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S$^2$VAE, a geometry-first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point-level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry-aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high-compression regimes. Our results highlight latent geometry as a first-class design choice for physically grounded visual and world models.

18. 【2604.28115】FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

Link: https://arxiv.org/abs/2604.28115

Authors: Zeyu Jiang,Changqing Zhou,Xingxing Zuo,Changhao Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing learning-based occupancy, Existing learning-based, learning-based occupancy prediction, open-vocabulary occupancy prediction, rely on large-scale

Comments: RSS 2026

Click to view abstract

Abstract:Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over $2\times$ improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: this https URL.
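
The IoU and mIoU figures quoted in this abstract are standard overlap metrics. The following is a minimal sketch of how they are commonly computed, not the benchmark's evaluation code. It assumes voxel labels are flattened into equal-length lists and that label 0 means "empty"; the function names and the `empty=0` convention are illustrative assumptions, and the exact protocol used on EmbodiedOcc-ScanNet is not described in the abstract.

```python
def occupancy_iou(pred, gt, empty=0):
    """Geometric IoU: a voxel counts as occupied if its label differs from `empty`."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else 1.0


def mean_iou(pred, gt, classes):
    """Average the per-class IoU over the semantic classes that occur in pred or gt."""
    scores = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            scores.append(inter / union)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: six voxels, label 0 = empty, labels 1 and 2 = semantic classes.
pred = [0, 1, 1, 2, 2, 0]
gt = [0, 1, 2, 2, 2, 1]
iou = occupancy_iou(pred, gt)
miou = mean_iou(pred, gt, classes=[1, 2])
```

In this toy case the geometric IoU is higher than the mIoU, because a voxel can be correctly marked as occupied while still carrying the wrong semantic label.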

19. 【2604.28095】UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Medical Image Segmentation

Link: https://arxiv.org/abs/2604.28095

Authors: Shuokun Cheng,Jinghao Shi,Kun Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate lesion segmentation, Accurate lesion, treatment planning, segmentation is crucial, crucial for clinical

Comments: 8 pages, 4 figures, 7 tables

Click to view abstract

Abstract:Accurate lesion segmentation is crucial for clinical diagnosis and treatment planning. However, lesions often resemble surrounding tissues and exhibit ill-defined boundaries, leading to unstable predictions in boundary/transition regions. Moreover, small-lesion cues can be diluted by multi-scale feature extraction, causing under- or over-segmentation. To address these challenges, we propose an Uncertainty-Aware Hypergraph Refinement Network (UHR-Net). First, we introduce an Uncertainty-Oriented Instance Contrastive (UO-IC) pretraining strategy that couples geometry-aware copy-paste augmentation with hard-negative mining of lesion-like background regions to improve instance-level discrimination for small and visually ambiguous lesions. Second, we design an Uncertainty-Guided Hypergraph Refinement (UGHR) block, which derives an entropy-based uncertainty map from a coarse probability map to guide hypergraph refinement. By splitting hyperedge prototypes into foreground and background groups, UGHR decouples higher-order interactions and improves refinement in ambiguous regions. Experiments on five public benchmarks demonstrate consistent gains over strong baselines. Code is available at: this https URL.
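
The abstract says the UGHR block derives an entropy-based uncertainty map from a coarse probability map. A minimal sketch of that idea, assuming binary (lesion vs. background) segmentation and pixel-wise Shannon entropy, is shown below. The function names, the use of base-2 logarithms, and the clamping constant are assumptions for illustration, not details taken from the paper.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy in bits of a Bernoulli foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty_map(prob_map):
    """Apply binary entropy pixel-wise to a 2D list of foreground probabilities."""
    return [[binary_entropy(p) for p in row] for row in prob_map]


# Toy 2x2 coarse probability map: confident background, ambiguous boundary,
# fairly confident lesion, very confident lesion.
coarse = [[0.02, 0.50],
          [0.90, 0.98]]
umap = uncertainty_map(coarse)
```

Entropy peaks where the coarse prediction is closest to 0.5, which is the kind of boundary or transition region the abstract describes as unstable, so such a map can serve as a weight telling a refinement stage where to focus.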

20. 【2604.28078】AesRM: Improving Video Aesthetics with Expert-Level Feedback

Link: https://arxiv.org/abs/2604.28078

Authors: Yujin Han,Yujie Wei,Yefei He,Xinyu Liu,Tianle Li,Zichao Yu,Andi Han,Shiwei Zhang,Tingyu Weng,Difan Zou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: photorealistic video generation, filmmaking require video, visual fidelity, visual, require video aesthetics

Comments: 37 pages, 14 figures, 12 tables

Click to view abstract

Abstract:Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. Prior work on visual aesthetics largely focuses on images, often reducing aesthetics to coarse definitions, e.g., visual pleasure, without a rigorous and systematic evaluation. To improve video aesthetics, we propose a hierarchical rubric that decomposes video aesthetics into three core dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), with 15 fine-grained criteria, e.g., shot composition. This framework enables a large-scale expert-annotated preference dataset and an evaluation benchmark, AesVideo-Bench, containing about 2500 video pairs with expert annotations on VA, VF, and VP. We then build a family of Video Aesthetic Reward Models (AesRM): AesRM-Base, which directly predicts pairwise preferences on these dimensions to provide efficient post-training rewards, and AesRM-CoT, which additionally generates CoT aligned with all 15 criteria to improve assessment interpretability. Specifically, we train AesRM with a three-stage progressive scheme: (1) Atomic Aesthetic Capability Learning, which strengthens AesRM's recognition of fundamental aesthetic concepts, e.g., accurately identifying centered composition; (2) Cold-Start, aligning the model with structured reasoning protocols; and (3) GRPO, further improving evaluation accuracy. To enhance AesRM-CoT, we additionally propose self-consistency-based CoT synthesis to improve CoT quality and design CoT-based process rewards during GRPO. Extensive experiments show AesRM outperforms baselines on multiple aesthetics benchmarks and is more robust, with lower position bias. Finally, we align Wan2.2 with AesRM and observe clear aesthetic gains over existing aesthetic reward models.

21. 【2604.28064】3D Reconstruction Techniques in the Manufacturing Domain: Applications, Research Opportunities and Use Cases

Link: https://arxiv.org/abs/2604.28064

Authors: Chialoon Cheng(1),Kaijun liu(2),Zhiyang Liu(1),Marcelo H Ang Jr(1) ((1) Advanced Robotics Centre, National University of Singapore, Singapore (2) Independent Researcher)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: comprehensive review examines, art in three-dimensional, examines the evolution, reconstruction techniques, comprehensive review

Comments: 24 pages

Click to view abstract

Abstract:This comprehensive review examines the evolution and the current state of the art in three-dimensional (3D) reconstruction techniques in manufacturing applications. The analysis covers both traditional approaches and emerging deep learning methods, showing a critical research gap in unified 3D reconstruction frameworks. Through systematic review of 106 recent publications, we classify reconstruction techniques into three primary categories: data acquisition, point cloud generation, post-processing and applications. Non-contact methods, particularly structured light scanning and stereo vision, have shown significant adoption in manufacturing, with 47% of surveyed applications focusing on quality inspection. The integration of deep learning has enhanced reconstruction accuracy and processing speed, particularly in feature extraction and matching. Key applications span design and development (13%), machining (8%), process (17%), assembly (22%), and quality inspection (40%). While current technologies achieve sub-millimeter accuracy in controlled environments, challenges persist in handling reflective surfaces and dynamic environments. Our findings indicate a trend toward hybrid systems combining multiple sensor types and processing methods to overcome individual limitations. This survey provides a structured framework for understanding current capabilities and future directions in manufacturing-focused 3D reconstruction.

22. 【2604.28045】TAFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement

Link: https://arxiv.org/abs/2604.28045

Authors: Xiumei Li,Alexander Kopte,André Kaup

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: making rate adaptation, rate adaptation costly, adaptation costly due, fixed rate-distortion point, bandwidth-adaptive transmission

Comments: Accepted at IEEE International Conference on Image Processing (ICIP) 2026

Click to view abstract

Abstract:Scalable compression is essential for bandwidth-adaptive transmission, yet most learned codecs are optimized for a fixed rate-distortion point, making rate adaptation costly due to re-encoding or maintaining multiple bitstreams. In this work, we propose TAFA-GSGC, a scalable learned point cloud geometry codec that enables multi-quality decoding from a single bitstream and a single trained model. TAFA-GSGC combines layered residual refinement with channel-group entropy coding, and introduces a Target-Aligned Feature Aggregation module to reduce cross-layer redundancy in enhancement residuals. Our framework supports up to 9 decodable quality levels with monotonic quality improvement as more subbitstreams are received, while maintaining strong compression efficiency. Compared with the baseline PCGCv2, TAFA-GSGC attains comparable and slightly better RD performance, achieving average BD-Rate savings of -4.99% in D1 and -5.92% in D2.

23. 【2604.28025】ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss

Link: https://arxiv.org/abs/2604.28025

Authors: Jiaying Ying,Heming Du,Kaihao Zhang,Sean M. Tweedy,Xin Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: person-centric representation, supports analysis, human-computer interaction, representation that supports, human mesh recovery

Comments: Highlight in CVPR 2026. Project at [this https URL](https://akitaraphael.github.io/ResiHMR/)

Click to view abstract

Abstract:Single-image human mesh recovery provides a compact 3D, person-centric representation that supports analysis, animation, AR and VR, rehabilitation, and human-computer interaction. However, prevailing systems impose an intact-limb prior and degrade on people with limb loss, because fixed-topology models cannot represent residual limbs. In this work, we present ResiHMR, a residual-limb aware framework for single-image 3D human modeling. ResiHMR adopts residual-limb keypoints and introduces two components: (i) a topology-adaptive Residual Anchor-Factor Optimization module that constrains estimation to the observed kinematic subgraph of anatomically valid structures, and (ii) a geometry-based Residual-Limb Reconstruction module that estimates residual-limb boundaries and convex limb-termination geometry. These components introduce topology-aware optimization and explicit termination geometry as tools for human mesh recovery under non-standard limb anatomy. Unlike joint-removal methods in a fixed topology, ResiHMR explicitly reconstructs residual-limb surfaces and aligns optimization with limb-loss topology, which better matches prosthetic biomechanics and real-world use. To the best of our knowledge, this is the first single-image HMR system that explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization for individuals with limb loss. On a curated dataset of real-world images with limb loss, ResiHMR improves reconstruction quality under both SMPLify-X and HSMR backbones, reducing intact-joint 2D MPJPE from 41.32 to 37.40 with SMPLify-X and residual-limb 2D MPJPE from 73.61 to 23.19 with HSMR.
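
The results above are reported as 2D MPJPE (mean per-joint position error). As a minimal sketch of the metric as it is usually defined, the function below averages the Euclidean distance between predicted and ground-truth 2D joint positions. The function name and the toy coordinates are illustrative assumptions; the paper's joint sets, units, and any normalization are not given in the abstract.

```python
import math


def mpjpe_2d(pred_joints, gt_joints):
    """Mean Euclidean distance between matching predicted and ground-truth 2D joints."""
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


# Toy example: the first joint is exact, the second is off by a 3-4-5 triangle.
pred = [(10.0, 20.0), (33.0, 44.0)]
gt = [(10.0, 20.0), (30.0, 40.0)]
error = mpjpe_2d(pred, gt)
```

Reporting the metric separately over intact joints and residual-limb keypoints, as the abstract does, amounts to calling such a function on each subset of joints.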

24. 【2604.28022】Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Link: https://arxiv.org/abs/2604.28022

Authors: Sharayu Nilesh Deshmukh,Kailash A. Hambarde,Joana C. Costa,Hugo Proença,Tiago Roxo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current DeepFake detection, Current DeepFake, semantic mismatch, vary across audio, DeepFake detection scenarios

Comments: Submitted to IJCB 2026

Click to view abstract

Abstract:Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, whose variability is not captured in binary settings. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantic mismatch data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to improve DeepFake detection in both our proposed and state-of-the-art settings, on FakeAVCeleb and LAV-DF, paving the way to more realistic DeepFake detectors. The source code and data are available at this https URL.

25. 【2604.28016】Faster 3D Gaussian Splatting Convergence via Structure-Aware Densification

Link: https://arxiv.org/abs/2604.28016

Authors: Linjie Lyu,Ayush Tewari,Jianchun Chen,Thomas Leimkühler,Christian Theobalt

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: real-time novel-view synthesis, powerful scene representation, Splatting has emerged, Gaussian Splatting, novel-view synthesis

Comments: Siggraph 2026

Click to view abstract

Abstract:3D Gaussian Splatting has emerged as a powerful scene representation for real-time novel-view synthesis. However, its standard adaptive density control relies on screen-space positional gradients, which do not distinguish between geometric misplacement and frequency aliasing, often leading to either over-blurred high-frequency textures or inefficient over-densification. We present a structure-aware densification framework. Our key insight is that the decision to subdivide a Gaussian should be driven by an explicit comparison between its projected screen-space extent and the local structure of the texture it seeks to represent. We introduce a multi-scale frequency analysis combining structure tensors with Laplacian scale space analysis to estimate the dominant frequency at each pixel, enabling robust supervision across varying texture scales. Based on this analysis, we define $\eta$, a per-Gaussian, per-axis frequency violation metric that indicates when a primitive may be under-resolving local texture details. Unlike methods that perform isotropic splitting (e.g., splitting each Gaussian into two smaller ones with uniform shape), our approach performs anisotropic splitting. For each axis with high $\eta$, we compute a split factor to better resolve the local frequency content. We further introduce a multiview consistency criterion that aggregates $\eta$ observations across multiple views. By performing densification early and faster, we skip the lengthy iterative densification phases required by baseline methods and achieve significantly faster convergence. Experiments on standard benchmarks demonstrate that our method also achieves superior reconstruction quality, particularly in high-frequency regions.

26. 【2604.28011】Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

Link: https://arxiv.org/abs/2604.28011

Authors: Jing Zhang,Wentao Jiang,Tao Huang,Zhiwei Wang,Jianxin Liu,Jian Chen,Ping Ye,Gang Wang,Zengmao Wang,Bo Du,Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offer strong localization, existing methods typically, methods typically excel, multimodal large language, large language models

Comments: 12 pages, 4 figures. Technical report

Click to view abstract

Abstract:Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion anchoring and Echo-α-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-α outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-α-Grounding attains 56.73%/43.78% F1@0.5 and Echo-α-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at this https URL.

27. 【2604.27975】TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Link: https://arxiv.org/abs/2604.27975

Authors: Ce Chen,Yi Ren,Yuanming Li,Viktor Goriachko,Zhenhui Ye,Zujin Guo,Zhibin Hong,Mingming Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Shot Boundary Detection, Traditional Shot Boundary, frequently yielding corrupted, Boundary Detection, Shot Transition Detection

Comments: This work has been deployed to production. For more related research, please visit HeyGen Research ( [this https URL](https://www.heygen.com/research) ) and HeyGen Avatar-V ( [this https URL](https://www.heygen.com/research/avatar-v-model) ). Project page: [this https URL](https://chence17.github.io/TransVLM/)

Click to view abstract

Abstract:Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (this https URL) and HeyGen Avatar-V (this https URL). Project page: this https URL

28. 【2604.27974】FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Link: https://arxiv.org/abs/2604.27974

Authors: Fengxian Ji,Jingpu Yang,Zirui Song,Yuanxi Wang,Zhexuan Cui,Yuke Li,Qian Jiang,Xiuying Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Keywords: large vision-language models, State Success Rate, Success Rate, GUI interaction remains, Exact State Success

Comments:

Click to view abstract

Abstract:Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int), and a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint to diagnose visual grounding reason via controlled w/ vs. w/o comparisons. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and 22.8% on average across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction. GitHub: this https URL.

29. 【2604.27968】ClimateVID -- Social Media Videos Analysis and Challenges Involved

Link: https://arxiv.org/abs/2604.27968

Authors: Shiqi Xu,Moritz Burmester,Katharina Prasse,Isaac Bravo,Stefanie Walter,Margret Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: specifically short videos, social media platforms, social media data, social media, digital content

Comments: Equal contributions by Shiqi Xu and Moritz Burmester

Click to view abstract

Abstract:The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot and clustering capabilities on social media data. (1) We evaluated the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLava using zero-shot image classification and compared their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance to practitioners. While VLMs are currently not able to detect climate change specific classes, the clustering results are distinct visual frames. We find that both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways. Code available at this https URL.

30. 【2604.27958】TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

Link: https://arxiv.org/abs/2604.27958

Authors: Dingbao Shao,Song Wu,Shenyi Wang,Ye Wang,Ziheng Tang,Fei Liu,Jiang Lin,Xinyu Chen,Qian Wang,Ying Tai,Jian Yang,Zili Yi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models remains limited, try-on models remains, scarcity of large-scale, remains limited, models remains

Comments:

Click to view abstract

Abstract:Due to the scarcity of large-scale in-the-wild triplet data and the improper use of masks, the performance of video virtual try-on models remains limited. In this paper, we first introduce **TripVVT-10K**, the largest and most diverse in-the-wild triplet dataset to date, providing explicit video-level cross-garment supervision that existing video datasets lack. Built upon this resource, we develop **TripVVT**, a Diffusion Transformer-based framework that replaces fragile garment masks with a simple, stable human-mask prior, enabling reliable background preservation while remaining robust to real-world motion, occlusion, and cluttered scenes. To support comprehensive evaluation, we further establish **TripVVT-Bench**, a 100-case benchmark covering diverse garments, complex environments, and multi-person scenarios, with metrics spanning video quality, try-on fidelity, background consistency, and temporal coherence. Compared to state-of-the-art academic and commercial systems, TripVVT achieves superior video quality and garment fidelity while markedly improving generalization to challenging in-the-wild videos. We publicly release the dataset and benchmark, which we believe provide a solid foundation for advancing controllable, realistic, and temporally stable video virtual try-on.

31. 【2604.27955】GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Link: https://arxiv.org/abs/2604.27955

Authors: Junan Hu,Jian Liu,Jingxiang Lai,Jiarui Hu,Yiwei Sheng,Shuang Chen,Jian Li,Dazhao Du,Song Guo

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, graphical interfaces visually, Graphical User, User Interface, graphical interfaces

Comments: Project Page: [this https URL](https://github.com/Steve2457/Awesome-RL-GUI-Agents)

Click to view abstract

Abstract:Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.

32. 【2604.27953】The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Link: https://arxiv.org/abs/2604.27953

Authors: Kenneth J. K. Ong

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Iterated Prisoner Dilemma, visual inputs influence, decision-making systems, increasingly integrated, integrated into decision-making

Comments:

Abstract:As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner's Dilemma (IPD) as a test scenario. We examine whether exposure to images depicting behavioral concepts (kindness/helpfulness vs. aggressiveness/selfishness) and color-coded reward matrices alters VLM decision patterns. Experiments were conducted across multiple state-of-the-art VLMs. We further explore mitigation strategies including prompt modifications, Chain of Thought (CoT) reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models. These findings not only underscore the importance of robust evaluation frameworks for VLM deployment in visually rich and safety-critical environments, but also highlight how architectural and training differences among models may lead to distinct behavioral responses, an area worthy of further investigation.

33. 【2604.27932】Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

Link: https://arxiv.org/abs/2604.27932

Authors: Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VLM training, vision-language model, VLM, data, training

Comments:

Abstract:The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, long-tail concepts remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.
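
The sampling idea in this abstract (shrink large clusters, grow small ones, keep their relative order, redraw every epoch) can be illustrated with a minimal sketch. It is not the authors' implementation: the power-law rule `size ** alpha`, the `budget` parameter, and the function names `cluster_targets` and `sample_epoch` are assumptions made for illustration only.

```python
import random

def cluster_targets(sizes, budget, alpha=0.5):
    """Per-cluster sample counts proportional to size**alpha.

    Assumed rule (not from the paper): with 0 < alpha < 1 the mapping is
    monotone, so larger clusters still get more samples than smaller ones,
    but the gap between them is compressed.
    """
    weights = [n ** alpha for n in sizes]
    total = sum(weights)
    return [max(1, round(budget * w / total)) for w in weights]

def sample_epoch(clusters, budget, epoch, alpha=0.5, seed=0):
    """Draw a fresh training subset for one epoch.

    `clusters` maps a cluster id to a list of example ids.
    """
    rng = random.Random(seed + epoch)  # a different draw for every epoch
    ids = sorted(clusters)
    targets = cluster_targets([len(clusters[c]) for c in ids], budget, alpha)
    batch = []
    for c, k in zip(ids, targets):
        members = clusters[c]
        if k <= len(members):
            batch.extend(rng.sample(members, k))     # downsample, no replacement
        else:
            batch.extend(rng.choices(members, k=k))  # upsample, with replacement
    rng.shuffle(batch)
    return batch
```

With cluster sizes of 1000, 100, and 10 and a budget of 500, the largest cluster ends up below its original size and the smallest above it, while the ordering of the three clusters is unchanged.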

34. 【2604.27928】Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

Link: https://arxiv.org/abs/2604.27928

Authors: Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Tunnel inspection requires, severity grading, inspection requires outputs, Tunnel inspection, interference-heavy tunnel scenes

Comments:

Abstract:Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at coarse open-vocabulary proposals, which are difficult to use directly in interference-heavy tunnel scenes. We propose a training-free framework TunnelMIND. Specifically, language-guided defect proposals are not treated as final outputs; instead, their spatial support is recalibrated at inference time through dense visual consistency, so that coarse semantic anchors can be transformed into more reliable prompts under tunnel-specific hard negatives. The resulting masks are further reconstructed into structured defect entities with category, location, geometry, severity, and context attributes, which are then mapped to retrieval-grounded explanation and engineering-readable report generation under expert knowledge constraints. On visible, GPR, and road defect tasks, TunnelMIND achieves F1 scores of 0.68, 0.78, and 0.72, respectively. Overall, TunnelMIND shows that training-free tunnel inspection can move beyond coarse localization toward structured defect evidence for engineering assessment.

35. 【2604.27918】Generate Your Talking Avatar from Video Reference

Link: https://arxiv.org/abs/2604.27918

Authors: Zujin Guo, Zhenhui Ye, Yi Ren, Yuanming Li, Ce Chen, Zhibin Hong, Chen Change Loy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: methods typically adopt, avatar methods typically, talking avatar methods, static reference image, talking avatar

Comments: Project Page: [this https URL](https://gseancdat.github.io/projects/TAVR)

Abstract:Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For related research, see HeyGen Research and HeyGen Avatar-V.

36. 【2604.27903】HiMix: Hierarchical Artifact-aware Mixup for Generalized Synthetic Image Detection

Link: https://arxiv.org/abs/2604.27903

Authors: Shuchang Zhou, Kaiwen Shen, Jiwei Wei, Yuyang Zhou, Peng Wang, Yang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic Image Detection, generalizable Synthetic Image, posing significant challenges, diverse synthetic images, generalizable Synthetic

Comments:

Abstract:The rapid evolution of generative models has enabled the creation of highly realistic and diverse synthetic images, posing significant challenges to reliable and generalizable Synthetic Image Detection (SID). However, existing detectors are typically trained on limited and biased datasets, resulting in poor generalization to unseen generators. To address this issue, we propose HiMix, a unified framework that enhances generalization by expanding the training distribution and promoting artifact-aware representations. Specifically, the Mixup-driven Distributional Augmentation (MDA) module constructs continuous transitional samples between real and fake images, improving coverage of low-confidence regions and exposing the model to more challenging samples, while the pixel-wise mixup operation smoothly perturbs semantics to enhance sensitivity to low-level artifacts. Moreover, the Hierarchical Artifact-aware Representation (HAR) module aggregates artifact information from both global and local levels through cross-layer integration and coarse-to-fine feature fusion, enabling the extraction of discriminative forgery representations under diverse distributions. Extensive experiments across multiple benchmarks demonstrate that HiMix achieves state-of-the-art performance, establishing well-separated logits for improved generalization to unseen forgeries.
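
The pixel-wise mixup operation mentioned in this abstract can be illustrated with the standard mixup formula, a convex combination of two images. The sketch below is only that generic operation, not HiMix's MDA module; the function name `mixup`, the nested-list image format, and the soft-label convention (0 = real, 1 = fake) are assumptions.

```python
def mixup(real, fake, lam):
    """Pixel-wise convex combination of two equally sized images.

    Images are nested lists of floats (rows of pixels). `lam` is the weight
    of the real image; lam=1 returns the real image, lam=0 the fake one.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [
        [lam * r + (1.0 - lam) * f for r, f in zip(real_row, fake_row)]
        for real_row, fake_row in zip(real, fake)
    ]
    # Assumed convention: label 0 = real, 1 = fake, mixed with the same weight.
    soft_label = 1.0 - lam
    return mixed, soft_label
```

Intermediate values of `lam` give the transitional samples between the two classes that the abstract describes; how the paper assigns labels to them is not stated there.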

37. 【2604.27889】Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection

Link: https://arxiv.org/abs/2604.27889

Authors: Ali Shibli, Andrea Nascetti, Yifang Ban

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: satellite imagery, remote sensing, remote sensing scenarios, fundamental challenges, differences from satellite

Comments:

Abstract:Semantic segmentation and change detection are two fundamental challenges in remote sensing, requiring models to capture either spatial semantics or temporal differences from satellite imagery. Existing deep learning models often struggle with temporal inconsistencies or in capturing fine-grained spatial structures, require extensive pretraining, and offer limited interpretability - especially in real-world remote sensing scenarios. Recent advances in diffusion models show that Gaussian noise can be systematically leveraged to learn expressive data representations through denoising. Motivated by this, we investigate whether the noise process in diffusion models can be effectively utilized for discriminative tasks. We propose Noise2Map, a unified diffusion-based framework that repurposes the denoising process for fast, end-to-end discriminative learning. Unlike prior work that uses diffusion only for generation or feature extraction, Noise2Map directly predicts semantic or change maps using task-specific noise schedules and timestep conditioning, avoiding the costly sampling procedures of traditional diffusion models. The model is pretrained via self-supervised denoising and fine-tuned with supervision, enabling both interpretability and robustness. Our architecture supports both tasks (SS and CD) through a shared backbone and task-specific noise schedulers. Extensive evaluations on the SpaceNet7, WHU, and xView2 buildings damaged by wildfires datasets demonstrate that Noise2Map ranks on average 1st among seven models on semantic segmentation and 1st on change detection by a cross-dataset rank metric (average F1 primary, IoU tie-break). Ablation studies highlight the robustness of our model against different training noise schedulers and timestep control in the diffusion process, as well as the ability of the model to perform multi-task learning.

38. 【2604.27875】Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection

Link: https://arxiv.org/abs/2604.27875

Authors: Shuchang Zhou, Shangkun Wu, Jiwei Wei, Ke Liu, Ran Ran, Caiyan Qin, Yang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: posing significant challenges, Vision Foundation Models, AI-generated images, posing significant, increasingly realistic

Comments:

Abstract:AI-generated images are becoming increasingly realistic and diverse, posing significant challenges for generalizable detection. While Vision Foundation Models (VFMs) provide rich semantic representations and frequency-based methods capture complementary artifact cues, existing approaches that combine these modalities still suffer from limited generalization, with notable performance degradation on unseen generative models. We attribute this limitation to two key factors: frequency shortcut bias toward easily distinguishable cues associated with specific generators and cross-domain representation conflict between high-level semantics and low-level frequency patterns. To address these issues, we propose a Frequency-aware Gated Injection Network (FGINet) to improve generalization. Specifically, we design a Band-Masked Frequency Encoder (BMFE) that applies cross-band masking in the frequency domain to reduce reliance on generator-specific patterns and encourage more diverse and generalizable representations. We further introduce a Layer-wise Gated Frequency Injection (LGFI) mechanism to progressively inject frequency cues into the VFM backbone with adaptive gating, aligning with its hierarchical abstraction and alleviating representation conflict. Moreover, we propose a Hyperspherical Compactness Learning (HCL) framework with a cosine margin objective to learn compact and well-separated representations. Extensive experiments demonstrate that FGINet achieves state-of-the-art performance and strong generalization across multiple challenging datasets.

39. 【2604.27870】Parameter-Efficient Architectural Modifications for Translation-Invariant CNNs

Link: https://arxiv.org/abs/2604.27870

Authors: Nuria Alabau-Bosque, Jorge Vila-Tomas, Paula Dauden-Oliver, Valero Laparra, Jesus Malo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, drastically degrade performance, degrade performance due, spatially dependent fully, dependent fully connected

Comments: 25 pages, 16 figures

Abstract:Convolutional Neural Networks (CNNs) are widely assumed to be translation-invariant, yet standard architectures exhibit a startling fragility: even a single-pixel shift can drastically degrade performance due to their reliance on spatially dependent fully connected layers. In this work, we resolve this vulnerability by proposing a lightweight 'Online Architecture' strategy. By strategically inserting Global Average Pooling (GAP) layers at various network depths, we effectively decouple feature recognition from spatial location. Using VGG-16 as a primary case study, we demonstrate that this architectural modification achieves a massive 98% reduction in trainable parameters (from 5.2M to just 82K) and a 90% reduction in total network size (138M to 14M). Despite this drastic pruning, our variants maintain competitive Top-1 accuracy on ImageNet (66.4%) while doubling translational robustness, reducing average relative loss from 0.09 to 0.05. Furthermore, our analysis identifies a fundamental limit to invariance: while GAP resolves macroscopic sensitivity, discrete pooling operations introduce a residual periodic aliasing that prevents perfect pixel-level stability. Finally, we extend these findings to Perceptual Image Quality Assessment (IQA) by integrating our invariant backbones into the LPIPS framework. The resulting metric significantly outperforms the retrained baseline in generalization across the KADID-10k dataset (Spearman 0.89 vs. 0.75) and achieves a near-perfect alignment with human psychophysical response curves on the RAID dataset (Spearman 0.95). These results confirm that enforcing architectural invariance is a far more efficient and biologically plausible path to robustness than traditional data augmentation. The data and code are publicly available to facilitate validation and further research.
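
Why Global Average Pooling decouples recognition from location can be seen in a toy example: averaging over all positions ignores where an activation sits, whereas a flattened, fully connected readout does not. The sketch below assumes a circular (wrap-around) shift and hypothetical helper names; it does not model borders or the strided-pooling aliasing that the abstract identifies as a residual limit.

```python
def circular_shift(fmap, dx):
    """Shift every row of a 2D feature map right by dx columns, wrapping around."""
    w = len(fmap[0])
    return [[row[(j - dx) % w] for j in range(w)] for row in fmap]

def global_average_pool(fmap):
    """Mean over all spatial positions; the position of each value is discarded."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)

def flatten_fc(fmap, weights):
    """Position-dependent readout: dot product of the flattened map with fixed weights."""
    values = [v for row in fmap for v in row]
    return sum(v * w for v, w in zip(values, weights))

# A single activation, then the same activation moved one pixel to the right.
fmap = [[0, 0, 0, 0],
        [0, 9, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
shifted = circular_shift(fmap, 1)
weights = list(range(16))  # arbitrary fixed weights standing in for an FC layer
```

Here `global_average_pool` returns the same value for `fmap` and `shifted`, while `flatten_fc` returns different values because the activation now meets a different weight.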

40. 【2604.27833】Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

Link: https://arxiv.org/abs/2604.27833

Authors: Yuhua Wang, Qinnan Zhang, Xiaodong Li, Huan Zhang, Yifan Sun, Wangjie Qiu, Hainan Zhang, Yongxin Tong, Zhiming Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Personalized Federated Learning, Prototype-based Personalized Federated, Gaussian Prototype Perturbation, poses privacy risks, Prototype-based Personalized

Comments: Accepted by CVPR 2026 (Highlight)

Abstract:Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient multi-domain adaptation by communicating compact class prototypes, but directly sharing them poses privacy risks. A common defense involves per-example $\ell_2$ clipping before prototype computation to bound sensitivity, followed by isotropic Gaussian noise to enforce Local Differential Privacy (LDP). However, Isotropic Gaussian Prototype Perturbation (IGPP) typically over-perturbs discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. In this paper, we propose VPDR, a client-side privacy plug-in that seamlessly integrates into existing ProtoPFLs. Motivated by the observation that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which allocates less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further develop Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise mechanism provides privacy guarantees no weaker than the isotropic baseline under the same privacy constraints. Extensive experiments on multi-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning without sacrificing robustness against realistic attacks.
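
The baseline defense described in this abstract (per-example $\ell_2$ clipping, prototype computation, then isotropic Gaussian noise) can be sketched as follows. This shows only the isotropic baseline, not VPDR's variance-adaptive mechanism; the function names are hypothetical, and `sigma` is taken as a given parameter because calibrating it to a specific privacy budget is outside the scope of the sketch.

```python
import math
import random

def clip_l2(vec, c):
    """Rescale vec so that its L2 norm is at most c."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def noisy_prototype(features, c, sigma, rng=None):
    """Class prototype with per-example clipping and isotropic Gaussian noise.

    `features` is a list of equal-length feature vectors for one class.
    The same noise scale `sigma` is applied to every dimension, which is
    what makes the perturbation isotropic.
    """
    rng = rng or random.Random(0)
    clipped = [clip_l2(f, c) for f in features]
    dim = len(clipped[0])
    mean = [sum(f[d] for f in clipped) / len(clipped) for d in range(dim)]
    return [m + rng.gauss(0.0, sigma) for m in mean]
```

Because every dimension receives the same noise scale, dimensions that separate classes well are perturbed as much as uninformative ones, which is the over-perturbation issue the abstract attributes to the isotropic scheme.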

41. 【2604.27804】Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures

Link: https://arxiv.org/abs/2604.27804

Authors: Ishrak Hamim Mahi, Siam Ferdous, Md Sakib Sadman Badhon, Nabid Hasan Omi, Md Habibun Nabi Hemel, Farig Yousuf Sadeque, Md. Tanzim Reza

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: artificial intelligence, systems has intensified, rapid proliferation, intensified concerns, image generation models

Comments: 10 pages, 9 figures, 2 tables

Abstract:The rapid proliferation of image generation models and other artificial intelligence (AI) systems has intensified concerns regarding data privacy and user consent. As the availability of public datasets declines, major technology companies increasingly rely on proprietary or private user data for model training, raising ethical and legal challenges when users request the deletion of their data after it has influenced a trained model. Machine unlearning seeks to address this issue by enabling the removal of specific data from models without complete retraining. This study investigates a modified SISA (Sharded, Isolated, Sliced, and Aggregated) framework designed to achieve class-level unlearning in Convolutional Neural Network (CNN) architectures. The proposed framework incorporates a reinforced replay mechanism and a gating network to enhance selective forgetting efficiency. Experimental evaluations across multiple image datasets and CNN configurations demonstrate that the modified SISA approach enables effective class unlearning while preserving model performance and reducing retraining overhead. The findings highlight the potential of SISA-based unlearning for deployment in privacy-sensitive AI applications. The implementation is publicly available at this https URL sisa-class-unlearning.

42. 【2604.27764】GourNet: A CNN-Based Model for Mango Leaf Disease Detection

Link: https://arxiv.org/abs/2604.27764

Authors: Ekram Alam, Jaydip Sanyal, Akhil Kumar Das, Arijit Bhattacharya, Farhana Sultana

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: agricultural sector, food security, cultivation is crucial, contributing to economic, economic development

Comments:

Abstract:Mango cultivation is crucial in the agricultural sector, significantly contributing to economic development and food security. However, diseases affecting mango leaves can significantly reduce both the production and overall fruit grade. Detecting leaf diseases at an early stage with precision is key to effective disease prevention and sustaining crop productivity. In this paper, we introduce a "deep learning" model named "GourNet", which leverages "Convolutional Neural Networks" to identify infections in mango leaves. We utilize the "MangoLeafBD" (MBD) dataset to train and assess the effectiveness of the presented model. The MBD dataset contains seven disease classes and a Healthy class, making a total of eight classes. To enhance model performance, the images are preprocessed through steps like resizing, rescaling, and data augmentation prior to training. To properly evaluate the model, the dataset is separated into 80% for training, with the remaining 20% equally split between validation and testing. Our model uses only 683,656 total parameters and achieves a classification accuracy of 97%. This research's source code can be found at: this https URL.
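
The 80/10/10 partition described in this abstract can be sketched as a shuffled split. The function name `split_80_10_10` and the fixed seed are assumptions, and the sketch does not stratify by class; whether the authors stratified is not stated in the abstract.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle and split into 80% train, then halve the rest into val and test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10          # integer arithmetic avoids float rounding
    n_val = (len(items) - n_train) // 2     # half of the remaining 20%
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

For a dataset with eight classes, a stratified variant would apply the same split within each class so that all classes appear in every partition.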

43. 【2604.27759】Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition

Link: https://arxiv.org/abs/2604.27759

Authors: Gurucharan Srinivas, Joshua Niemeijer, Frank Köster

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep neural networks, Integrating domain knowledge, class probabilities, Integrating domain, Differentiable Knowledge Unit

Comments: Accepted to CVPR Findings 2026

Abstract:Integrating domain knowledge into deep neural networks is a promising way to improve generalization. Existing methods either encode prior knowledge in the loss function or apply post-processing modules, but both depend on identifying useful symbolic knowledge to integrate. Since such rules are often unavailable in real-world vision tasks, we propose a method for targeted knowledge discovery. We propose a Differentiable Knowledge Unit (DKU) that enables modulating the classifier logits, yielding refined class probabilities. The DKU uses implication rules to represent relationships between task classes and implicit concepts learned entirely from the main task supervision, without requiring concept labels. Concepts are identified by dedicated classifiers, whose probabilities are passed to DKU alongside the primary class probabilities. DKU computes a logic-based adjustment vector via fuzzy inference, which modulates the primary class logits to yield refined class probabilities. When concept classifiers represent concepts that do not support the logical rule structure, the resulting adjustments to the class probabilities do not directly minimize the supervision loss. Consequently, optimizing the supervision loss on these adjusted class probabilities implicitly trains the concept classifiers. We construct the rule base so that bidirectional logical relations connect concepts and classes. We enforce the concepts to be distinct from each other and with respect to the classes. This design enforces a clean supervision signal for concept learning. We evaluate our methods on the PASCAL-VOC, COCO, and MedMNIST datasets. We demonstrate improvement through our knowledge integration across these datasets. We conduct domain generalization and hard-sample ablation studies and find that our implicit knowledge discovery and integration outperforms the baseline.

44. 【2604.27715】Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Link: https://arxiv.org/abs/2604.27715

Authors: Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optimizing textual prompts, unlabeled test data, promising technique, technique for enhancing, enhancing the adaptability

Comments: CVPR 2026

Abstract:Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: this https URL.

45. 【2604.27712】Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Link: https://arxiv.org/abs/2604.27712

Authors: Nhi Ngoc-Yen Nguyen, Anh-Duc Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: integrate text visible, faithfully integrate text, captioning requires fusing, Vietnamese scene-text captioning, visual features

Comments:

46. 【2604.27704】A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images

Link: https://arxiv.org/abs/2604.27704

Authors: Yuan Fang, Yuanzhi Cai, Jagannath Aryal, Qinfeng Zhu, Hong Huang, Cheng Zhang, Lei Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remotely sensed images, large image databases, remotely sensed, sensed images, datasets

Comments:

Abstract:In the segmentation of remotely sensed images, deep learning models are typically pre-trained using large image databases like ImageNet before being fine-tuned on domain-specific datasets. However, the performance of these fine-tuned models is often hindered by the large domain gaps (i.e., differences in scenes and modalities) between ImageNet's images and remotely sensed images being processed. Therefore, many researchers have undertaken efforts to establish large-scale domain-specific image datasets for pre-training, aiming to enhance model performance. However, establishing such datasets is often challenging, requiring significant effort, and these datasets often exhibit limited generalizability to other application scenarios. To address these issues, this study introduces a novel yet simple pre-training strategy designed to guide a model away from learning domain-specific features in a pre-training dataset during pre-training, thereby improving the generalisation ability of the pre-trained model. To evaluate the strategy's effectiveness, deep learning models are pre-trained on ImageNet and subsequently fine-tuned on four semantic segmentation datasets with diverse scenes and modalities, including iSAID, MFNet, PST900 and Potsdam. Experimental results show that the proposed pre-training strategy led to state-of-the-art accuracies on all four datasets, namely 67.4% mIoU for iSAID, 56.9% mIoU for MFNet, 84.22% mIoU for PST900, 91.88% mF1 for Potsdam. This research lays the groundwork for developing a unified foundation model applicable to both computer vision and remote sensing applications.

47. 【2604.27702】RayFormer: Modeling Inter- and Intra-Ray Similarity for NeRF-Based Video Snapshot Compressive Imaging

Link: https://arxiv.org/abs/2604.27702

Authors: Yubo Dong, Danhua Liu, Anqi Li, Zhenyuan Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video snapshot compressive, snapshot compressive imaging, single snapshot measurement, Video snapshot, snapshot measurement

Comments:

Abstract:Video snapshot compressive imaging (SCI) enables the reconstruction of dynamic scenes from a single snapshot measurement. Recently, NeRF-based methods have shown promising reconstruction performance. However, such methods typically adopt random ray sampling strategies and fail to capture content structural similarities, resulting in limited reconstruction quality. To address these issues, we first propose a patch-level ray sampling strategy to enable the modeling of content structure. Then, we propose an Inter- and Intra-Ray Transformer (RayFormer) to capture the structural similarities, modeling both inter-ray similarities among spatially neighboring points at the same depth and intra-ray correlations between adjacent points along the viewing ray. Finally, benefiting from the patch-level sampling strategy, the total variation prior is incorporated into the objective function to enhance spatial smoothness and suppress artifacts. Experiments in both simulated and real-world scenes demonstrate that the proposed method achieves state-of-the-art (SOTA) reconstruction performance.
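
The total variation prior mentioned in this abstract needs spatially adjacent pixels, which is why it pairs with patch-level rather than random ray sampling. A minimal sketch of the anisotropic form on a single rendered patch follows; the function name and the nested-list patch format are assumptions, and the abstract does not say which TV variant or weighting the authors use.

```python
def total_variation(patch):
    """Anisotropic total variation of a 2D patch.

    Sums absolute differences between each pixel and its right and lower
    neighbours. Smooth patches score low; noisy or striped ones score high.
    """
    h, w = len(patch), len(patch[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(patch[i + 1][j] - patch[i][j])  # vertical neighbour
            if j + 1 < w:
                tv += abs(patch[i][j + 1] - patch[i][j])  # horizontal neighbour
    return tv
```

Added to a reconstruction loss with a small weight, a term of this kind penalises isolated artifacts while leaving constant regions untouched.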

48. 【2604.27697】Deep Learning-Based Segmentation of Peritoneal Cancer Index Regions from CT Imaging

Link: https://arxiv.org/abs/2604.27697

Authors: Pieter C. Gort, Lotte J.S. Ewals, Marion W. Tops-Welten, Cris H.B. Claessens, Joost Nederend, Fons van der Sommen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Peritoneal Cancer Index, Sugarbaker Peritoneal Cancer, determine Sugarbaker Peritoneal, Cancer Index, Sugarbaker Peritoneal

Comments: Accepted for presentation at Computer Assisted Radiology and Surgery (CARS) 2026

Abstract:Peritoneal metastases are currently assessed using diagnostic laparoscopy to determine Sugarbaker's Peritoneal Cancer Index (sPCI), which works by dividing the abdomen into 13 regions and scoring each region based on tumor size. A recent consensus study defined 3D regions to facilitate a radiological PCI (rPCI), providing standardized anatomical regions for imaging-based assessment. Despite its clinical value, sPCI is invasive and lacks a standardized imaging counterpart. In this study, we propose a deep learning-based approach to automatically segment the rPCI regions on CT. We evaluate nnU-Net and Swin UNETR on 62 CT scans with rPCI regions manually annotated by three clinical researchers and validated by two expert radiologists. Performance was assessed using five-fold cross-validation with the Dice Similarity Coefficient (Dice), 95th percentile Hausdorff distance and Average Surface Distance. nnU-Net achieved an overall Dice of 0.82, approaching interobserver agreement (0.88) and outperforming Swin UNETR (0.76), with remaining challenges primarily in right flank and small-bowel regions. These results demonstrate feasibility of automated rPCI segmentation, laying the foundation for non-invasive, imaging-based assessment.
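
The Dice Similarity Coefficient used as the headline metric here is defined as twice the overlap of two masks divided by the sum of their sizes. A minimal sketch for one region on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not taken from the paper.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size == 0:
        return 1.0  # assumed convention: two empty masks count as perfect agreement
    return 2.0 * intersection / size
```

For a multi-region task such as the 13 rPCI regions, the score would typically be computed per region and then averaged, though the abstract does not specify the aggregation.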

49. 【2604.27695】EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory

Link: https://arxiv.org/abs/2604.27695

Authors: Yuyang Li, Yime He, Zeyu Zhang, Dong Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Long-term conversational memory, single-pass retrieval fails, Long-term conversational, retrieving evidence scattered, requires retrieving evidence

Comments:

50. 【2604.27654】MSR: Hybrid Field Modeling for CT-MRI Rigid-Deformable Registration of the Cervical Spine with an Annotated Dataset

Link: https://arxiv.org/abs/2604.27654

Authors: Bohai Zhang, Wenjie Chen, Mu Li, Kaixing Long, Xing Shen, Xinqiang Yao, Jincheng Yang, Jianting Chen, Wei Yang, Qianjin Feng, Lei Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate CT-MRI registration, anatomically complex,highly variable,and, complex,highly variable,and vulnerable, Accurate CT-MRI, spinal cord

Comments:

Abstract:Accurate CT-MRI registration of the cervical spine is essential for preoperative planning because this region is anatomically complex, highly variable, and vulnerable to injury of the vertebral arteries and spinal cord. However, cervical CT-MRI registration remains underexplored, particularly for rigid-deformable hybrid modeling, and the lack of high-quality annotated multimodal data further limits progress. To address these challenges, we construct and release a comprehensively annotated CT-MRI dataset, R-D-Reg, and propose MSR, a rigid-deformable hybrid registration framework for complex joint structures. Specifically, MSR includes a rigid registration module for independent local rigid alignment of individual vertebrae and a deformable registration module with an MSL block that combines Mamba-based global modeling and Swin Transformer-based local modeling through adaptive gating. The rigid and deformable deformation fields are then fused to generate a hybrid field that better preserves local anatomical consistency. The code and dataset are publicly available at this https URL.

51. 【2604.27653】FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging

Link: https://arxiv.org/abs/2604.27653

Authors: Dahua Gao, Yubo Dong, Anqi Li, Zhenyuan Lin, Ang Gao, Danhua Liu, Guangming Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional push-broom hyperspectral, instantaneous hyperspectral images, slow acquisition speeds, push-broom hyperspectral imaging, hyperspectral imaging suffers

Notes: First work on exploring high-level computer vision tasks in compressive spectral imaging

Abstract:Conventional push-broom hyperspectral imaging suffers from slow acquisition speeds, precluding real-time object detection; in contrast, snapshot spectral imaging enables instantaneous capture of hyperspectral images (HSIs), making real-time object detection feasible, yet its potential is often compromised by time-consuming post-capture reconstruction. To address this issue, we propose the Focal U-shaped Network (FUN), a novel end-to-end framework that jointly performs HSI reconstruction and object detection via multi-task learning. FUN employs a shared U-shaped backbone, where reconstruction provides underlying spectral information while detection guides the learning of semantic-aware priors, facilitating mutually beneficial task interaction. Crucially, we introduce focal modulation, an efficient alternative to self-attention that modulates spatial and spectral features while reducing quadratic computational complexity, enabling a self-attention-free architecture for joint reconstruction and detection. Furthermore, we contribute a new HSI object detection dataset with 8712 annotated objects across 363 HSIs to facilitate evaluation of the proposed method. Experiments demonstrate that FUN achieves state-of-the-art performance on both tasks, using 40% fewer parameters and 30% less computation than recent alternatives, making it promising for future real-time edge deployment. The code and datasets are available at this https URL.

52. 【2604.27621】Robot Learning from Human Videos: A Survey

Link: https://arxiv.org/abs/2604.27621

Authors: Junyi Ma, Erhang Zhang, Haoran Yang, Ditao Li, Chenyang Xu, Guangming Wang, Hesheng Wang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical bottleneck hindering, scaling robot data, critical bottleneck, bottleneck hindering, hindering further advancement

Notes: Paper list: [this https URL](https://github.com/IRMVLab/awesome-robot-learning-from-human-videos)

Abstract:A critical bottleneck hindering further advancement in embodied AI and robotics is the challenge of scaling robot data. To address this, the field of learning robot manipulation skills from human video data has attracted rapidly growing attention in recent years, driven by the abundance of human activity videos and advances in computer vision. This line of research promises to enable robots to acquire skills passively from the vast and readily available resource of human demonstrations, substantially favoring scalable learning for generalist robotic systems. Therefore, we present this survey to provide a comprehensive and up-to-date review of human-video-based learning techniques in robotics, focusing on both human-robot skill transfer and data foundations. We first review the policy learning foundations in robotics, and then describe the fundamental interfaces to incorporate human videos. Subsequently, we introduce a hierarchical taxonomy of transferring human videos to robot skills, covering task-, observation-, and action-oriented pathways, along with a cross-family analysis of their couplings with different data configurations and learning paradigms. In addition, we investigate the data foundations including widely-used human video datasets and video generation schemes, and provide large-scale statistical trends in dataset development and utilization. Ultimately, we emphasize the challenges and limitations intrinsic to this field, and delineate potential avenues for future research. The paper list of our survey is available at this https URL.

53. 【2604.27620】SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Link: https://arxiv.org/abs/2604.27620

Authors: Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, Nanning Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: follow natural-language instructions, aims to enable, location in unseen, forward transition prediction, VLN requires endowing

Notes: Submitted to ACM MM 2026

Abstract:Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring such awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates the dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which forces the model to predict the visual transitions conditioned on history and action. These two objectives provide lightweight supervision on both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM-friendly way. To further stabilize adaptation, we design TriPA, a Tri-factor Progressive Adaptive curriculum learning method that organizes training samples from easy to hard, allowing the model to gradually acquire navigation skills from basic locomotion to long-horizon reasoning. Experiments on standard VLN-CE benchmarks show that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance. We will release the code and models to support future research.

54. 【2604.27617】Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

Link: https://arxiv.org/abs/2604.27617

Authors: Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, structural health monitoring, deep learning-based automatic, Unmanned Aerial

Notes:

Abstract:With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at this https URL.
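
As a brief illustration of the Focal Loss component named in this abstract, the sketch below uses the standard binary formulation FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and the defaults gamma = 2 and alpha = 0.25 are assumptions for illustration only; the abstract does not say which variant or hyperparameters the authors use.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for a single binary prediction (illustrative sketch).

    p: predicted probability of the positive class (e.g. "crack"), in [0, 1]
    y: ground-truth label, 1 for positive and 0 for negative
    gamma, alpha: assumed defaults, not taken from the paper
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # avoid log(0)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.95, 1)  # confidently correct positive
hard = binary_focal_loss(0.10, 1)  # badly misclassified positive
```

With gamma = 0 and alpha = 0.5 the expression reduces to half of the ordinary cross-entropy, which is a convenient sanity check. For larger gamma, easy samples contribute very little, so rare and hard samples dominate training.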

55. 【2604.27606】ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data

Link: https://arxiv.org/abs/2604.27606

Authors: Al Zadid Sultan Bin Habib, Tanpia Tasnim, Md. Ekramul Islam, Muntasir Tabasum

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning informative representations, due to heterogeneity, informative representations, environmental science, science is challenging

Notes: Accepted for presentation at the 28th International Conference on Pattern Recognition (ICPR 2026) in Lyon, France. Code available at [this https URL](https://github.com/zadid6pretam/ZAYAN). PyPI package: pip install zayan

Abstract:Learning informative representations from tabular data in remote sensing and environmental science is challenging due to heterogeneity, scarce labels, and redundancy among features. We present ZAYAN (Zero-Anchor dYnamic feAture eNcoding), a self-supervised, feature-centric contrastive framework for tabular data. ZAYAN performs contrastive learning at the feature rather than sample level, removing the need for explicit anchor selection and any reliance on class labels, while encouraging a redundancy-minimized, disentangled embedding space. The framework has two modules: ZAYAN-CL, which pretrains feature embeddings via a zero-anchor contrastive objective with dynamic perturbations and masking, and ZAYAN-T, a Transformer that conditions on these embeddings for downstream classification. Across eight datasets, including six remote-sensing tabular benchmarks and two remote-sensing-driven flood-prediction tables from satellite and GIS products, ZAYAN achieves superior accuracy, robustness, and generalization over tabular deep learning baselines, with consistent gains under label scarcity and distribution shift. These results indicate that feature-level contrastive learning and dynamic feature encoding provide an effective recipe for learning from tabular sensing data.

56. 【2604.27604】Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Link: https://arxiv.org/abs/2604.27604

Authors: Junpeng Ding, Zichen Tang, Haihong E, Mengyuan Ji, Yang Liu, Haolin Tian, Haiyang Sun, Pengqi Sun, Yang Xu, Yichen Liu, Haocheng Gao, Zijie Xi, Ruomeng Jiang, Peizhi Zhao, Rongjin Li, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Jintong Chen, Siying Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)

Keywords: Cross-Panel Relation Understanding, pairs derived, introduce SPUR, experimental image perception, Relation Understanding

Notes: Accepted to ACL 2026 Main Conference

Abstract:We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.

57. 【2604.27596】SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning

Link: https://arxiv.org/abs/2604.27596

Authors: Hezhao Liu, Jiacheng Yang, Junlong Gao, Mengke Li, Yiqun Zhang, Shreyank N Gowda, Yang Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: open-world semi-supervised learning, practical OWSSL applications, OWSSL applications, OWSSL, labeled data

Notes: Accepted by CVPR 2026

Abstract:In open-world semi-supervised learning (OWSSL), a model learns from labeled data and unlabeled data containing both known and novel classes. In practical OWSSL applications, models are expected to perform rigorous classification by directly selecting the most semantically relevant label from a candidate set for each sample. Existing OWSSL methods fail to achieve this because novel samples are trained without explicit supervision, and these methods lack mechanisms to extract latent semantic information, resulting in predicted labels that have no semantic correspondence to candidate textual labels. To address this, we introduce SEmantic Capture for Open-world Semi-supervised learning (SECOS), which directly predicts textual labels from the candidate set without post-processing, meeting the requirements of practical OWSSL applications. SECOS leverages external knowledge to extract and align semantic representations across modalities for both known and novel classes, providing explicit supervisory signals for training novel classes. Extensive experiments demonstrate that even when existing OWSSL methods are evaluated under the more lenient post-hoc matching setting, SECOS still surpasses them by up to 5.4% without such assistance, highlighting its superior effectiveness. Code is available at this https URL.

58. 【2604.27591】ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

Link: https://arxiv.org/abs/2604.27591

Authors: Ji-Hyeon Kim, Ho-Joong Kim, Seong-Whan Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Video moment retrieval, Video moment, retrieving specific segments, Video, moment retrieval

Notes: 15 pages

Abstract:Video moment retrieval is the task of retrieving specific segments of a video corresponding to a given text query. Recent studies have improved multimodal alignment through visual-linguistic similarity learning at the snippet level and transformer-based temporal boundary regression. However, existing models calculate similarity at the snippet level and ignore the relationships between multiple answer segments that match a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss for explicitly learning the semantic relationship between answer segments. ClipTBP also predicts accurate temporal boundaries by applying both main boundary loss and auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and demonstrates more robust boundary prediction performance even in ambiguous query scenarios.

59. 【2604.27590】Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering

Link: https://arxiv.org/abs/2604.27590

Authors: Davide Di Nucci, Riccardo Catalini, Guido Borghi, Roberto Vezzani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, highly realistic images, Gaussian Splatting, reconstruction and neural, neural rendering,particularly

Notes: Accepted at ICPR 2026. Code and data: [this https URL](https://github.com/iot-unimore/Fake3DGS)

Abstract:Recent advances in 3D reconstruction and neural rendering, particularly 3D Gaussian Splatting, make it feasible and simple to edit 3D scenes and re-render them as highly realistic images. Therefore, security concerns arise regarding the authenticity of 3D content. Despite this threat, 3D fake detection remains largely unexplored in the literature, and most existing work is limited to 2D space. Therefore, in this paper, we formalize the concept of 3D fake detection and introduce Fake3DGS, a dataset of 3D Gaussian splatting scenes and corresponding rendered views, where fake images are produced by controlled manipulations of geometry, appearance, and spatial layout, while preserving high visual realism. Using this benchmark, we demonstrate that current state-of-the-art 2D detectors struggle to distinguish between original and 3D manipulated images. To bridge this gap, we introduce a 3D-aware detection method that leverages multi-view coherence and features derived from the Gaussian splatting representation. Experimental results demonstrate a substantial improvement in recognizing modified 3D content, underscoring the validity of the new dataset and the necessity for authenticity assessment techniques that extend beyond 2D evidence. Code and data are publicly released for future investigations.

60. 【2604.27582】Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark

Link: https://arxiv.org/abs/2604.27582

Authors: M. Riera-Marín, O. K. Sikha, J. Rodríguez-Comas, M. S. May, T. Kirscher, X. Coubez, P. Meyer, S. Faisan, Z. Pan, X. Zhou, X. Liang, C. Hémon, V. Boussot, J.-L. Dillenseger, J.-C. Nunes, K.-C. Kahl, C. Lüth, J. Traub, P.-H. Conze, M. M. Duh, A. Aubanell, R. de Figueiredo Cardoso, S. Egger-Hackenschmidt, J. García-López, M. A. González-Ballester, A. Galdran

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pancreatic ductal adenocarcinoma, potentially curative treatment, adjacent critical vessels, Surgical resection remains, ductal adenocarcinoma

Notes:

Abstract:Surgical resection remains the only potentially curative treatment for pancreatic ductal adenocarcinoma (PDAC), and eligibility depends on accurate assessment of vascular invasion (VI), i.e., tumor extension into adjacent critical vessels. Despite its importance for preoperative staging and surgical planning, computational VI assessment remains underexplored. Two major challenges are the lack of public datasets and the diagnostic ambiguity at the tumor-vessel interface, which leads to substantial inter-rater variability even among expert radiologists. To address these limitations, we introduce the CURVAS-PDACVI Dataset and Challenge, an open benchmark for uncertainty-aware AI in PDAC staging based on a densely annotated dataset with five independent expert annotations per scan. We also propose a multi-metric evaluation framework that extends beyond spatial overlap to include probabilistic calibration and VI assessment. Evaluation of six state-of-the-art methods shows that strong global volumetric overlap does not necessarily translate into reliable performance at clinically critical tumor-vessel interfaces. In particular, methods optimized for binary segmentation perform competitively on average overlap metrics, but often degrade in high-complexity cases with low expert consensus, either collapsing in volume or overextending at uncertain boundaries. In contrast, methods that model inter-rater disagreement produce better calibrated probabilistic maps and show greater robustness in these ambiguous cases. The benchmark highlights the limitations of volumetric accuracy as a proxy for localized surgical utility, motivating uncertainty-aware probabilistic models for preoperative decision-making.

61. 【2604.27578】World2Minecraft: Occupancy-Driven Simulated Scenes Construction

Link: https://arxiv.org/abs/2604.27578

Authors: Lechao Zhang, Haoran Xu, Jingyu Gong, Xuhong Wang, Yuan Xie, Xin Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intelligence requires high-fidelity, requires high-fidelity simulation, high-fidelity simulation environments, Embodied intelligence requires, perception and decision-making

Notes:

Abstract:Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To mitigate this, we propose World2Minecraft to convert real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation (VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset featuring 100,165 images from 156 richly detailed indoor scenes. Extensive experiments show that our dataset provides a critical complement to existing datasets and poses a significant challenge to current SOTA methods. These findings contribute to improving occupancy prediction and highlight the value of World2Minecraft in providing a customizable and editable platform for personalized embodied AI research. Project page: this https URL.

62. 【2604.27559】RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

Link: https://arxiv.org/abs/2604.27559

Authors: Yucheng Chen, Yang Yu, Yufei Shi, Conghao Xiong, Xulei Yang, Si Yong Yeo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: alleviate radiologists' workload, reduce human errors, automatically generating diagnostic, generating diagnostic reports, long-form radiology reports

Notes: Accepted by Journal of Biomedical and Health Informatics (JBHI)

Abstract:Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical alignment enables more precise cross-modal mapping, essential for capturing the nuanced semantics embedded in clinical narratives. Specifically, RIHA introduces a Visual Feature Pyramid (VFP) to extract multi-scale visual features and a Text Feature Pyramid (TFP) to represent multi-granularity textual structures. These components are integrated through a Cross-modal Hierarchical Alignment (CHA) module, leveraging optimal transport to effectively align visual and textual features across various levels. Furthermore, we incorporate Relative Positional Encoding (RPE) into the decoder to model spatial and semantic relationships among tokens, enhancing the token-level alignment between visual features and generated text. Extensive experiments on two benchmark chest X-ray datasets, IU-Xray and MIMIC-CXR, demonstrate that RIHA outperforms existing state-of-the-art models in both natural language generation and clinical efficacy metrics.
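
The abstract says the CHA module uses optimal transport to align visual and textual features but does not name a solver. As a rough sketch of the general idea only, the code below computes an entropy-regularized transport plan with Sinkhorn iterations, a common choice that is assumed here rather than taken from the paper. The function name, the uniform marginals, the toy cost matrix, and the values of `eps` and `n_iters` are all hypothetical.

```python
import math

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between two uniform distributions.

    cost[i][j] is the cost of matching visual feature i to text feature j.
    Returns a transport plan whose entries act as soft alignment weights.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on visual features (assumption)
    b = [1.0 / m] * m  # uniform mass on text features (assumption)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale so that rows, then columns, match their marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost matrix: feature 0 is close to token 0, feature 1 is close to token 1.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
plan = sinkhorn_plan(cost)
```

In this toy case almost all mass ends up on the diagonal of `plan`, meaning each visual feature is softly assigned to its low-cost text counterpart while every row and column still carries its full share of mass.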

63. 【2604.27553】Revealing the Impact of Visual Text Style on Attribute-based Descriptions Produced by Large Visual Language Models

Link: https://arxiv.org/abs/2604.27553

Authors: Xiaomeng Wang, Martha Larson, Zhengyu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Visual Language, Visual Language Model, observed in font, wide variety, visual text style

Notes: Accepted by ICMR 2026. Code is available at [this https URL](https://github.com/XiaomengWang-AI/The-Impact-of-Visual-Text-style-on-Attribute-based-Descriptions-Produced-by-LVLMs)

Abstract:When the visual style of text is considered, a wide variety can be observed in font, color, and size. However, when a word is read, its meaning is independent of the style in which it has been written or rendered. In this paper, we investigate whether, and how, the style in which a word is visualized in an image impacts the description that a Large Visual Language Model (LVLM) provides for the concept to which that word refers. Specifically, we investigate how functional text styles (readability-oriented, e.g., black sans-serif) versus decorative styles (display-oriented, e.g., colored cursive/script) affect LVLMs' descriptions of a concept in terms of the attributes of that concept. Our experiments study the situation in which the LVLM is able to correctly identify the concept referred to by a visual text, i.e., by a word or words rendered as an image, and in which the visual text style should not influence the attribute-based description that the LVLM produces. Our experimental results reveal that even when the concept is correctly identified, text style influences the model's attribute-based descriptions of the concept. Our findings demonstrate non-trivial style leakage from text style into semantic inference and motivate style-aware evaluation and mitigation for LVLM-based multimedia systems.

64. 【2604.27552】Residual Gaussian Splatting for Ultra Sparse-View CBCT Reconstruction

Link: https://arxiv.org/abs/2604.27552

Authors: Jian Lin, Jiancheng Fang, Shaoyu Wang, Changan Lai, Yikun Zhang, Yang Chen, Qiegen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ultra sparse-view conditions, cone-beam computed tomography, computed tomography reconstruction, conventional photometric optimization, efficient scene representations

Notes:

Abstract:While 3D Gaussian splatting (3DGS) offers explicit and efficient scene representations for cone-beam computed tomography reconstruction, conventional photometric optimization inherently suffers from spectral bias under ultra sparse-view conditions, leading to over-smoothing and a loss of high-frequency anatomical details. Since wavelet transforms provide rich high-frequency information and have been widely utilized to enhance sparse reconstruction, this work integrates wavelet multi-resolution analysis with 3DGS. To circumvent the mathematical mismatch between the strict non-negativity of physical X-ray attenuation and the bipolar nature of high-frequency wavelet coefficients, we propose Residual Gaussian Splatting (RGS). Methodologically, we introduce a spectrally-decoupled Gaussian representation that stratifies the volumetric field into a geometric base component and a residual detail component. This decomposition systematically transforms explicit high-frequency fitting into a physically consistent, implicit residual compensation task. Furthermore, we devise a spectral-spatial collaborative optimization strategy to coordinate the interplay between geometric anchoring and texture refinement, effectively preventing spectral crosstalk. Extensive experiments on clinical datasets demonstrate that RGS enables the reconstructed images to capture highly refined geometric textures. It successfully resolves the trade-off between artifact suppression and detail preservation, yielding superior visual fidelity in complex trabecular and vascular structures compared to existing neural rendering baselines.

65. 【2604.27538】Self-Supervised Learning of Plant Image Representations

Link: https://arxiv.org/abs/2604.27538

Authors: Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Jean-Christophe Lombardo, Pierre Bonnet, Alexis Joly

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: current approaches rely, approaches rely heavily, Automated plant recognition, Automated plant, plant recognition plays

Notes:

Abstract:Automated plant recognition plays a crucial role in biodiversity monitoring and conservation, yet current approaches rely heavily on supervised learning, which is limited by the availability of expert-labeled data. Self-supervised learning (SSL) offers a scalable alternative, but existing methods and training protocols are largely designed for coarse-grained visual tasks and may not transfer well to fine-grained domains such as plant species recognition. In this work, we investigate SSL for plant image representation learning. We show that commonly used augmentations in SSL pipelines - such as Gaussian blur, grayscale conversion, and solarization - are detrimental in the context of plant images, as they remove subtle discriminative cues essential for fine-grained recognition. We instead identify alternative transformations, including affine and posterization, that are better suited to this domain. We further demonstrate that training SimDINOv2 on the iNaturalist 2021 Plantae subset yields significantly stronger representations than training on ImageNet-1K, highlighting the importance of domain-specific data for SSL. Our findings are consistent across both ViT-Base and ViT-Large architectures. Moreover, our models achieve competitive performance and sometimes outperform strong supervised baselines Pl@ntCLEF and BioCLIP on downstream plant recognition tasks in few-shot settings. Overall, our results highlight the critical importance of domain-adapted augmentation strategies and dataset selection in self-supervised learning, and provide practical guidelines for building scalable models for biodiversity monitoring.
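
To make the posterization transform mentioned in this abstract concrete, here is a minimal sketch of one common definition: keep only the most significant bits of each 8-bit channel value. The function name, the toy image, and the choice of 2 bits are assumptions for illustration; the abstract does not give the authors' augmentation parameters.

```python
def posterize(pixels, bits):
    """Keep only the `bits` most significant bits of each 8-bit value."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    mask = (0xFF << (8 - bits)) & 0xFF  # e.g. bits=2 -> 0b11000000
    return [[value & mask for value in row] for row in pixels]

# A tiny single-channel "image" with a smooth intensity gradient.
image = [[0, 37, 90, 128],
         [150, 200, 230, 255]]
poster = posterize(image, bits=2)  # only 4 intensity levels remain
```

Unlike blur or grayscale conversion, this transform leaves edges and hue relationships in place and only coarsens intensity levels, which may be part of why it is gentler on fine-grained cues.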

66. 【2604.27529】Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers

Link: https://arxiv.org/abs/2604.27529

Authors: Kaixiang Shu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cleaned feature pool, existing visualization tools, remains untested due, Spatial Funnel Hypothesis, Local Adjoint Correctors

Notes:

Abstract:A foundational assumption in CNN interpretability -- that deep encoders suppress background pixels while classifiers merely select from a cleaned feature pool (the Spatial Funnel Hypothesis) -- remains untested due to spatial hallucinations in existing visualization tools. We address this by introducing a hallucination-free inversion framework built on magnitude-phase decoupling and Local Adjoint Correctors. Our method mathematically guarantees that the spatial gradient support of every reconstruction stems strictly from genuinely active channels. Using this framework as a geometric probe, we uncover the first pixel-level evidence of strong superposition in vision encoders. We show that per-channel inversions are uniformly holographic: positive and negative weight reconstructions are visually and energetically indistinguishable. However, their algebraic sum sharply concentrates on the foreground. This proves classification operates via destructive interference -- classifier weights cancel a shared background direction in pixel space and constructively assemble class-discriminative residuals, directly falsifying the Spatial Funnel Hypothesis. This interference model identifies the volume of the admissible interference subspace as the geometric quantity governing channel requirements. We prove this volume is dual to the GAP covariance determinant, yielding a covariance-volume channel selection algorithm with a $(1-1/e)$ approximation guarantee. This algorithm mathematically reveals out-of-distribution (OOD) failure as a measurable collapse of the covariance volume essential for interference-based classification. Our framework extends seamlessly to attention-based heads without retraining.

67. 【2604.27510】FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning

Link: https://arxiv.org/abs/2604.27510

Authors: Mahad Ali,Laura J. Brattain

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables collaborative model, Federated Learning, Clustered Federated Learning, Federated Learning addresses, enables collaborative

Comments: 14 pages, 2 figures

Click to view abstract

Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, yet its performance deteriorates under statistical heterogeneity. Clustered Federated Learning addresses this challenge by grouping similar clients and training separate models per cluster. However, existing clustering strategies often rely on raw data statistics, model parameters, or heuristic similarity measures that fail to capture class-level semantic structure across heterogeneous domains and frequently require iterative coordination. We propose FMCL, a one-shot, class-aware client clustering framework that leverages foundation model representations to construct semantic client signatures. Using a frozen foundation model, FMCL computes class-level embedding prototypes for each client and measures similarity via cosine distance between their class-aware representations. Clustering is performed once prior to training, introducing no additional communication during federated optimization and remaining agnostic to the downstream model architecture. Extensive experiments across heterogeneous benchmarks demonstrate that FMCL improves federated performance and yields more stable clustering behavior compared to existing clustering-based methods under non-identically distributed data partitioning.
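
The clustering step described above can be sketched as follows. This is a minimal illustration, assuming the frozen foundation model has already produced one embedding per sample; the helper names, the averaging over shared classes, the single-linkage grouping, and the 0.3 threshold are all assumptions, since the abstract does not specify how signatures are compared or clustered.

```python
import math

def class_prototypes(embeddings, labels):
    """Mean embedding per class for one client."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def client_distance(protos_a, protos_b):
    """Average cosine distance over classes both clients hold (1.0 if none)."""
    shared = sorted(set(protos_a) & set(protos_b))
    if not shared:
        return 1.0
    return sum(cosine_distance(protos_a[c], protos_b[c]) for c in shared) / len(shared)

def one_shot_clusters(all_protos, threshold=0.3):
    """Single-linkage grouping: clients closer than `threshold` share a cluster."""
    n = len(all_protos)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if client_distance(all_protos[i], all_protos[j]) < threshold:
                parent[find(i)] = find(j)
    order = {}
    return [order.setdefault(find(i), len(order)) for i in range(n)]

# Toy 2-D embeddings for three hypothetical clients, two classes each.
clients = [
    class_prototypes([[1.0, 0.0], [0.0, 1.0]], [0, 1]),
    class_prototypes([[0.9, 0.1], [0.1, 0.9]], [0, 1]),
    class_prototypes([[0.0, 1.0], [1.0, 0.0]], [0, 1]),  # different domain
]
cluster_ids = one_shot_clusters(clients)
```

Because the grouping runs once on prototypes computed before training, it adds no communication rounds during federated optimization, which matches the one-shot property the abstract emphasizes.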

68. 【2604.27505】Leveraging Verifier-Based Reinforcement Learning in Image Editing

Link: https://arxiv.org/abs/2604.27505

Authors: Hanzhong Guo,Jie Wu,Jie Liu,Yu Gao,Zilyu Ye,Linxiao Yuan,Xionghui Wang,Yizhou Yu,Weilin Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, editing remains largely, Human Feedback, largely unexplored, pivotal paradigm

Comments:

Click to view abstract

Abstract:While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
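
To make the reward pipeline concrete, here is a small sketch of the two numeric steps the abstract mentions: combining per-principle checks into a single reward, and turning a group of rewards into GRPO-style advantages by standardizing within the group. The weighted-mean aggregation, the function names, and the example scores are assumptions; the abstract does not say how Edit-RRM aggregates its checks.

```python
import math

def aggregate_reward(principle_scores, weights=None):
    """Weighted mean of per-principle checks, each assumed to lie in [0, 1]."""
    if weights is None:
        weights = [1.0] * len(principle_scores)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, principle_scores)) / total

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifier outputs for four edits of the same instruction,
# each scored against three principles.
group_scores = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
rewards = [aggregate_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Since the advantages depend only on reward values and not on their gradients, a non-differentiable verifier can still drive the policy update, which is the property the abstract relies on.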

69. 【2604.27504】REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

Link: https://arxiv.org/abs/2604.27504

Authors: Hankyeol Lee,Wooyeol Baek,Seongdo Kim,Jongyoo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent generative models, fundamental research topic, Recent generative, shown strong performance, vision and graphics

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, using the prior's geometric cues to leverage the backbone's pretrained 3D knowledge. Furthermore, our framework supports image-conditioned 3D editing. To quantify volume and surface flatness, we propose Compactness and Normal Anisotropy. We validate Compactness and Normal Anisotropy through a user study, showing that these metrics align with human perception of volume and quality. We show that REVIVE 3D achieves state-of-the-art performance on a challenging flat image dataset, based on extensive qualitative and quantitative evaluations.

70. 【2604.27499】Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark

Link: https://arxiv.org/abs/2604.27499

Authors: Shuo Wang,Jilin Mei,Wenfei Guan,Shuai Wang,Yan Xing,Chen Min,Yu Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unreliable visible-light perception, making infrared modality, infrared modality crucial, accurate freespace detection, suffers from unreliable

Comments:

Click to view abstract

Abstract:Off-road nighttime autonomous driving suffers from unreliable visible-light perception, making infrared modality crucial for accurate freespace detection. However, progress remains limited due to the scarcity of annotated infrared off-road datasets and the inter-frame inconsistencies inherent to current single-frame methods. To address these gaps, we present the IRON dataset, which, to our knowledge, is the first large-scale infrared dataset for off-road temporal freespace detection under all-day conditions, with strong support for nighttime perception. The dataset comprises 24,314 densely annotated infrared images with synchronized RGB images in diverse scenes and different light conditions. Building upon this dataset, we propose IRONet, a novel flow-free framework for temporal freespace detection that addresses inter-frame inconsistencies by aggregating historical context via a memory-attention mechanism and a carefully designed mask decoder. On our IRON dataset, IRONet achieves state-of-the-art performance, reaching 82.93%(+1.19%) IoU and 90.66%(+0.71%) F1 score at real-time inference. Remarkably, IRONet also exhibits robust generalization to RGB modalities on ORFD and Rellis datasets. Overall, our work establishes a foundation for reliable all-day off-road autonomous driving and future research in infrared temporal perception. The code and IRON dataset are available at this https URL.

71. 【2604.27491】Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

Link: https://arxiv.org/abs/2604.27491

Authors: Mengfei Zhang,Jinlu Zhang,Zhigang Tu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: essential technology powering, technology powering virtual, human-object interaction, mixed-reality applications, compelling challenge

Comments: 10 pages

Click to view abstract

Abstract:Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks, such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. To address this, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage training strategy: the first stage performs multi-task learning on a large-scale HOI dataset to capture the underlying correlations among the three modalities, while the second stage fine-tunes the model on specific tasks to further enhance performance. Extensive experiments demonstrate that Uni-HOI achieves remarkable performance on multiple HOI-related tasks, including text-driven HOI generation, object motion-driven human motion generation (optionally with text), and human motion-driven object motion prediction, within a unified framework.

72. 【2604.27476】EdgeFM: Efficient Edge Inference for Vision-Language Models

Link: https://arxiv.org/abs/2604.27476

Authors: Mengling Deng,Yuanpeng Chen,Sheng Yang,Wei Tao,Wenhai Zhang,Hui Song,Linyuanhao Qin,Kai Zhao,Xiaojun Ye,Shanhui Mo,Jingli Fan,Shuang Zhang,Bei Liu,Tiankun Zhao,Xiangjing An

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong applicability, remains severely constrained, Vision-language models, deterministic low latency, deployment remains severely

Comments: Technical report version

Click to view abstract

Abstract:Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to a 1.49x speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

73. 【2604.27448】LA-Pose: Latent Action Pretraining Meets Pose Estimation

Link: https://arxiv.org/abs/2604.27448

Authors: Zhengqing Wang,Saurabh Nair,Prajwal Chidananda,Pujith Kachana,Samuel Li,Matthew Brown,Yasutaka Furukawa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fully supervised training, paper revisits camera, paper revisits, scalable alternative, current trend

Comments: Project page: [this https URL](https://la-pose.github.io/)

Click to view abstract

Abstract:This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie, from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.

74. 【2604.27445】Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed

Link: https://arxiv.org/abs/2604.27445

Authors: Wenqian Zhang,Zehao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including household pets, pre-verbal infants, goals through language, reliably communicate, communicate their goals

Comments: Accepted to the CVPR 2026 Animal Workshop

Click to view abstract

Abstract:Many agents in real-world environments cannot reliably communicate their goals through language, including household pets, pre-verbal infants, and other non-speaking embodied agents. In such settings, intent must be inferred from incomplete behavioral observations in context-rich environments. This creates a core ambiguity: observable behavior is often noisy or underspecified, while context provides strong prior information but can also induce brittle shortcut predictions if used naively. We present CatSignal, a Bayesian-inspired probabilistic framework for multimodal intent inference that models spatial context as a prior-like constraint and behavioral observations as evidence. Rather than treating context as an ordinary input feature, our method uses a context-gated Product-of-Experts formulation to compute posterior-like intent distributions from context, pose dynamics, and acoustic cues. We instantiate this formulation in a household cat setting as a focused proof-of-concept for intent inference in non-speaking agents. Under Leave-One-Video-Out evaluation on a multimodal domestic cat dataset, the proposed prior-guided fusion achieves the best overall accuracy of 77.72%, outperforming feature concatenation (71.83%) and stronger late-fusion baselines. More importantly, it substantially reduces context-driven shortcut failures in ambiguous cases. While simpler fusion strategies remain competitive in Macro-F1 and selective prediction, the proposed model provides the strongest overall accuracy and the best suppression of context-based shortcut collapse.
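
A Product-of-Experts fusion of the kind described above can be sketched in a few lines. The sketch treats the context distribution as a prior raised to a gate exponent and multiplies it by each behavioral expert's distribution before renormalizing. The scalar `gate`, the function name, the intent labels, and all numbers are assumptions for illustration; the paper's actual gating mechanism is not given in the abstract.

```python
import math

def product_of_experts(context_prior, expert_probs, gate=1.0, floor=1e-9):
    """Fuse a context prior with expert distributions in log space.

    posterior[k] is proportional to prior[k] ** gate * prod_e expert_e[k].
    gate=0 ignores context entirely; gate=1 uses it as a full prior.
    """
    log_scores = []
    for k in range(len(context_prior)):
        s = gate * math.log(max(context_prior[k], floor))
        for probs in expert_probs:
            s += math.log(max(probs[k], floor))
        log_scores.append(s)
    m = max(log_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical intents: 0 = hungry, 1 = wants to play, 2 = wants out.
prior_near_bowl = [0.7, 0.2, 0.1]   # context alone suggests "hungry"
pose_expert = [0.2, 0.6, 0.2]       # pose dynamics suggest "play"
audio_expert = [0.3, 0.5, 0.2]      # acoustic cues also suggest "play"

posterior = product_of_experts(prior_near_bowl, [pose_expert, audio_expert])
```

In this toy case the two behavioral experts outweigh the context prior, so the fused prediction is "play" even though context alone points to "hungry", which is the kind of shortcut suppression the abstract describes.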

75. 【2604.27437】Softmax-GS: Generalized Gaussians Learning When to Blend or Bound

Link: https://arxiv.org/abs/2604.27437

Authors: Chen Ziwen,Peng Wang,Hao Tan,Zexiang Xu,Li Fuxin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: view synthesis due, Gaussian Splatting, widely adopted, synthesis due, high training

Comments:

Click to view abstract

Abstract:3D Gaussian Splatting (3D GS) is widely adopted for novel view synthesis due to its high training and rendering efficiency. However, its efficiency relies on the key assumption that Gaussians do not overlap in the 3D space, which leads to noticeable artifacts and view inconsistencies. In addition, the inherently diffuse boundaries of Gaussians hinder accurate reconstruction of sharp object edges. We propose Softmax-GS, a unified solution that addresses both the view-inconsistency and the diffuse-boundary problem by enforcing a softmax-based competition in overlapping regions between two Gaussians. With learnable parameters controlling the strength of the competition, it enables a continuous spectrum from smooth color blending to crisp, well-defined boundaries. Our formulation explicitly preserves order invariance for any two overlapping Gaussians and ensures that the output transmittance remains unchanged irrespective of the extent of overlapping, preventing undesirable discontinuities in the rendered output. Ablation experiments on simple geometries demonstrate the effectiveness of each component of Softmax-GS, and evaluations on real-world benchmarks show that it achieves state-of-the-art performance, improving both reconstruction quality and parameter efficiency.

76. 【2604.27422】Sparse-View 3D Gaussian Splatting in the Wild

Link: https://arxiv.org/abs/2604.27422

Authors: Wongi Park,Jordan A. James,Myeongseok Nam,Minjae Lee,Soomok Lee,Sang-Hyun Lee,William J. Beksi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sparse-view synthesis framework, unconstrained real-world scenarios, real-world scenarios, unconstrained real-world, sparse-view synthesis

Comments: 18 pages, 14 figures, and 14 tables

Click to view abstract

Abstract:We propose a 3D novel sparse-view synthesis framework for unconstrained real-world scenarios that contain distractors. Unlike existing methods that primarily perform novel-view synthesis from a sparse set of constrained images without transient elements or leverage unconstrained dense image collections to enhance 3D representation in real-world scenarios, our method not only effectively tackles sparse unconstrained image collections, but also shows high-quality 3D rendering results. To do this, we introduce reference-guided view refinement with a diffusion model using a transient mask and a reference image to enhance the 3D representation and mitigate artifacts in rendered views. Furthermore, we address sparse regions in the Gaussian field via pseudo-view generation along with a sparsity-aware Gaussian replication strategy to amplify Gaussians in the sparse regions. Extensive experiments on publicly available datasets demonstrate that our methodology consistently outperforms existing methods (e.g., PSNR - 17.2%, SSIM - 10.8%, LPIPS - 4.0%) and provides high-fidelity 3D rendering results. This advancement paves the way for realizing unconstrained real-world scenarios without labor-intensive data acquisition. Our project page is available at this https URL.

77. 【2604.27414】Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Link: https://arxiv.org/abs/2604.27414

Authors: David Fernandez,Pedram MohajerAnsari,Amir Salarpour,Mert D. Pese

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: combine visual perception, physical adversarial attacks, VLM architectures, language-based reasoning, supporting more interpretable

Comments: 9 pages, 2 figures. Accepted at SAE WCX 2026

Click to view abstract

Abstract:Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making, yet their robustness to physical adversarial attacks, especially whether such attacks transfer across different VLM architectures, is not well understood and poses a practical risk when attackers do not know which model a vehicle uses. We address this gap with a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) using physically realizable patches placed on roadside infrastructure in both crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window even when patches are not optimized for the target model.

78. 【2604.27389】COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Link: https://arxiv.org/abs/2604.27389

Authors: Bingli Wang,Huanze Tang,Haijun Lv,Zhishan Lin,Lixin Gu,Lei Feng,Qipeng Guo,Kai Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, achieved remarkable progress

Comments:

Click to view abstract

Abstract:In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on the surrounding context. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.

79. 【2604.27375】VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Link: https://arxiv.org/abs/2604.27375

Authors: Yihong Guo,Youwei Lyu,Jiajun Tang,Yizhuo Zhou,Hongliang Wang,Jinwei Chen,Changqing Zou,Qingnan Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: give reasoning processes, gained significant traction, analyze image defects, precise retouching enhancements, give reasoning

Comments:

Click to view abstract

Abstract:Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at this https URL.

80. 【2604.27367】DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration

Link: https://arxiv.org/abs/2604.27367

Authors: Yang You,Won Kyung Do,Aiden Swann,Rika Antonova,Monroe Kennedy,Leonidas Guibas

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: presents significant challenges, significant challenges due, intricate optical properties, sensors presents significant, Differentiable Optical Tactile

Comments: Accepted at ICRA 2026

Click to view abstract

Abstract:Simulating optical tactile sensors presents significant challenges due to their high deformability and intricate optical properties. To address these issues and enable a physically accurate simulation, we propose DOT-Sim: Differentiable Optical Tactile Simulation. Unlike prior simulators that rely on simplified models of deformable sensors, DOT-Sim accurately captures the physical behavior of soft sensors by modeling them as elastic materials using the Material Point Method (MPM). DOT-Sim enables rapid calibration of optical tactile sensor simulation using a small number of demonstrations within minutes, which is substantially faster than existing methods. Compared to current baselines, our approach supports much larger and non-linear deformations. To handle the optical aspect, we propose a novel approach to simulating optical responses by learning a residual image relative to the real-world idle state. We validate the physical and visual realism of our method through a series of zero-shot sim-to-real tasks. Our experiments show that DOT-Sim (1) accurately replicates the physical dynamics of a DenseTact optical tactile sensor in reality, (2) generates realistic optical outputs in contact-rich scenarios, (3) enables direct deployment of simulation-trained classifiers in the real world, achieving 85% classification accuracy on challenging objects and 90% accuracy in embedded tumor-type detection, and (4) allows precise trajectory following with a policy trained from demonstrations in simulation, with an average error of less than 0.9 mm.

81. 【2604.27366】Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving

Link: https://arxiv.org/abs/2604.27366

Authors: Lijin Yang,Jianing Huang,Zhongzhan Huang,Shu Liu,Hao Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision language action, shown remarkable potential, Recent advances, mapping multimodal inputs, directly mapping multimodal

Comments: preprint

Click to view abstract

Abstract:Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering diverse driving scenarios, which enhances the critic's reasoning and refinement abilities. Extensive closed-loop experiments on the Bench2Drive benchmark show that CriticVLA significantly surpasses state-of-the-art baselines, achieving a 73.33% total success rate and delivering about 30% improvement in challenging scenarios.

82. 【2604.27364】Hyperspectral Image Classification via Efficient Global Spectral Supertoken Clustering

Link: https://arxiv.org/abs/2604.27364

Authors: Peifu Liu,Tingfa Xu,Jie Wang,Huan Chen,Huiyan Bai,Jianan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise boundary delineation, demands spatially coherent, boundary delineation, precise boundary, spatially coherent predictions

Comments: Accepted by ISPRS JPRS 2026. This manuscript version is made available under the CC-BY-NC-ND 4.0 license

Click to view abstract

Abstract:Hyperspectral image classification demands spatially coherent predictions and precise boundary delineation. Yet prevailing superpixel-based methods face an inherent contradiction: clustering aggregates similar pixels into regions, but the subsequent classifier operates pixel-wise, undermining regional consistency. Consequently, existing approaches do not guarantee region-level, boundary-aligned classification. To address this limitation, we propose the Dual-stage Spectrum-Constrained Clustering-based Classifier (DSCC), an end-to-end framework that explicitly decouples clustering from classification by first grouping spectral similar and spatially proximate pixels into spectral supertokens and then performing token-level prediction. At its core, DSCC computes an image-level multi-criteria feature distance between pixels and centers, followed by a locality-aware assignment regularization, enabling the generation of boundary-preserving spectral supertokens. A density-isolation based center selection further yields representative, well-separated centers, reducing redundancy and improving robustness to scale variation. To accommodate mixed land-cover compositions within each token, we introduce a soft-label scheme that encodes class proportions and improves robustness for mixed-class tokens. DSCC attains a CF1 of 0.728 at 197.75 FPS on the WHU-OHS dataset, offering a superior accuracy-efficiency trade-off compared with state-of-the-art methods. Extensive experiments further validate the effectiveness and generality of the proposed dual-stage paradigm for hyperspectral image classification. The source code is available at this https URL.
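
The soft-label idea for mixed-class tokens can be illustrated with a short sketch: each supertoken's target is the class proportion among the pixels assigned to it. The function name and the flat-list input layout are assumptions for illustration, and the clustering that produces `token_ids` is taken as given.

```python
def soft_labels(pixel_labels, token_ids, num_classes):
    """Return {token_id: class-proportion vector} from per-pixel labels."""
    counts = {}
    for lab, tok in zip(pixel_labels, token_ids):
        row = counts.setdefault(tok, [0] * num_classes)
        row[lab] += 1
    return {tok: [c / sum(row) for c in row] for tok, row in counts.items()}

# Six hypothetical pixels grouped into two supertokens.
pixel_labels = [0, 0, 1, 2, 2, 2]
token_ids = [0, 0, 0, 1, 1, 1]
targets = soft_labels(pixel_labels, token_ids, num_classes=3)
```

Here token 0 is a mixed token with a two-thirds share of class 0 and a one-third share of class 1, while token 1 is pure class 2, so a token-level classifier trained on these targets is not forced to pick a single hard label for a mixed region.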

83. 【2604.27361】CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

Link: https://arxiv.org/abs/2604.27361

Authors: Yingrui Wu,Youkang Kong,Mingyang Zhao,Weize Quan,Dong-Ming Yan,Yang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: remains challenging due, simultaneously enforcing global, enforcing global architectural, indoor scenes remains, local semantic consistency

Comments: SIGGRAPH 2026 (Journal Track), Code: [this https URL](https://github.com/YingruiWoo/CasLayout)

Click to view abstract

Abstract:Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.

84. 【2604.27357】AG-TAL: Anatomically-Guided Topology-Aware Loss for Multiclass Segmentation of the Circle of Willis Using Large-Scale Multi-Center Datasets

Link: https://arxiv.org/abs/2604.27357

Authors: Jialu Liu,Yue Cui,Shan Yu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate multiclass segmentation, Circle of Willis, remains challenging due, complex vascular topology, Accurate multiclass

Comments: 11 pages, 5 figures, submitted to IEEE JBHI

Click to view abstract

Abstract:Accurate multiclass segmentation of the Circle of Willis (CoW) is essential for neurovascular disease management but remains challenging due to complex vascular topology and variable morphology. Existing deep learning methods often suffer from vascular discontinuities and inter-class misclassification, while current topological loss functions incur prohibitive computational costs in 3D multiclass settings. To address these limitations, we propose an Anatomically-Guided Topology-Aware Loss (AG-TAL) and introduce a large-scale, multi-center CoW dataset with unified annotations to facilitate robust model training. AG-TAL specifically integrates a radius-aware Dice loss to address class imbalance in small vessels, a breakage-aware clDice loss that utilizes group convolutions to efficiently preserve local connectivity, and an adjacency-aware co-occurrence loss that leverages anatomical priors to enforce distinct boundaries between neighboring arteries. Evaluated using 5-fold cross-validation, AG-TAL achieved an average Dice score of 80.85% for all CoW arteries, with small arteries notably higher by 1.05-3.09% compared to state-of-the-art methods. Across six independent datasets, the performance of AG-TAL achieved Dice scores ranging from 74.46% to 81.17% for all CoW arteries, with improvements of 2.20% to 9.98% for small arteries compared to other methods. This study demonstrates the superiority of AG-TAL in identifying multiclass CoW arteries and its ability to generalize well to multiple independent datasets. Furthermore, reliability analyses and clinical applications in an Alzheimer's disease cohort validate the AG-TAL's robustness and its potential for discovering imaging-based morphological biomarkers.
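The Dice scores quoted in this abstract measure the overlap between a predicted label map and the reference annotation. As a rough illustration only, the sketch below computes a plain per-class Dice score on flattened label lists. It is not the radius-aware or breakage-aware loss proposed in the paper; the function name `dice_per_class` and the toy labels are assumptions made for this example.

```python
def dice_per_class(pred, target, num_classes):
    """Plain Dice score per class: 2 * |P ∩ T| / (|P| + |T|).

    Returns None for a class that appears in neither pred nor target.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    scores = []
    for c in range(num_classes):
        p = sum(1 for x in pred if x == c)      # voxels predicted as class c
        t = sum(1 for x in target if x == c)    # voxels annotated as class c
        inter = sum(1 for x, y in zip(pred, target) if x == c and y == c)
        scores.append(None if p + t == 0 else 2.0 * inter / (p + t))
    return scores


# Toy flattened label maps (hypothetical data, three classes).
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
scores = dice_per_class(pred, target, 3)
print(scores)
```

Because a single mislabeled voxel changes the score far more for a class with few voxels than for a large one, small structures are sensitive to this metric, which is consistent with the abstract's focus on small arteries.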

85. 【2604.27353】Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion

Link: https://arxiv.org/abs/2604.27353

Authors: Yabo Luo,Xiaoyun Wang,Cunrong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering inherent advantages, long-range identification capability, compelling biometric modality, resistance to disguise, security applications

Comments:

Click to view abstract

Abstract:Gait recognition has emerged as a compelling biometric modality for surveillance and security applications, offering inherent advantages such as non-intrusiveness, resistance to disguise, and long-range identification capability. However, prevailing approaches struggle to comprehensively capture and exploit the rich biometric cues embedded in human locomotion, particularly under covariate interference including viewpoint variation, clothing change, and carrying conditions. In this paper, we present a high-precision gait recognition framework that deeply extracts and synergistically fuses gait dynamics with body shape characteristics through a multi-branch architecture grounded in deep residual learning. Specifically, we first employ the High-Resolution Network (HRNet) to perform robust skeletal keypoint estimation, preserving fine-grained spatial information even under low-resolution inputs. We then construct three complementary feature branches -- body proportion, gait velocity, and skeletal motion -- from the extracted pose sequences. A 50-layer Residual Network (ResNet-50) backbone is leveraged within a deep feature extraction module to capture hierarchically rich and discriminative representations. To effectively integrate heterogeneous feature streams, we design a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention mechanisms, which dynamically allocates contribution weights across branches through learned activation parameters. Extensive experiments on the cross-view multi-condition CASIA-B benchmark demonstrate that our method achieves a Rank-1 accuracy of 94.52% under normal walking, with the best recognition performance among skeleton-based methods for the coat-wearing condition.

86. 【2604.27343】JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

Link: https://arxiv.org/abs/2604.27343

Authors: Phan Nguyen,Dat Cao,Quang Hien Kha,Hien Chu,Minh H. N. Le,Trang Quoc Thao Pham,Nguyen Quoc Khanh Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: early dermatological diagnosis, existing computer-aided systems, computer-aided systems rely, systems rely primarily, Skin lesion classification

Comments:

Click to view abstract

Abstract:Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose JI-ADF, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.

87. 【2604.27335】Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

Link: https://arxiv.org/abs/2604.27335

Authors: Naeem Rehmat,Muhammad Saad Saeed,Ijaz Ul Haq,Khalid Malik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: block cyber threats, prevent data exfiltration, accurate web content, filtering systems rely, cyber threats

Comments: Accepted at CVPR NeXD Workshop (2026)

Click to view abstract

Abstract:Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult due to the dynamic and rapidly evolving nature of the modern web. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification. In this paper, we propose a training-free, adaptive iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies, namely example-guided, confusion-aware, and history-aware, each refining class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at this https URL.
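The embedding-based zero-shot setup described in this abstract can be pictured as a nearest-definition lookup: the content and each category definition are embedded into the same space, and the label is the definition with the highest similarity. The sketch below is a minimal illustration under stated assumptions. The three-dimensional vectors stand in for real embeddings, cosine similarity is assumed as the similarity measure, and the names `cosine` and `classify` are hypothetical; none of this is taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(content_vec, definition_vecs):
    """Return the category whose definition embedding is most similar."""
    return max(definition_vecs,
               key=lambda name: cosine(content_vec, definition_vecs[name]))


# Hypothetical embeddings of three category definitions and one web page.
definitions = {
    "news": [1.0, 0.1, 0.0],
    "shopping": [0.1, 1.0, 0.0],
    "malware": [0.0, 0.1, 1.0],
}
page = [0.2, 0.9, 0.1]
label = classify(page, definitions)
print(label)
```

In this picture, two definitions whose embeddings lie close together give nearly equal similarities for many pages, which is the semantic overlap the abstract identifies as a source of systematic misclassification. Refining the definition text moves its embedding without touching the model.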

88. 【2604.27329】SQuadGen: Generating Simple Quad Layouts via Chart Distance Fields

Link: https://arxiv.org/abs/2604.27329

Authors: Youkang Kong,Yang Liu,Yue Dong,Xin Tong,Heung-Yeung Shum

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Chart Distance Fields, critical for efficient, editing and modeling, AI-generated content, content often lack

Comments: SIGGRAPH 2026 (Journal Track), project page: [this https URL](https://youkang-kong.github.io/squadgen/)

Click to view abstract

Abstract:3D shapes from scanning, reconstruction, or AI-generated content often lack simple quad mesh layouts -- critical for efficient editing and modeling. Existing quad-remeshing techniques typically produce complex layouts with irregular loops, leading to tedious manual cleanup and extensive algorithm tuning. We introduce SQuadGen, a diffusion-based generative framework that leverages Chart Distance Fields (CDF) to synthesize simple quad layouts on 3D shapes. Our approach addresses two key challenges: (1) the discrete nature of mesh connectivity, which hinders learning, and (2) the scarcity of large-scale datasets with simple quad meshes. To overcome the first, we propose CDF, a continuous surface-based representation enabling effective learning and synthesis of quad layouts. To address the second, we define loop-aware simplicity metrics and construct a large-scale dataset of high-quality quad layouts recovered from public 3D repositories through a robust quad-recovery pipeline. Extensive evaluations across diverse 3D inputs show that SQuadGen consistently outperforms existing methods, producing robust, artist-friendly simple quad layouts.

89. 【2604.27322】YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Link: https://arxiv.org/abs/2604.27322

Authors: Chenyang Wu,Lina Lei,Fan Li,Chun-Le Guo,Dehong Kong,Xinran Qin,Zhixin Wang,Ming-Ming Cheng,Chongyi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video object removal, video generation technologies, shown impressive results, Recent advances, Diffusion Transformer

Comments: accepted by CVPR2026

Click to view abstract

Abstract:Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE, You Only Select Essential Tokens, an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and Diffusion Process Simulator (DiffSim) Module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5X speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: this https URL.
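The core idea of mask-aware token selection can be sketched in a few lines: gather only the tokens that fall inside the mask, run the expensive computation on that shorter sequence, and scatter the results back. The example below is a plain-Python illustration of that gather/scatter pattern with a different number of selected tokens per sample. It is not the paper's BVI operator (which the abstract describes as differentiable) and it ignores the DiffSim module entirely; the function names and the `upper()` stand-in for the model are assumptions.

```python
def select_masked_tokens(tokens, mask):
    """Gather the tokens whose mask flag is True, plus their positions."""
    indices = [i for i, keep in enumerate(mask) if keep]
    return [tokens[i] for i in indices], indices


def scatter_back(tokens, processed, indices):
    """Write processed tokens back into a copy of the full sequence."""
    out = list(tokens)
    for i, tok in zip(indices, processed):
        out[i] = tok
    return out


# Two samples with different mask sizes (hypothetical toy data).
batch_tokens = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
batch_masks = [[False, True, True, False], [False, False, False, True]]

results = []
for tokens, mask in zip(batch_tokens, batch_masks):
    selected, indices = select_masked_tokens(tokens, mask)
    processed = [t.upper() for t in selected]  # stand-in for the costly model
    results.append(scatter_back(tokens, processed, indices))
print(results)
```

The work done in the loop is proportional to the number of masked tokens rather than the full sequence length, which matches the abstract's claim that inference time scales roughly linearly with the masked region.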

90. 【2604.27313】PINN-Cast: Exploring the Role of Continuous-Depth NODE in Transformers and Physics Informed Loss as Soft Physical Constraints in Short-term Weather Forecasting

Link: https://arxiv.org/abs/2604.27313

Authors: Hira Saleem,Flora Salim,Cormac Purcell

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Operational weather prediction, complex simulation workflows, Operational weather, simulation workflows, weather prediction

Comments: 14 pages, 4 Figures, Accepted in 26th International Conference on Computational Science (ICCS 2026)

Click to view abstract

Abstract:Operational weather prediction has long relied on physics-based numerical weather prediction (NWP), whose accuracy comes at the cost of substantial compute and complex simulation workflows. Recent transformer-based forecasters offer efficient data-driven alternatives, however transformers are physics-agnostic models. Additionally, standard transformer encoders evolve representations through discrete layer updates that may be less suited to modeling smooth latent dynamics. In this work, we propose a continuous-depth transformer encoder for weather forecasting that integrates Neural Ordinary Differential Equation (Neural ODE) dynamics within each encoder block. Specifically, we replace discrete residual updates with ODE-based updates solved using adaptive numerical integration. We also introduce a two-branch attention module that combines conventional patch-wise self-attention with an auxiliary branch that applies a derivative operator to attention logits, providing an additional change-sensitive interaction signal. To further align forecasts with governing principles, we propose a customized physics-informed training objective that enforces physical consistency as a soft constraint. We evaluate the proposed method against a standard discrete transformer baseline and an existing continuous-time Neural ODE forecasting variant, demonstrating the importance of PINN-Cast in short term weather forecasting.
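The link between a discrete residual update and an ODE-based update can be seen in a small numerical example: a residual step `h + f(h)` is one explicit Euler step of size 1 for `dh/dt = f(h)`, and taking more, smaller steps follows the continuous trajectory more closely. The sketch below uses a fixed-step Euler integrator and toy linear dynamics purely for illustration. The paper's abstract states that adaptive numerical integration is used, so this is a simplification, and the names `euler_integrate` and `f` are assumptions.

```python
import math


def euler_integrate(f, h0, t0, t1, steps):
    """Integrate dh/dt = f(h) from t0 to t1 with fixed-step explicit Euler."""
    h = list(h0)
    dt = (t1 - t0) / steps
    for _ in range(steps):
        dh = f(h)
        h = [x + dt * d for x, d in zip(h, dh)]
    return h


def f(h):
    """Toy dynamics dh/dt = -h, whose exact solution is h0 * exp(-t)."""
    return [-x for x in h]


h0 = [1.0, 2.0]
residual_like = euler_integrate(f, h0, 0.0, 1.0, 1)   # same as h0 + f(h0)
fine_grained = euler_integrate(f, h0, 0.0, 1.0, 1000)
exact = [x * math.exp(-1.0) for x in h0]
print(residual_like, fine_grained, exact)
```

With a single step the update coincides with the residual form and lands far from the exact solution, while the 1000-step result is close to it. An adaptive solver chooses the step size automatically to meet an error tolerance instead of fixing the step count in advance.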

91. 【2604.27293】Student Classroom Behavior Recognition Based on Improved YOLOv8s

Link: https://arxiv.org/abs/2604.27293

Authors: Xiang Gao,Shuai Hang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: teaching quality analysis, quality analysis, teaching quality, great significance, student classroom behavior

Comments:

Click to view abstract

Abstract:In classroom teaching, student behavior can reflect their learning state and classroom participation, which is of great significance for teaching quality analysis. To address the problems of dense student targets, numerous small objects, frequent occlusions, and imbalanced class distribution in real classroom scenes, this paper proposes an improved student classroom behavior recognition model named ALC-YOLOv8s based on YOLOv8s. The model introduces SPPF-LSKA to enhance contextual feature extraction, employs CFC-CRB and SFC-G2 to optimize multi-scale feature fusion, and incorporates ATFLoss to improve the learning ability for minority classes and hard samples. Experimental results show that compared with the baseline model, the improved model achieves increases of 1.8% in mAP50 and 2.1% in mAP50-95. Compared with several mainstream detection methods, the proposed model can well meet the requirements of automatic student behavior recognition in complex classroom scenarios.
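For readers unfamiliar with the metrics, mAP50 counts a detection as a match when its box overlaps a ground-truth box with intersection-over-union (IoU) of at least 0.5, while mAP50-95 averages over IoU thresholds from 0.5 to 0.95. The sketch below shows only the box IoU computation and the 0.5 test, not a full mAP evaluation. The `(x1, y1, x2, y2)` box format, the function name `box_iou`, and the example boxes are assumptions made for this illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
loose_box = (5.0, 0.0, 15.0, 10.0)   # overlaps half of the ground truth
tight_box = (2.0, 0.0, 12.0, 10.0)   # overlaps most of the ground truth

iou_loose = box_iou(ground_truth, loose_box)
iou_tight = box_iou(ground_truth, tight_box)
print(iou_loose >= 0.5, iou_tight >= 0.5)
```

The loose box has IoU 1/3 and would not count at the 0.5 threshold, while the tight box has IoU 2/3 and would. Small, densely packed targets such as those described in the abstract are harder to match because a shift of a few pixels changes IoU substantially.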

92. 【2604.27277】BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

Link: https://arxiv.org/abs/2604.27277

Authors: Yizhou Wu,Shansong Wang,Yuheng Li,Mojtaba Safari,Mingzhe Hu,Chih-Wei Chang,Harini Veeraraghavan,Xiaofeng Yang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: substantial labeled data, learning-based methods remain, require substantial labeled, Brain MRI underpins, methods remain task-specific

Comments: 22 pages, 5 figures

Click to view abstract

Abstract:Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis.

93. 【2604.27259】VTBench: A Multimodal Framework for Time-Series Classification with Chart-Based Representations

Link: https://arxiv.org/abs/2604.27259

Authors: Madhumitha Venkatesan,Xuyang Chen,Dongyu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Gramian Angular Fields, overlooking alternative representations, models rely solely, deep learning, advanced significantly

Comments: 8 pages main text

Click to view abstract

Abstract:Time-series classification (TSC) has advanced significantly with deep learning, yet most models rely solely on raw numerical inputs, overlooking alternative representations. While texture-based encodings such as Gramian Angular Fields (GAF) and Recurrence Plots (RP) convert time series into 2D images, they often require heavy preprocessing and yield less intuitive representations. In contrast, chart-based visualizations offer more interpretable alternatives and show promise in specific domains; however, their effectiveness remains underexplored, with limited systematic evaluation across chart types, visual encoding choices, and datasets. In this work, we introduce VTBench, a systematic and extensible framework that re-examines TSC through multimodal fusion of raw sequences and chart-based visualizations. VTBench generates lightweight, human-interpretable plots -- line, area, bar, and scatter, providing complementary views of the same signal. We develop a modular architecture supporting multiple fusion strategies, including single-chart visual-numerical fusion, multi-chart visual fusion, and full multimodal fusion with raw inputs. Through experiments across 31 UCR datasets, we show that: (1) chart-only models are competitive in selected settings, particularly on smaller datasets; (2) combining multiple chart types can improve accuracy by capturing complementary visual cues; and (3) multimodal models improve or maintain performance when visual features provide non-redundant information, but may degrade accuracy when they introduce redundancy. We further distill practical guidelines for selecting chart types, fusion strategies, and configurations. VTBench establishes a unified foundation for interpretable and effective multimodal time-series classification.

94. 【2604.27247】Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

Link: https://arxiv.org/abs/2604.27247

Authors: Thorsten Hoeser,Verena Huber-Garcia,Sarah Asam,Ursula Gessner,Claudia Kuenzer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intensively managed agricultural, valuable ecosystem services, linear woody feature, managed agricultural landscapes, linear woody

Comments: 31 pages, 14 figures

Click to view abstract

Abstract:Hedges and other linear woody features provide valuable ecosystem services, particularly within intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity amongst others not only due to a largely varying flora, but also as a feeding-, resting-, and nesting place for many animals and insects including valuable pollinators. Therefore, they require dedicated management, preservation, and attention. Thus, systematic and large-scale mapping of these features from Earth observation data is of high importance. However, transferable and reusable workflows for linear woody feature mapping remain a key methodological challenge, given the diversity of sensor types, spatial resolutions, data acquisition conditions, and complex landscape variability encountered across study areas. We introduce a modular workflow built around two independently optimizable components. Firstly, a flexible input data interface that consolidates heterogeneous Earth observation data into a binary woody vegetation mask, and secondly, a deep neural network trained to separate linear from non-linear shapes within these masks. We demonstrate the workflow by deriving three national-scale linear woody feature maps for all of Germany from three input sources by using a single trained model without retraining. Evaluation against refined reference data from four federal state biotope mapping campaigns and comparison with two existing linear woody feature maps demonstrate that the workflow produces competitive results across all evaluation sites on a national level. The modular design and its demonstrated applicability at national scale provide a foundation for scalable and generalizable linear woody feature mapping beyond Germany.

95. 【2604.27218】AttriBE: Quantifying Attribute Expressivity in Body Embeddings for Recognition and Identification

Link: https://arxiv.org/abs/2604.27218

Authors: Basudha Pal,Siyuan Huang,Anirudh Nanduri,Zhaoyang Wang,Rama Chellappa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems that match, real-world applications, match individuals, individuals across images, images or video

Comments:

Click to view abstract

Abstract:Person re-identification (ReID) systems that match individuals across images or video frames are essential in many real-world applications. However, existing methods are often influenced by attributes such as gender, pose, and body mass index (BMI), which vary in unconstrained settings and raise concerns related to fairness and generalization. To address this, we extend the notion of expressivity, defined as the mutual information between learned features and specific attributes, using a secondary neural network to quantify how strongly attributes are encoded. Applying this framework to three transformer-based ReID models on a large-scale visible-spectrum dataset, we find that BMI consistently shows the highest expressivity in deeper layers. Attributes in the final representation are ranked as BMI > Pitch > Gender > Yaw, and expressivity evolves across layers and training epochs, with pose peaking in intermediate layers and BMI strengthening with depth. We further extend the analysis to cross-spectral person identification across infrared modalities including short-wave, medium-wave, and long-wave infrared. In this setting, pitch becomes comparable to BMI and attribute trends increase monotonically across depth, suggesting increased reliance on structural cues when bridging modality gaps. Overall, the results show that transformer-based ReID embeddings encode a hierarchy of implicit attributes, with morphometric information persistently embedded and pose contributing more strongly under cross-spectral conditions.
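The abstract defines expressivity as the mutual information between learned features and an attribute. Written out for a feature variable F and an attribute A (symbols chosen here for illustration, not taken from the paper), the standard definition of mutual information is:

```latex
I(F; A) = \mathbb{E}_{p(f, a)}\left[ \log \frac{p(f, a)}{p(f)\, p(a)} \right]
```

This quantity is zero when the features and the attribute are independent and grows as the features carry more information about the attribute. The abstract states that a secondary neural network is used to quantify it; the specific estimator is not described there, so none is assumed here.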

96. 【2604.27206】HQ-UNet: A Hybrid Quantum-Classical U-Net with a Quantum Bottleneck for Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2604.27206

Authors: Md Aminur Hossain,Ayush V. Patel,Ikshwaku Vanani,Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex spatial relationships, model complex spatial, classical deep learning, deep learning architectures, Semantic segmentation

Comments: 6 pages

Click to view abstract

Abstract:Semantic segmentation in remote sensing is commonly addressed using classical deep learning architectures such as U-Net, which require a large number of parameters to model complex spatial relationships. Quantum machine learning (QML) provides an alternative representation paradigm by mapping classical features into quantum states, but its direct application to high-dimensional images remains challenging under near-term quantum hardware constraints. In this work, we propose HQ-UNet, a hybrid quantum-classical U-Net architecture that integrates a compact parameterized quantum circuit at the bottleneck of a classical U-Net. The proposed design uses a non-pooling quantum convolutional module to enrich highly compressed encoder features before decoding, while keeping the quantum component shallow and parameter-efficient. Experiments on the this http URL dataset show that HQ-UNet achieves a mean IoU of 0.8050 and an overall accuracy of 94.76%, outperforming the classical U-Net baseline. These results suggest that compact quantum bottlenecks can enhance feature representation for remote sensing image segmentation under near-term quantum constraints. This highlights the potential of hybrid quantum-classical designs as a promising direction for parameter-efficient dense prediction in Earth observation.

97. 【2604.27178】Energy-Efficient Plant Monitoring via Knowledge Distillation

Link: https://arxiv.org/abs/2604.27178

Authors: Ilyass Moummad,Reda Bensaid,Kawtar Zaher,Hervé Goëau,Jean-Christophe Lombardo,Joseph Salmon,Pierre Bonnet,Alexis Joly

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale visual representation, visual representation learning, Recent advances, advances in large-scale, large-scale visual

Comments:

Click to view abstract

Abstract:Recent advances in large-scale visual representation learning have significantly improved performance in plant species and plant disease recognition tasks. However, state-of-the-art models, often based on high-capacity vision transformers or multimodal foundation models, remain computationally expensive and difficult to deploy in resource-constrained environments such as mobile or edge devices. This limitation hinders the scalability of automated biodiversity monitoring and precision agriculture systems, where efficiency is as critical as accuracy. In this work, we investigate knowledge distillation as an effective approach to transfer the representational capacity of large pretrained models into smaller, more efficient architectures. We focus on plant species and disease recognition, and conduct an extensive empirical study on two challenging benchmarks: Pl@ntNet300K-v2 and Deep-Plant-Disease. We evaluate four representative architectures, including two ConvNeXt models and two vision transformers, under multiple training regimes: from-scratch training and pretrained initialization, each with and without distillation. In total, we train and evaluate 70 models. Our results show that knowledge distillation consistently improves performance across tasks and architectures. Distilled models are able to match the performance of significantly larger models while maintaining substantially lower computational cost. These findings demonstrate the potential of knowledge distillation techniques to enable efficient and scalable deployment of plant recognition systems in real-world environmental applications.
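The abstract does not say which distillation objective is used, so the following is only a generic sketch of the classic softened-logit formulation: the student is trained to match the teacher's temperature-softened class distribution, measured by KL divergence and scaled by the squared temperature. The function names, the temperature value, and the toy logits are assumptions made for this illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class plant recognition example.
teacher = [4.0, 1.0, 0.5]
student_far = [0.5, 1.0, 4.0]
student_close = [3.5, 1.2, 0.4]

loss_far = distillation_loss(student_far, teacher)
loss_close = distillation_loss(student_close, teacher)
print(loss_far, loss_close)
```

In practice this term is usually combined with the ordinary cross-entropy on the ground-truth labels. A higher temperature exposes more of the teacher's relative preferences among the non-target classes, which is the extra signal a small student can learn from.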

98. 【2604.27128】Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

Link: https://arxiv.org/abs/2604.27128

Authors: Haiyu Yang,Miel Hostens

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: individual-level livestock monitoring, precision livestock farming, combining open-vocabulary detection, promptable video segmentation, commodity edge accelerators

Comments:

Click to view abstract

Abstract:Foundation-model pipelines for individual-level livestock monitoring -- combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings -- have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; the ViT-S (21M) variant is adopted as the per-individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52GB to 6.49GB), and reaches 97.34% top-1 accuracy with 91.67% macro-F1 on nine-class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed -- but not yet empirically validated -- on-device embedding-pool re-identification mechanism whose per-individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.
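Several figures in this abstract can be cross-checked with simple arithmetic. The short sketch below recomputes the peak-VRAM reduction from the two memory values and derives the teacher's implied tracking scores. It assumes that the quoted losses are absolute percentage-point differences between teacher and compressed pipeline, as the wording suggests; the variable names are chosen for this example.

```python
# Peak VRAM before and after compression, as quoted in the abstract (GB).
vram_before_gb = 19.52
vram_after_gb = 6.49
vram_fold = vram_before_gb / vram_after_gb

# Compressed-pipeline scores and the quoted percentage-point losses.
student_mota, mota_loss = 92.29, 1.68
student_idf1, idf1_loss = 96.15, 0.84

# Implied teacher scores, assuming the losses are absolute differences.
teacher_mota = student_mota + mota_loss
teacher_idf1 = student_idf1 + idf1_loss

print(round(vram_fold, 2), round(teacher_mota, 2), round(teacher_idf1, 2))
```

The ratio rounds to 3.01, consistent with the stated fold reduction, and the implied teacher scores come out at 93.97% MOTA and 96.99% IDF1.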

99. 【2604.27122】InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification

Link: https://arxiv.org/abs/2604.27122

Authors: Shakeeb Murtaza,Aryan Shukla,Rajarshi Bhattacharya,Maguelonne Heritier,Eric Granger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural-language text description, retrieve top matching, top matching individuals, person re-identification, relies on natural-language

Comments:

Click to view abstract

Abstract:Text-to-image person re-identification (TI-ReID) relies on natural-language text description to retrieve top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary, lightweight supervision, patch-phrase interaction module (PPIM) is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks like CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy. (Footnote: Our code is included in the supplementary materials and will be made public.)

100. 【2604.27106】Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

Link: https://arxiv.org/abs/2604.27106

Authors: Andrii Zadaianchuk, Leonardo Barcellona, Lennard Schuenemann, Christian Gumbsch, Zehao Wang, Muhammad Zubair Irshad, Fabien Despinoy, Rahaf Aljundi, Stratis Gavves, Sergey Zakharov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Accurately reconstructing complex, sparse observations remains, full multi-object scenes, Accurately reconstructing, reconstructing complex full

Comments: Website: [this https URL](https://reconstruction-by-generation.github.io)

Abstract:Accurately reconstructing complex full multi-object scenes from sparse observations remains a core challenge in computer vision and a key step toward scalable and reliable simulation for robotics. In this work, we introduce RecGen, a generative framework for probabilistic joint estimation of object and part shapes, as well as their pose under occlusion and partial visibility from one or multiple RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. RecGen achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture. Despite using nearly 80% fewer training meshes than the previous state-of-the-art method, SAM3D, RecGen outperforms it by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation.

101. 【2604.27105】Automated Detection of Mutual Gaze and Joint Attention in Dual-Camera Settings via Dual-Stream Transformers

Link: https://arxiv.org/abs/2604.27105

Authors: Jakub Kosmydel, Paweł Gajewski, Arkadiusz Białek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Analyzing mutual gaze, labor-intensive manual coding, Analyzing mutual, mutual gaze, joint attention

Comments:

Abstract:Analyzing mutual gaze (MG) and joint attention (JA) is critical in developmental psychology but traditionally relies on labor-intensive manual coding. Automating this process in multi-camera laboratory settings is computationally challenging due to complex cross-camera relational dynamics. In this paper, we propose a highly efficient dual-stream Transformer architecture for detecting MG and JA from synchronized dual-camera recordings. Our approach leverages frozen gaze-aware backbones (GazeLLE) to extract rich visual priors, combined with a custom token fusion mechanism to map the spatial and semantic relationships between interacting dyads. Evaluated on an ecologically valid dataset of caregiver-infant interactions, our model exhibits good performance, significantly outperforming both a convolutional baseline and a state-of-the-art multimodal Large Language Model (LLM). By open-sourcing our model and pre-trained weights, we provide behavioral scientists with a scalable tool that can be fine-tuned to diverse laboratory environments, effectively bridging the gap between computational modeling and applied interaction research.

102. 【2409.16808】Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices

Link: https://arxiv.org/abs/2409.16808

Authors: Daghash K. Alqahtani, Aamir Cheema, Adel N. Toosi

Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: Jetson Orin Nano, Modern applications, resource-constrained edge devices, object detection models, autonomous vehicles

Comments:

Abstract:Modern applications, such as autonomous vehicles, require deploying deep learning algorithms on resource-constrained edge devices for real-time image and video processing. However, there is limited understanding of the efficiency and performance of various object detection models on these devices. In this paper, we evaluate state-of-the-art object detection models, including YOLOv8 (Nano, Small, Medium), EfficientDet Lite (Lite0, Lite1, Lite2), and SSD (SSD MobileNet V1, SSDLite MobileDet). We deployed these models on popular edge devices like the Raspberry Pi 3, 4, and 5 with/without TPU accelerators, and Jetson Orin Nano, collecting key performance metrics such as energy consumption, inference time, and Mean Average Precision (mAP). Our findings highlight that lower mAP models such as SSD MobileNet V1 are more energy-efficient and faster in inference, whereas higher mAP models like YOLOv8 Medium generally consume more energy and have slower inference, though with exceptions when accelerators like TPUs are used. Among the edge devices, Jetson Orin Nano stands out as the fastest and most energy-efficient option for request handling, despite having the highest idle energy consumption. These results emphasize the need to balance accuracy, speed, and energy efficiency when deploying deep learning models on edge devices, offering valuable guidance for practitioners and researchers selecting models and devices for their applications.

103. 【2102.05231】Culture-inspired Multi-modal Color Palette Generation and Colorization: A Chinese Youth Subculture Case

Link: https://arxiv.org/abs/2102.05231

Authors: Yufan Li, Jinggang Zhuo, Ling Fan, Harry Jiannan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: carrying cultural implications, Chinese Youth Subculture, graphic design, essential component, component of graphic

Comments: accepted by the 3rd IEEE Workshop on Artificial Intelligence for Art Creation

Abstract:Color is an essential component of graphic design, acting not only as a visual factor but also carrying cultural implications. However, existing research on algorithmic color palette generation and colorization largely ignores the cultural aspect. In this paper, we contribute to this line of research by first constructing a unique color dataset inspired by a specific culture, i.e., Chinese Youth Subculture (CYS), which is a vibrant and trending cultural group especially for the Gen Z population. We show that the colors used in CYS have special aesthetic and semantic characteristics that are different from generic color theory. We then develop an interactive multi-modal generative framework to create CYS-styled color palettes, which can be used to put a CYS twist on images using our automatic colorization model. Our framework is illustrated via a demo system designed with the human-in-the-loop principle that constantly provides feedback to our algorithms. User studies are also conducted to evaluate our generation results.

104. 【2604.27721】Physically-Informed Fuzzy Clustering of Vertical Sounding Ionograms

Link: https://arxiv.org/abs/2604.27721

Authors: Oleg I.Berngardt, Sergey N.Ponomarchuk

Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an); Space Physics (physics.space-ph)

Keywords: physically-informed fuzzy clustering, vertical sounding ionograms, number, number of tracks, optimal number

Comments: 31 pages, 8 figures

Abstract:This paper presents a physically-informed fuzzy clustering of vertical sounding ionograms for automatically separating the ionogram into tracks suitable for further interpretation and determining their optimal number. The model is designed for use not only in conditions where the number of tracks is known, but also in disturbed ionospheric conditions where the number of tracks is not known in advance. The method is based on an expectation-maximization algorithm, used for clustering, and on parametrically specified distributions of distances from points to parametrically specified curves. The curves used as track models are close to model tracks in the parabolic ionospheric layer model. The resulting model of each track has six parameters: three standard ones (the critical frequency, the lower boundary of the layer, and its half-width), and three additional ones to take into account possible underlying layer effects. By sequentially increasing the number of tracks and optimizing their parameters, the model finds the optimal number of tracks on the ionogram by minimizing the modified Bayesian information criterion. The Sequential Least Squares Quadratic Programming algorithm is used to find the parameters of a single track. The width of each single track is assumed to be an unknown constant that is found during the fitting process. To improve the quality of ionogram clustering, automatic adaptive noise filtering is performed before clustering. This filtering is based on a combination of the DBSCAN and Gaussian Mixture algorithms. Also, to improve clustering quality on an ionosonde without hardware separation of the ordinary and extraordinary components, a preliminary approximate removal of points belonging to the extraordinary mode is performed.
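
The model-selection loop described in the abstract (add tracks one at a time, keep the count that minimizes an information criterion) can be sketched as follows. The abstract refers to a modified Bayesian information criterion whose exact form is not given here, so this sketch swaps in the standard BIC, k ln(n) - 2 ln(L). The callable `fit`, the `extra_params` argument, and the log-likelihood numbers are hypothetical placeholders, not values from the paper.

```python
import math

PARAMS_PER_TRACK = 6  # the abstract states six parameters per track

def bic(log_likelihood, n_tracks, n_points, extra_params=0):
    """Standard BIC, k * ln(n) - 2 * ln(L); lower is better.

    `extra_params` is a hypothetical slot for parameters shared across
    tracks (for example mixture weights), which the abstract does not detail.
    """
    k = n_tracks * PARAMS_PER_TRACK + extra_params
    return k * math.log(n_points) - 2.0 * log_likelihood

def select_num_tracks(fit, n_points, max_tracks):
    """Increase the track count step by step and return the BIC minimizer.

    `fit(m)` is a hypothetical callable that returns the maximized
    log-likelihood of a model with m tracks (e.g. from an EM run).
    """
    best_m, best_score = None, math.inf
    for m in range(1, max_tracks + 1):
        score = bic(fit(m), m, n_points)
        if score < best_score:
            best_m, best_score = m, score
    return best_m

# Made-up log-likelihoods: the gain flattens out after two tracks.
fake_loglik = {1: -500.0, 2: -300.0, 3: -295.0, 4: -294.0}
print(select_num_tracks(fake_loglik.get, n_points=200, max_tracks=4))
```

The penalty term grows by six parameters per added track, so a third or fourth track is only accepted when it improves the likelihood by more than that penalty.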

105. 【2604.27593】An Extended Evaluation Split for DeepSpaceYoloDataset

Link: https://arxiv.org/abs/2604.27593

Authors: Olivier Parisot

Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent technological advances, major scientific observatories, develop highly effective, Electronically Assisted Astronomy, Deep Sky Objects

Comments: 9 pages, 5 figures

Abstract:Recent technological advances in astronomy, particularly the growing popularity of smart telescopes for the general public, make it possible to develop highly effective detection solutions that are accessible to a wide audience, rather than being reserved for major scientific observatories. Published in 2023, DeepSpaceYoloDataset is a collection of annotated images created to train YOLO-based models for detecting Deep Sky Objects, particularly suited for Electronically Assisted Astronomy. In this paper, we present an update to DeepSpaceYoloDataset with the addition of a new split, test2026, designed to evaluate detection models with a greater diversity of images.

106. 【2604.27383】A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation

Link: https://arxiv.org/abs/2604.27383

Authors: Yang Zhou, Chaoyong Zhang, Ruoyi Hao, Huilin Pan, Yang Zhang, Hongliang Ren

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: patient airway patency, maintaining patient airway, Nasotracheal intubation, critical clinical procedure, airway patency

Comments: 14 pages, 9 figures

Abstract:Nasotracheal intubation (NTI) is a critical clinical procedure for establishing and maintaining patient airway patency. Machine-assisted NTI has emerged as a pivotal approach for optimizing procedural efficiency and minimizing manual intervention. However, visual detection algorithms employed for NTI navigation encounter significant challenges, including complex anatomical environments and suboptimal illumination conditions surrounding the glottis. Additionally, the glottis presents considerable scale variability throughout the procedure, initially appearing as a small, difficult-to-capture structure before expanding to occupy nearly the entire field of view. Moreover, traditional visual detection methods often have high computational costs, making real-time, high-precision detection on portable devices challenging. To enhance NTI efficacy and address these challenges, this paper proposes a novel glottis segmentation framework optimized for vision-assisted NTI applications. First, we designed a lightweight, multi-receptive field feature extraction module to reduce intra-class differences, achieving robustness to scale variations of the glottis. This module was then stacked to form the backbone and neck of our network. Subsequently, we developed an advanced label assignment method and redefined the number of samples to further reduce intra-class differences and enhance accuracy in the complex NTI environment. Experiments on three distinct datasets demonstrate that our network surpasses state-of-the-art algorithms, achieving a segmentation mDice of 92.9% with a compact model size of 19 MB and an inference speed exceeding 170 frames per second. Our code and datasets are available at this https URL.

107. 【2604.27326】Spectral Dynamic Attention Network for Hyperspectral Image Super-Resolution

Link: https://arxiv.org/abs/2604.27326

Authors: Tengya Zhang, Feng Gao, Lin Qi, Junyu Dong, Qian Du

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Hyperspectral image super-resolution, existing deep learning, deep learning methods, HSI data, Hyperspectral image

Comments: Accepted for publication in IEEE GRSL 2026

Abstract:Hyperspectral image super-resolution is essential for enhancing the spatial fidelity of HSI data, yet existing deep learning methods often struggle with substantial spectral redundancy and the limited non-linear modeling capacity of standard feed-forward networks (FFNs). To address these challenges, we propose Spectral Dynamic Attention Network (SDANet), a framework designed to adaptively suppress redundant spectral interactions. SDANet integrates two key components: 1) Dynamic Channel Sparse Attention (DCSA) module that computes channel-wise correlations and selectively preserves the most informative attention responses through dynamic and data-dependent sparsification. 2) Frequency-Enhanced Feed-Forward Network (FE-FFN) that jointly models spatial and frequency-domain representations to enhance non-linear expressiveness. Extensive experiments on two benchmark datasets demonstrate that SDANet achieves state-of-the-art HISR performance while maintaining competitive efficiency. The code will be made publicly available at this https URL.
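
As a rough illustration of channel-wise attention with sparsified responses, the sketch below computes channel-to-channel scores and keeps only the largest ones before the softmax. The abstract describes a dynamic, data-dependent sparsification whose rule is not given here, so a fixed top-k cutoff is used as a stand-in. The function names and the `keep_k` parameter are hypothetical, and this is not the SDANet implementation.

```python
import math

def topk_sparse_softmax(scores, keep_k):
    """Softmax over the keep_k largest scores; other positions get weight 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:keep_k]
    peak = max(scores[i] for i in keep)  # subtract the max for stability
    exps = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

def channel_sparse_attention(q, k_mat, v, keep_k):
    """Channel-wise attention: q, k_mat, v are lists of C channel vectors,
    each of length N (flattened pixels). Scores are C x C, not N x N."""
    n = len(q[0])
    out = []
    for qc in q:
        scores = [sum(a * b for a, b in zip(qc, kc)) / math.sqrt(n)
                  for kc in k_mat]
        w = topk_sparse_softmax(scores, keep_k)
        out.append([sum(w[j] * v[j][p] for j in range(len(v)))
                    for p in range(n)])
    return out

# Two channels, two pixels; with keep_k=1 each channel attends only to itself.
q = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(channel_sparse_attention(q, q, v, keep_k=1))
```

Attending over channels rather than pixels keeps the score matrix at C x C, and zeroing the weaker responses is one simple way to suppress redundant spectral interactions.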

108. 【2604.27323】Representative Spectral Correlation Network for Multi-source Remote Sensing Image Classification

Link: https://arxiv.org/abs/2604.27323

Authors: Chuanzheng Gong, Feng Gao, Junyan Lin, Junyu Dong, Qian Du

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: LiDAR data offer, data offer complementary, offer complementary spectral, Hyperspectral image, offer complementary

Comments: Accepted for publication in IEEE TGRS 2026

Abstract:Hyperspectral image (HSI) and SAR/LiDAR data offer complementary spectral and structural information for land-cover classification. However, their effective fusion remains challenging due to two major limitations: the spectral redundancy in high-dimensional HSI and the heterogeneous characteristics between multi-source data. To this end, we propose Representative Spectral Correlation Network (RSCNet), a novel multi-source image classification framework specifically designed to address the above challenges through spectral selection and adaptive interaction. The network incorporates two key components: (1) Key Band Selection Module (KBSM) that adaptively selects task-relevant spectral bands from the original HSI under cross-source guidance, thereby alleviating redundancy and mitigating information loss from conventional PCA-based spectral reduction. Moreover, the learned band subset exhibits highly discriminative spectral structures that align with discriminative semantic cues, promoting compact yet expressive representations. (2) Cross-source Adaptive Fusion Module (CAFM) that performs cross-source attention weighting and local-global contextual refinement to enhance cross-source feature interaction. Experiments on three public benchmark datasets demonstrate that our RSCNet achieves superior performance compared with state-of-the-art methods, while maintaining substantially lower computational complexity. Our codes are publicly available at this https URL.