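This post presents the latest papers fetched daily from the arXiv website, grouped into categories such as natural language processing, information retrieval, and computer vision.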

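Statistics

580 papers were updated today, including:

Natural language processing: 90 papers
Information retrieval: 23 papers
Computer vision: 147 papers

Natural Language Processing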

1. 【2604.09544】Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Link: https://arxiv.org/abs/2604.09544

Authors: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, jailbreaks routinely bypass, Large language, safeguards remain brittle, resulting safeguards remain

Comments:

Abstract:Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

2. 【2604.09537】Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

Link: https://arxiv.org/abs/2604.09537

Authors: Soroosh Tayebi Arasteh, Mehdi Joodaki, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Evidence-grounded reasoning requires, attaching retrieved text, Evidence-grounded reasoning, evidence, provided evidence supports

Comments:

Abstract:Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.

3. 【2604.09531】VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Link: https://arxiv.org/abs/2604.09531

Authors: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: viewpoint recognition, Vision-language models, spatial understanding, understanding and viewpoint, Vision-language

Comments: Project Page: [this https URL](https://zlab-princeton.github.io/VisionFoundry/)

Abstract:Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.

4. 【2604.09529】VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Link: https://arxiv.org/abs/2604.09529

Authors: Wenyi Xiao, Xinchi Xu, Leilei Gan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Vision Language, Large Vision, achieve strong multimodal, Vision Language Models, Vision Language

Comments: 24 pages, ACL 2026 Main. Repository: [this https URL](https://github.com/Mr-Loevan/VL-Calibration)

Abstract:Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.

5. 【2604.09514】Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

Link: https://arxiv.org/abs/2604.09514

Authors: Xinyu Wang, Sai Koneru, Wenbo Zhang, Wenliang Zheng, Saksham Ranjan, Sarah Rajtmajer

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: deceptive news-like content, Recent advances, large language models, news-like content, advances in large

Comments:

Abstract:Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.

6. 【2604.09501】You Can't Fight in Here! This is BBS!

Link: https://arxiv.org/abs/2604.09501

Authors: Richard Futrell, Kyle Mahowald

Categories: Computation and Language (cs.CL)

Keywords:

Comments: Accepted at Behavioral and Brain Sciences as a response to the commentaries to the accepted target article "How Linguistics Learned to Stop Worrying and Love the Language Models", whose preprint appears here: [arXiv:2501.17047](https://arxiv.org/abs/2501.17047)

7. 【2604.09497】BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Link: https://arxiv.org/abs/2604.09497

Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords:

Comments:

8. 【2604.09494】RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

Link: https://arxiv.org/abs/2604.09494

Authors: Kyle Whitecross, Negin Rahimi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: language models post-trained, reasoning language models, language models, models post-trained, In-context retrieval

Comments: Code, data, and models available at [this https URL](https://github.com/kswhitecross/RecaLLM)

Abstract:We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.

9. 【2604.09470】Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

Link: https://arxiv.org/abs/2604.09470

Authors: Vishnu Murali, Anmol Gulati, Elias Lumer, Kevin Frank, Sindy Campagna, Vamse Kumar Subbiah

Categories: Computation and Language (cs.CL)

Keywords: complex Boolean predicates, Translating natural language, requires resolving ambiguous, ambiguous field references, Boolean predicates

Comments:

Abstract:Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.

10. 【2604.09466】Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities

Link: https://arxiv.org/abs/2604.09466

Authors: Sathvik Nair, Colin Phillips

Categories: Computation and Language (cs.CL)

Keywords:

Comments: 9 pages, Behavioral and Brain Sciences Commentary on Futrell and Mahowald (forthcoming)

11. 【2604.09459】From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Link: https://arxiv.org/abs/2604.09459

Authors: Chenchen Zhang

Categories: Computation and Language (cs.CL)

Keywords: outcome remains difficult, long trajectory caused, Reinforcement learning, large language models, relies on sparse

Comments:

Abstract:Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500--30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K--1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches -- hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations -- that have no direct precedent in reasoning RL.

12. 【2604.09443】Many-Tier Instruction Hierarchy in LLM Agents

Link: https://arxiv.org/abs/2604.09443

Authors: Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language model, Large language, tool outputs, sources-system messages, trust and authority

Comments:

Abstract:Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system, user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.

13. 【2604.09442】UIPress: Bringing Optical Token Compression to UI-to-Code Generation

Link: https://arxiv.org/abs/2604.09442

Authors: Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu, Qizhen Lan

Categories: Computation and Language (cs.CL)

Keywords: generation requires vision-language, token efficiency critical, structured HTML, requires vision-language models, making visual token

Comments: 10 pages, 3 figures

Abstract:UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence -- neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering a 9.1× time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.

14. 【2604.09418】Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

Link: https://arxiv.org/abs/2604.09418

Authors: Solomiia Bilyk, Volodymyr Getmanskyi, Taras Firman

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Automated Instruction Revision, studies Automated Instruction, large language models, paper studies Automated, adapting large language

Comments:

Abstract:This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.

15. 【2604.09389】Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Link: https://arxiv.org/abs/2604.09389

Authors: Götz-Henrik Wiegand, Lorena Raichle, Rico Städeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Training Transformer language, Transformer language models, Transformer language, performance typically improves, Training Transformer

Comments: Presented as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil. Published at 13th IEEE Swiss Conference on Data Science and AI (SDS 2026)

Abstract:Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.

16. 【2604.09377】Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

Link: https://arxiv.org/abs/2604.09377

Authors: Hui Liu, Bin Zou, Kecheng Chen, Jie Liu, Wenya Wang, Haoliang Li

Categories: Computation and Language (cs.CL)

Keywords: exhibit substantial variability, user-specific cost-performance trade-offs, Large language models, meet user-specific cost-performance, Large language

Comments: 30 pages, Accepted by ACL 2026 Main

Abstract:Large language models (LLMs) exhibit substantial variability in performance and computational cost across tasks and queries, motivating routing systems that select models to meet user-specific cost-performance trade-offs. However, existing routers generalize poorly in cold-start scenarios where in-domain training data is unavailable. We address this limitation with a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and produces diverse question-answer pairs to approximate the test-time query distribution. Building on this, we introduce TRouter, a task-type-aware router approach that models query-conditioned cost and performance via latent task-type variables, with prior regularization derived from the synthesized task taxonomy. This design enhances TRouter's routing utility under both cold-start and in-domain settings. Across multiple benchmarks, we show that our synthesis framework alleviates cold-start issues and that TRouter delivers effective LLM routing.

17. 【2604.09364】Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

Link: https://arxiv.org/abs/2604.09364

Authors: Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Arbitration Crossover, blue banana, problem of perception, Vision-Language Model, Arbitration Crossover

Comments:

Abstract:When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding--Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit -- not the strength of encoding -- better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering -- both linear and sparse autoencoder-guided -- in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.

18. 【2604.09349】Visually-Guided Policy Optimization for Multimodal Reasoning

Link: https://arxiv.org/abs/2604.09349

Authors: Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Reinforcement learning, verifiable rewards, vision-language models, visual, learning with verifiable

Comments: ACL 2026

Abstract:Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.

19. 【2604.09338】Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

Link: https://arxiv.org/abs/2604.09338

Authors: Lars Benedikt Kaesberg, Tianyu Yang, Niklas Bauer, Terry Ruas, Jan Philip Wahle, Bela Gipp

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: tasks remains difficult, measuring model capabilities, navigation and robotics, remains difficult, central to navigation

Comments:

Abstract:Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.

20. 【2604.09265】EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue

Link: https://arxiv.org/abs/2604.09265

Authors: Jiawen Deng,Wei Li,Wentao Zhang,Ziyun Jiao,Fuji Ren

Categories: Computation and Language (cs.CL)

Keywords: Intelligent dialogue systems, ethically sensitive settings, Intelligent dialogue, sensitive settings, significant harm

Notes: 18 pages, Accepted to the ACL 2026 Main Conference

Abstract:Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose EthicMind, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, EthicMind jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that EthicMind achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.

21. 【2604.09237】ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

Link: https://arxiv.org/abs/2604.09237

Authors: Shahar Levy,Eliya Habba,Reshef Mintz,Barak Raveh,Renana Keydar,Gabriel Stanovsky

Categories: Computation and Language (cs.CL)

Keywords: require structured evidence, pose natural-language research, large document collections, answers typically require, typically require structured

Notes:

Abstract:Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which uses calls to a backbone LLM to turn a question and a corpus into a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: this http URL

22. 【2604.09212】SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

Link: https://arxiv.org/abs/2604.09212

Authors: Han Luo,Guy Laban

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Large language models, preserving consistent roles, Large language, Stable Persona-driven Agent, long horizons

Notes: Accepted to Findings of the Association for Computational Linguistics (ACL 2026). Our code and data are available at [this https URL](https://github.com/lhannnn/SPASM)

Abstract:Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM-LLM conversations can accumulate identity-related failures such as persona drift, role confusion, and "echoing", where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), a modular, stability-first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural-language persona crafting, (ii) Client-Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long-horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent's egocentric view before generation. Across three LLM backbones (GPT-4o-mini, DeepSeek-V3.2, Qwen-Plus) and nine Client-Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas × 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder-driven interaction geometry. Our code is available at this https URL.

23. 【2604.09189】Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

Link: https://arxiv.org/abs/2604.09189

Authors: Avni Mittal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: difficult to inspect, internalize safety policies, remain difficult, LLMs internalize safety, RLHF

Notes:

Abstract:LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.

24. 【2604.09174】Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

Link: https://arxiv.org/abs/2604.09174

Authors: Passant Elchafei,Monorama Swain,Shahed Masoudian,Markus Schedl

Categories: Computation and Language (cs.CL)

Keywords: hallucinated answers remain, answers remain common, hallucinated answers, aims to reduce, evidence

Notes:

Abstract:Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

25. 【2604.09162】Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events

Link: https://arxiv.org/abs/2604.09162

Authors: Yuqin Yang,Haowu Zhou,Haoran Tu,Zhiwen Hui,Shiqi Yan,HaoYang Li,Dong She,Xianrong Yao,Yang Gao,Zhanpeng Jin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: affective computing research, computing research treats, research treats emotion, Large Language Models, property of text

Notes: Accepted by ACL 2026 Main

Abstract:Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from "personality illusion": relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating "personality illusion."

26. 【2604.09150】Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning

Link: https://arxiv.org/abs/2604.09150

Authors: Yi Sui,Chaozhuo Li,Dawei Song

Categories: Computation and Language (cs.CL)

Keywords: high inference latency, Large Reasoning Models, excessive reasoning steps, achieve strong performance, Large Reasoning

Notes:

Abstract:Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. We therefore propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning states and self-prompted compression for overly long but confident states, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy that combines Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.

27. 【2604.09123】Prototype-Regularized Federated Learning for Cross-Domain Aspect Sentiment Triplet Extraction

Link: https://arxiv.org/abs/2604.09123

Authors: Zongming Cai,Jianhang Tang,Zhenyong Zhang,Jinghui Qin,Kebing Jin,Hankz Hankui Zhuo

Categories: Computation and Language (cs.CL)

Keywords: Aspect Sentiment Triplet, Sentiment Triplet Extraction, Sentiment Triplet, sentiment triplets, Aspect Sentiment

Notes:

Abstract:Aspect Sentiment Triplet Extraction (ASTE) aims to extract all sentiment triplets of aspect terms, opinion terms, and sentiment polarities from a sentence. Existing methods are typically trained on individual datasets in isolation, failing to jointly capture the common feature representations shared across domains. Moreover, data privacy constraints prevent centralized data aggregation. To address these challenges, we propose Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto), a prototype-regularized federated learning framework that enables distributed clients to exchange class-level prototypes instead of full model parameters. Specifically, we design a weighted performance-aware aggregation strategy and a contrastive regularization module to improve the global prototype under domain heterogeneity and to promote intra-class compactness and inter-class separability across clients. Extensive experiments on four ASTE datasets demonstrate that our method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.

28. 【2604.09121】Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

Link: https://arxiv.org/abs/2604.09121

Authors: Peng Wang(1),Yanqiao Zhu(1),Zixuan Jiang(1),Qinyuan Chen(2),Xingjian Zhao(2),Xipeng Qiu(2),Wupeng Wang(3),Zhifu Gao(3),Xiangang Li(3),Kai Yu(1),Xie Chen(1) ((1) X-LANCE Lab, Shanghai Jiao Tong University, (2) School of Computer Science, Fudan University, (3) Tongyi Fun Team, Alibaba Group)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: large-scale training data, witnessed remarkable progress, Recent years, automatic speech recognition, driven by advances

Notes:

Abstract:Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.

29. 【2604.09094】Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages

Link: https://arxiv.org/abs/2604.09094

Authors: Aditya Narayan Sankaran,Reza Farahbakhsh,Noel Crespi

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: social media shifts, Abusive speech detection, voice-based interaction, social media, media shifts

Notes: 14 pages, preprint under review

Abstract:Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.

30. 【2604.09075】Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency

Link: https://arxiv.org/abs/2604.09075

Authors: Shu Yang,Zihao Zhou,Di Wang,Wenda Li

Categories: Computation and Language (cs.CL)

Keywords: including system policies, Large language models, Large language, language models increasingly, models increasingly operate

Notes:

Abstract:Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.

31. 【2604.09069】NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System

Link: https://arxiv.org/abs/2604.09069

Authors: Parjanya Aditya Shukla,Shubham Kumar Nigam,Debtanu Datta,Balaramamahanthi Deepak Patnaik,Noel Shallum,Pradeep Reddy Vanga,Saptarshi Ghosh,Arnab Bhattacharya

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Court Judgment Prediction, Judgment Prediction, Court Judgment, aims to predict, Prediction Module

Notes:

Abstract:Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.

32. 【2604.09066】Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography

Link: https://arxiv.org/abs/2604.09066

Authors: Ruiyi Yan,Shiao Meng,Yugo Murawaki

Categories: Computation and Language (cs.CL)

Keywords: Linguistic steganography based, Linguistic steganography, language models typically, models typically assumes, transmitted without alteration

Notes: ACL2026 Main

Abstract:Linguistic steganography based on language models typically assumes that steganographic texts are transmitted without alteration, making them fragile to even minor modifications. While previous work mitigates this fragility by limiting the context window, it significantly compromises text quality. In this paper, we propose the anchored sliding window (ASW) framework to improve imperceptibility and robustness. In addition to the latest tokens, the prompt and a bridge context are anchored within the context window, encouraging the model to compensate for the excluded tokens. We formulate the optimization of the bridge context as a variant of prompt distillation, which we further extend using self-distillation strategies. Experiments show that our ASW significantly and consistently outperforms the baseline method in text quality, imperceptibility, and robustness across diverse settings. The code is available at this http URL.

33. 【2604.09037】SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

Link: https://arxiv.org/abs/2604.09037

Authors: Xiyang Huang,Jiawei Lin,Keying Wu,Jiaxin Huang,Kailai Yang,Renxiong Wei,Cheng zeng,Jiayi Xiang,Ziyan Kuang,Min Peng,Qianqian Xie,Sophia Ananiadou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: large language models, multimodal large language, harder capability required, language models, focus on event

Notes:

Abstract:Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.

34. 【2604.09029】CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space

Link: https://arxiv.org/abs/2604.09029

Authors: Yeonjun Hwang,Sungyong Park,Minju Kim,Dongha Lee,Jinyoung Yeo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, high-stakes domains due, Large language, reasoning capabilities, language models

Notes: preprint

Abstract:Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.

35. 【2604.09019】Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA

Link: https://arxiv.org/abs/2604.09019

Authors: Andre Bacellar

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: retrieval splits queries, explicitly named, bridge passage, splits queries, regimes determined

Notes: 8 pages, 5 figures. Theory and empirical validation of regime-conditional multi-hop retrieval routing

Abstract:Two-hop QA retrieval splits queries into two regimes determined by whether the hop-2 entity is explicitly named in the question (Q-dominant) or only in the bridge passage (B-dominant). We formalize this split with three theorems: (T1) per-query AUC is a monotone function of the cosine separation margin, with R^2 = 0.90 for six of eight type-encoder pairs; (T2) regime is characterized by two surface-text predicates, with P1 decisive for routing and P2 qualifying the B-dominant case, holding across three encoders and three datasets; and (T3) bridge advantage requires the relation-bearing sentence, not entity name alone, with removal causing an 8.6-14.1 pp performance drop (p < 0.001). Building on this theory, we propose RegimeRouter, a lightweight binary router that selects between question-only and question-plus-relation-sentence retrieval using five text features derived directly from the predicate definitions. Trained on 2WikiMultiHopQA (n = 881, 5-fold cross-fitted) and applied zero-shot to MuSiQue and HotpotQA, RegimeRouter achieves +5.6 pp (p < 0.001), +5.3 pp (p = 0.002), and +1.1 pp (non-significant, no-regret) R@5 improvement, respectively, with artifact-driven.

36. 【2604.09008】Towards Linguistically-informed Representations for English as a Second or Foreign Language: Review, Construction and Application

Link: https://arxiv.org/abs/2604.09008

Authors: Wenxi Li,Xihao Wang,Weiwei Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: distinct linguistic system, Foreign Language, ESFL, sparked a paradigm, standard English

Notes:

Abstract:The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax-semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL's unique characteristics, resulting in a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank's practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.

37. 【2604.08999】ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

Link: https://arxiv.org/abs/2604.08999

Authors: Xiaoke Guo,Songze Li,Zhiqiang Liu,Zhaoyan Gong,Yuanxiang Liu,Huajun Chen,Wen Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, bottleneck for Large, table question answering, representation gaps

Notes:

Abstract:Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

38. 【2604.08986】PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

Link: https://arxiv.org/abs/2604.08986

Authors: Jihwan Oh,Soowon Oh,Murad Aghazada,Minchan Jeong,Sungnyun Kim,Se-Young Yun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: assigning specific characters, steer large language, large language models, specific characters, widely adopted

Notes: Preprint

Abstract:Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.

39. 【2604.08977】Testing the Assumptions of Active Learning for Translation Tasks with Few Samples

Link: https://arxiv.org/abs/2604.08977

Authors: Lorenzo Jaime Yu Flores, Cesare Spinoso di-Piano, Ori Ernst, David Ifeoluwa Adelani, Jackie Chi Kit Cheung

Categories: Computation and Language (cs.CL)

Keywords: Active learning, selecting unlabeled samples, improve model performance, paradigm for selecting, selecting unlabeled

Comments:

Abstract:Active learning (AL) is a training paradigm for selecting unlabeled samples for annotation to improve model performance on a test set, which is useful when only a limited number of samples can be annotated. These algorithms often work by optimizing for the informativeness and diversity of the training data to be annotated. Recent work found that AL strategies fail to outperform random sampling on various language generation tasks when using 100-500 samples. To understand AL's poor performance when using only a few samples, we investigate whether the core assumptions underlying AL strategies hold. We find that neither the informativeness nor the diversity of the training data, which AL strategies optimize for, is correlated with test set performance. Instead, factors like the ordering of the training samples and interactions with pre-training data have a larger impact on performance. This suggests that future AL methods must take these factors into account in order to work with very few samples.

40. 【2604.08976】Quantisation Reshapes the Metacognitive Geometry of Language Models

Link: https://arxiv.org/abs/2604.08976

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL)

Keywords: model quantisation restructures, quantisation restructures domain-level, restructures domain-level metacognitive, domain-level metacognitive efficiency, degrading it uniformly

Comments: 10 pages, 2 figures, 5 tables. Pre-registered study. Code and data: [this https URL](https://github.com/synthiumjp/sdt-calibration)

Abstract:We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d' because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.

41. 【2604.08974】Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

Link: https://arxiv.org/abs/2604.08974

Authors: Lorenzo Jaime Yu Flores, Cesare Spinoso di-Piano, Jackie Chi Kit Cheung

Categories: Computation and Language (cs.CL)

Keywords: Uncertainty quantification, confidence scores, language models, set of techniques, techniques that measure

Comments:

Abstract:Uncertainty quantification is a set of techniques that measure confidence in language models. They can be used, for example, to detect hallucinations or alert users to review uncertain predictions. To be useful, these confidence scores must be correlated with the quality of the output. However, recent work found that fine-tuning can affect the correlation between confidence scores and quality. Hence, we investigate the underlying behavior of confidence scores to understand its sensitivity to supervised fine-tuning (SFT). We find that post-SFT, the correlation of various confidence scores degrades, which can stem from changes in confidence scores due to factors other than the output quality, such as the output's similarity to the training distribution. We demonstrate via a case study how failing to address this miscorrelation reduces the usefulness of the confidence scores on a downstream task. Our findings show how confidence metrics cannot be used off-the-shelf without testing, and motivate the need for developing metrics which are more robust to fine-tuning.

42. 【2604.08970】Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

Link: https://arxiv.org/abs/2604.08970

Authors: Avni Mittal, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Keywords: study predictive multilingual, predictive multilingual evaluation, study predictive, target language, evidence

Comments:

Abstract:We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.

43. 【2604.08964】Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models

Link: https://arxiv.org/abs/2604.08964

Authors: Shun Zou, Yong Wang, Zehui Chen, Lin Chen, Chongyang Tao, Feng Zhao, Xiangxiang Chu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Diffusion Large Language, Diffusion Large, autoregressive large language, Language Models

Comments: Accepted for ACL 2026

Abstract:Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.

44. 【2604.08952】MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits

Link: https://arxiv.org/abs/2604.08952

Authors: Yixin Xiang, Yunshan Ma, Xiaoyu Du, Yibing Chen, Yanxin Zhang, Jinhui Tang

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Document Question Answering, Question Answering, involves generating answers, Document Question, involves generating

Comments: Accepted by ACL 2026. 19 pages, 9 figures, 6 tables

Abstract:Document Question Answering (DQA) involves generating answers from a document based on a user's query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at this https URL.
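The budget-reallocation idea in this abstract can be illustrated with a generic bandit loop. The sketch below is only an illustration: the abstract does not name its exploration-exploitation policy, so UCB1 is swapped in here as an assumption, and `ucb1_allocate` and `reward_fn` are hypothetical names, not part of MAB-DQA.

```python
import math
from typing import Callable, List


def ucb1_allocate(reward_fn: Callable[[int], float],
                  n_arms: int,
                  budget: int,
                  c: float = math.sqrt(2)) -> List[int]:
    """Spend `budget` pulls across `n_arms` arms using UCB1.

    Each arm stands in for one aspect-aware subquery (assumption);
    reward_fn(arm) returns a reward in [0, 1] for one pull of that arm.
    Returns how many pulls each arm received.
    """
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(budget):
        if t < n_arms:
            # Pull every arm once before using confidence bounds.
            arm = t
        else:
            # Pick the arm with the highest mean reward plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + c * math.sqrt(math.log(t) / counts[a]),
            )
        totals[arm] += reward_fn(arm)
        counts[arm] += 1
    return counts
```

With a fixed reward per arm, most of the budget drifts toward the highest-reward arm while the others still receive occasional pulls, which mirrors the "reallocate retrieval budgets toward high-value aspects" behaviour described above.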

45. 【2604.08948】TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

Link: https://arxiv.org/abs/2604.08948

Authors: Gang Hu, Yating Chen, Haiyan Ding, Wang Gao, Jiajia Huang, Min Peng, Qianqian Xie, Kun Yu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, exhibit notable gaps, legally regulated Chinese

Comments:

Abstract:While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Moreover, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed as a process of "structured parsing-field alignment extraction-numerical and textual matching", enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.

46. 【2604.08947】MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

Link: https://arxiv.org/abs/2604.08947

Authors: Rares-Alexandru Roscan, Gabriel Petre, Adrian-Marius Dumitran, Angela-Liliana Dumitran

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Intelligent Tutoring Systems, Language Models, diverse prompting strategies, critical methodological challenge

Comments: Accepted for ITS 2026

Abstract:As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces -- neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce MuTSE, an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets (the project code and the demo have been made available for peer review at an anonymized URL). The system supports concurrent execution of $P \times M$ prompt-model permutations, generating a comprehensive comparison matrix in real-time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic ($\lambda$), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.

47. 【2604.08923】NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression

Link: https://arxiv.org/abs/2604.08923

Authors: Tong Wu, Nicolay Rusnachenko, Huizhi Liang

Categories: Computation and Language (cs.CL)

Keywords: Aspect-Based Sentiment Analysis, extends traditional ABSA, Dimensional Aspect-Based Sentiment, Dimensional Aspect Sentiment, categorical polarity labels

Comments:

Abstract:Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A - Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick under a few-shot prompting setting, demonstrating that task-specific fine-tuning substantially and consistently outperforms these LLM-based methods across all evaluation datasets. The code is publicly available at this https URL.
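The input template and the sigmoid-scaled output described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: `build_input` and `scale_to_range` are hypothetical helper names, and mapping the sigmoid linearly onto [1, 9] is an assumption about how "sigmoid-scaled outputs" are implemented.

```python
import math


def build_input(text: str, aspect: str) -> str:
    """Format a text/aspect pair as '[CLS] T [SEP] a_i [SEP]'."""
    return f"[CLS] {text} [SEP] {aspect} [SEP]"


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def scale_to_range(logit: float, low: float = 1.0, high: float = 9.0) -> float:
    """Map an unbounded regression logit into [low, high] via a sigmoid."""
    return low + (high - low) * sigmoid(logit)
```

A logit of zero lands at the midpoint of the range, and large positive or negative logits saturate toward 9 and 1 respectively, so predictions can never leave the valid valence-arousal interval.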

48. 【2604.08920】Beyond Relevance: Utility-Centric Retrieval in the LLM Era

Link: https://arxiv.org/abs/2604.08920

Authors: Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: topical relevance-the degree, match a query, traditionally optimized, optimized for topical, topical relevance-the

Comments: Accepted by SIGIR2026

Abstract:Information retrieval systems have traditionally optimized for topical relevance, the degree to which retrieved documents match a query. However, relevance only approximates a deeper goal: utility, namely, whether retrieved information helps accomplish a user's underlying task. The emergence of retrieval-augmented generation (RAG) fundamentally changes this paradigm. Retrieved documents are no longer consumed directly by users but instead serve as evidence for large language models (LLMs) that produce answers. As a result, retrieval effectiveness must be evaluated by its contribution to generation quality rather than by relevance-based ranking metrics alone. This tutorial argues that retrieval objectives are evolving from relevance-centric optimization toward LLM-centric utility. We present a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and the connection with LLM information needs and agentic RAG. By synthesizing recent advances, the tutorial provides conceptual foundations and practical guidance for designing retrieval systems aligned with the requirements of LLM-based information access.

49. 【2604.08880】Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

Link: https://arxiv.org/abs/2604.08880

Authors: Tokio Kajitsuka, Ukyo Honda, Sho Takase

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: transfers reasoning behaviors, prior work reports, distillation transfers reasoning, teacher-student capability mismatch, mismatch is large

Comments: 19 pages, 6 figures

Abstract:Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.

50. 【2604.08879】GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

Link: https://arxiv.org/abs/2604.08879

Authors: Faxian Wan, Xiaocui Yang, Yifan Cao, Shi Feng, Daling Wang, Yifei Zhang

Categories: Computation and Language (cs.CL)

Keywords: Multimodal Sarcasm Detection, requiring precise localization, Sarcasm Target Identification, Multimodal Sarcasm, Sarcasm Detection

Comments:

Abstract:Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.

51. 【2604.08851】Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition

Link: https://arxiv.org/abs/2604.08851

Authors: Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Noriyuki Kawarazaki, Kosuke Takano

Categories: Computation and Language (cs.CL)

Keywords: multilingual personality recognition, personality recognition, unresolved challenge, multilingual datasets remains, significant work

Comments: IEEE Transactions on Cognitive and Developmental Systems (2026)

Abstract:While significant work has been done on personality recognition, the lack of multilingual datasets remains an unresolved challenge. To address this, we propose ADAM (Cross-Lingual (A)ttention (D)istillation with Personality-Guided Generative (A)ugmentation for (M)ultilingual Personality Recognition), a state-of-the-art approach designed to advance multilingual personality recognition. Our approach leverages an existing English-language personality dataset as the primary source and employs a large language model (LLM) for translation-based augmentation, enhanced by Personality-Informed Generative Augmentation (PIGA), to generate high-quality training data in multiple languages, including Japanese, Chinese, Malay, and French. We provide a thorough analysis to justify the effectiveness of these augmentation techniques. Building on these advancements, ADAM integrates Cross-Lingual Attention Distillation (CLAD) to train a model capable of understanding and recognizing personality traits across languages, bridging linguistic and cultural gaps in personality analysis. This research presents a thorough evaluation of the proposed augmentation method, incorporating an ablation study on recognition performance to ensure fair comparisons and robust validation. Overall, with PIGA augmentation, the findings demonstrate that CLAD significantly outperforms the standard BCE across all languages and personality traits, achieving notable improvements in average BA scores: 0.6332 (+0.0573) on the Essays dataset and 0.7448 (+0.0968) on the Kaggle dataset. The CLAD-trained model also demonstrated strong generalizability and achieved benchmark performance comparable to current leading encoder models. The model weight, dataset, and algorithm repository are available at this https URL.

52. 【2604.08849】Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

Link: https://arxiv.org/abs/2604.08849

Authors: Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)

Keywords: million users monthly, meet enrollment targets, http URL, million trials listed, million users

Comments: Under review

Abstract:Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on this http URL, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods -- Satisfiability Modulo Theories (SMT) and relational algebra -- to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.

53. 【2604.08846】Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Link: https://arxiv.org/abs/2604.08846

Authors: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, elicit unsafe responses, Language Models

Comments: Accepted in CVPR 2026. Project page: [this https URL](https://peterljq.github.io/project/daco)

Abstract:Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.

54. 【2604.08826】HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Link: https://arxiv.org/abs/2604.08826

Authors: Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, Shadan Golestan

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: modern machine learning, performance scaling predictably, Large foundation models, machine learning, size and data

Comments:

Abstract:Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats--such as MXFP4 and NVFP4--can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.

55. 【2604.08801】$p1$: Better Prompt Optimization with Fewer Prompts

Link: https://arxiv.org/abs/2604.08801

Authors: Zhaolin Gao, Yu (Sid) Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: system prompts, effectiveness varies widely, prompts, system prompt, system

Comments:

Abstract:Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
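The filtering step described in this abstract, keeping user prompts whose reward varies most across candidate system prompts, can be sketched as follows. This is a rough illustration under stated assumptions: the abstract does not give the exact scoring rule, so population variance of per-system-prompt mean rewards is used here, and `select_high_variance_prompts` and the `rewards` layout are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def select_high_variance_prompts(rewards: Dict[str, List[float]],
                                 k: int) -> List[str]:
    """Return the k user prompts that best separate system prompts.

    rewards[user_prompt] holds one mean reward per candidate system
    prompt (assumed layout). A user prompt scores highly when those
    means differ a lot, i.e. when it distinguishes good system prompts
    from bad ones.
    """
    scores = {user: pvariance(per_system)
              for user, per_system in rewards.items()}
    # Highest variance across system prompts first.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A user prompt on which every system prompt earns the same reward scores zero and is dropped first, which matches the abstract's point that such prompts add noise without helping to rank system prompts.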

56. 【2604.08797】Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation

Link: https://arxiv.org/abs/2604.08797

Authors: Sophie Wu, Andrew Piper

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Stories are key, key to transmitting, varies across linguistic, Stories, human

Comments:

Abstract:Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.

57. 【2604.08788】MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability

Link: https://arxiv.org/abs/2604.08788

Authors: Yikun Han, Joey Chan, Jingyuan Chen, Mengting Ai, Simo Du, Yue Guo

Categories: Computation and Language (cs.CL)

Keywords: Patient-clinician communication, asymmetric-information problem, disclose fears, practical barriers, medical dialogue

Comments:

Abstract:Patient-clinician communication is an asymmetric-information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. The cases are built from clinician-answered online health discussions, and each case pairs clinician-visible context with simulator-internal hidden concerns derived from prior literature and structured using an expert-developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory-grounded turn-level communication signals, and is clinician-reviewed for clinical plausibility. This enables process-aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi-turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden-concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.

58. 【2604.08782】MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

Link: https://arxiv.org/abs/2604.08782

Authors: Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, Dan Roth

Categories: Computation and Language (cs.CL)

Keywords: Large language models, interactions dominate chat, dominate chat interfaces, Large language, multiple conversational turns

Comments:

Abstract:Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.

59. 【2604.08764】Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics

Link: https://arxiv.org/abs/2604.08764

Authors: Raphael Bernas, Fanny Jourdan, Antonin Poché, Céline Hudelot

Categories: Computation and Language (cs.CL); Differential Geometry (math.DG)

Keywords: Natural Language Processing, dominated Natural Language, Transformer architectures, dominated Natural, Language Processing

Comments:

Abstract:Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplifies tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.

60. 【2604.08759】Optimal Multi-bit Generative Watermarking Schemes Under Worst-Case False-Alarm Constraints

Link: https://arxiv.org/abs/2604.08759

Authors: Yu-Shin Huang, Chao Tian, Krishna Narayanan

Categories: Information Theory (cs.IT); Computation and Language (cs.CL)

Keywords: worst-case false-alarm constraint, large language models, multi-bit generative watermarking, false-alarm constraint, large language

Comments: 41 pages, 8 tables

Abstract:This paper considers the problem of multi-bit generative watermarking for large language models under a worst-case false-alarm constraint. Prior work established a lower bound on the achievable miss-detection probability in the finite-token regime and proposed a scheme claimed to achieve this bound. We show, however, that the proposed scheme is in fact suboptimal. We then develop two new encoding-decoding constructions that attain the previously established lower bound, thereby completely characterizing the optimal multi-bit watermarking performance. Our approach formulates the watermark design problem as a linear program and derives the structural conditions under which optimality can be achieved. In addition, we identify the failure mechanism of the previous construction and compare the tradeoffs between the two proposed schemes.

61. 【2604.08757】Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

Link: https://arxiv.org/abs/2604.08757

Authors: Yousra Fettach, Guillaume Bied, Hannu Toivonen, Tijl De Bie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Model, socially significant dimensions, dimension of Large, Large Language, remains largely unexplored

Comments:

Abstract:Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity (CAH) games as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question of whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.

62. 【2604.08752】LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs

Link: https://arxiv.org/abs/2604.08752

Authors: Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: creating knowledge graphs, Relation extraction represents, Relation extraction, represents a fundamental, fundamental component

Comments: Accepted at ACL 2026 (Main Conference)

Abstract:Relation extraction represents a fundamental component in the process of creating knowledge graphs, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text has great complexity. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs, as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.

63. 【2604.08723】Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

Link: https://arxiv.org/abs/2604.08723

Authors: Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai, Supriyo Chakraborty, Shixiong Zhang, Sambit Sahu, William Campbell

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: DPO and KTO, KTO are widely, downstream reasoning gains, aligning language models, data drive downstream

Comments:

Abstract:Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model's performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the differences in capability between models that generate chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality within an individual preference pair. To study generator-level delta, we vary the generator's scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks and filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.

64. 【2604.08708】Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition

Link: https://arxiv.org/abs/2604.08708

Authors: Tiejin Chen, Huaiyuan Yao, Jia Chen, Evangelos E. Papalexakis, Hua Wei

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model-based, Language Model-based Multi-Agent, Model-based Multi-Agent Systems, outperform single-agent systems, Large Language

Comments: Accepted to ACL 2026

Abstract:While Large Language Model-based Multi-Agent Systems (MAS) consistently outperform single-agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single-turn outputs, fail to address the unique complexities of the MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi-step reasoning, the variability of inter-agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.

65. 【2604.08690】Skip-Connected Policy Optimization for Implicit Advantage

Link: https://arxiv.org/abs/2604.08690

Authors: Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, Zhijiang Guo

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Group Relative Policy, Relative Policy Optimization, effective in RLVR, Policy Optimization, Group Relative

Comments:

Abstract:Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Policy Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate relative gains of 3.91% and 6.17% over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B respectively across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.
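
For context, the outcome-only GRPO baseline mentioned in this abstract scores each sampled response relative to the other responses in its group. The following is a minimal sketch of that widely used group-relative advantage, assuming scalar outcome rewards. It does not implement SKPO's upstream/downstream split, and the `eps` stabilizer and function name are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize outcome rewards within one group of sampled responses.

    rewards: one scalar reward per response sampled for the same problem.
    eps: small constant (an assumption) to avoid dividing by zero when all
    responses in the group receive the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two correct and two incorrect responses to the same problem.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Every token of a response shares that response's advantage in this baseline, which is the coarse credit assignment that dense per-step rewards try to refine.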

66. 【2604.08649】PRAGMA: Revolut Foundation Model

Link: https://arxiv.org/abs/2604.08649

Authors: Maxim Ostroukhov, Ruslan Mikhailov, Vladimir Iashin, Artem Sokolov, Andrei Akshonov, Vitaly Protasov, Dmitrii Beloborodov, Vince Mullin, Roman Yokunda Enzmann, Georgios Kolovos, Jason Renders, Pavel Nesterov, Anton Repushko

Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)

Keywords: rich economic signals, Modern financial systems, systems generate vast, generate vast quantities, encode rich economic

Comments:

Abstract:Modern financial systems generate vast quantities of transactional and event-level data that encode rich economic signals. This paper presents PRAGMA, a family of foundation models for multi-source banking event sequences. Our approach pre-trains a Transformer-based architecture with masked modelling on a large-scale, heterogeneous banking event corpus using a self-supervised objective tailored to the discrete, variable-length nature of financial records. The resulting model supports a wide range of downstream tasks such as credit scoring, fraud detection, and lifetime value prediction: strong performance can be achieved by training a simple linear model on top of the extracted embeddings and can be further improved with lightweight fine-tuning. Through extensive evaluation on downstream tasks, we demonstrate that PRAGMA achieves superior performance across multiple domains directly from raw event sequences, providing a general-purpose representation layer for financial applications.

67. 【2604.08644】EXAONE 4.5 Technical Report

Link: https://arxiv.org/abs/2604.08644

Authors: Eunbi Choi, Kibong Choi, Sehyun Chun, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Ahra Jo, Hyunjik Jo, Yeonsik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Changhun Lee, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Kwangrok Ryoo, Minju Seo, Sejong Yang, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Kyubeen Han, Joonwon Jang, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Jiyeon Jung, Daeseong Kim, Dohoon Kim, Dohyun Kim, Hyunseo Kim, Minu Kim, Myoungshin Kim, Youchul Kim, Byungoh Ko, Christopher Lee, Edward Hwayoung Lee, Honglak Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Woohyung Lim, Jueun Mun, Jaewoo Park, Jimin Park, Jinho Park, Yongmin Park, Wooseok Seo, Yongwoo Song, Sihyuk Yi, Kyungjae Yoo, Sangyeon Yoon

Categories: Computation and Language (cs.CL)

Keywords: technical report introduces, report introduces EXAONE, open-weight vision language, technical report, report introduces

Comments:

Abstract:This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.

68. 【2604.08603】From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

Link: https://arxiv.org/abs/2604.08603

Authors: Hongyin Zhu, Jinming Liang, Mengjun Hou, Ruifan Tang, Xianbin Zhu, Jingyuan Yang, Yuanman Mao, Feng Wu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Existing LLM-based agent, unrestricted knowledge space, LLM-based agent systems, agent systems share, Existing LLM-based

Comments:

Abstract:Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand -- producing decisions that are fluent but ungrounded and carrying no audit trail. We present LOM-action, which equips enterprise AI with event-driven ontology simulation: business events trigger scenario conditions encoded in the enterprise ontology (EO), which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph $G_{\text{sim}}$; all decisions are derived exclusively from this evolved graph. The core pipeline is event → simulation → decision, realized through a dual-mode architecture -- skill mode and reasoning mode. Every decision produces a fully traceable audit log. LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines Doubao-1.8 and DeepSeek-V3.2, which reach only 24-36% F1 despite 80% accuracy -- exposing the "illusive accuracy" phenomenon. The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.

69. 【2604.08595】Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

Link: https://arxiv.org/abs/2604.08595

Authors: Aleksandr Meshkov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Existing evaluation methods, Existing evaluation, adapt their strictness, Temperature-Controlled Verdict Aggregation, Existing

Comments:

Abstract:Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter $T \in [0.1, 1.0]$ to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman $\rho$ = 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting the temperature parameter.
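
To make the aggregation idea concrete, here is a minimal sketch of a generalized power mean over per-claim verdict scores. The abstract does not say how the temperature is mapped to the power-mean exponent, so `temperature_to_exponent`, its linear form, its exponent range, and the example scores are all hypothetical.

```python
import math


def power_mean(scores, p):
    """Generalized power mean M_p of strictly positive scores.

    Very negative p approaches min(scores) (strict); very positive p
    approaches max(scores) (lenient); p = 0 is the geometric mean.
    """
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be non-empty and strictly positive")
    n = len(scores)
    if abs(p) < 1e-12:
        return math.exp(sum(math.log(s) for s in scores) / n)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)


def temperature_to_exponent(t, p_strict=-8.0, p_lenient=8.0):
    """Hypothetical linear map from T in [0.1, 1.0] to an exponent p."""
    if not 0.1 <= t <= 1.0:
        raise ValueError("temperature must lie in [0.1, 1.0]")
    return p_strict + (t - 0.1) / 0.9 * (p_lenient - p_strict)


# Hypothetical verdict scores for three claims: one weak, two fully supported.
verdicts = [0.2, 1.0, 1.0]
strict = power_mean(verdicts, temperature_to_exponent(0.1))
lenient = power_mean(verdicts, temperature_to_exponent(1.0))
```

Because only the exponent changes, the stored verdict scores can be re-aggregated at any temperature without querying the judge model again, which is consistent with the abstract's last sentence.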

70. 【2604.08571】Robust Reasoning Benchmark

Link: https://arxiv.org/abs/2604.08571

Authors: Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, processes remain highly, remain highly overfit, standard textual formatting

Comments:

Abstract:While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters and Claude Opus 4.6 exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, leading to fundamental open questions regarding the optimal granularity of atomic reasoning tasks.

71. 【2604.08568】Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era

Link: https://arxiv.org/abs/2604.08568

Authors: Nabelanita Utami, Sasano Ryohei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: writing assistance tools, large language models, researchers write, ACL Anthology papers, evolution of writing

Comments:

Abstract:The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.

72. 【2604.08567】Multi-User Large Language Model Agents

Link: https://arxiv.org/abs/2604.08567

Authors: Shu Yang, Shenzhe Zhu, Hao Zhu, José Ramón Enríquez, Di Wang, Alex Pentland, Michiel A. Bakker, Jiaxin Pei

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Large language models, Large language, single-principal interaction paradigm, language models, deployed as assistants

Comments:

Abstract:Large language models (LLMs) and LLM-based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm, in which the model is designed to satisfy the objectives of one dominant user whose instructions are treated as the sole source of authority and utility. However, as they are integrated into team workflows and organizational tools, they are increasingly required to serve multiple users simultaneously, each with distinct roles, preferences, and authority levels, leading to multi-user, multi-principal settings with unavoidable conflicts, information asymmetry, and privacy constraints. In this work, we present the first systematic study of multi-user LLM agents. We begin by formalizing multi-user interaction with LLM agents as a multi-principal decision problem, where a single agent must account for multiple users with potentially conflicting interests and associated challenges. We then introduce a unified multi-user interaction protocol and design three targeted stress-testing scenarios to evaluate current LLMs' capabilities in instruction following, privacy preservation, and coordination. Our results reveal systematic gaps: frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi-turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.

73. 【2604.08566】Sentiment Classification of Gaza War Headlines: A Comparative Analysis of Large Language Models and Arabic Fine-Tuned BERT Models

Link: https://arxiv.org/abs/2604.08566

Authors: Amr Eleraqi, Hager H. Mustafa, Abdul Hadi N. Ahmed

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: artificial intelligence architectures, intelligence architectures interpret, conflict-related media discourse, Gaza War, Arabic BERT models

Comments: 45 pages, 6 figures (including diagrams), 8 tables. Dataset available at this https URL. Previously posted at [this https URL](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FFENX3)

Abstract:This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.
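
The two information-theoretic metrics named in this abstract can be computed directly from each model's distribution over sentiment labels. The sketch below uses the textbook definitions; the base-2 logarithm, the label order, and the example distributions are assumptions, not values from the paper.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def _kl_divergence(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0.
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, range 0..1)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against float round-off


# Hypothetical label shares in the order (negative, neutral, positive).
mostly_neutral = [0.10, 0.80, 0.10]
mostly_negative = [0.90, 0.08, 0.02]
gap = jensen_shannon_distance(mostly_neutral, mostly_negative)
```

A low entropy indicates that a model collapses onto one label, while a large distance between two models indicates that they frame the same headlines differently.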

74. 【2604.08565】Dynamic sparsity in tree-structured feed-forward layers at scale

Link: https://arxiv.org/abs/2604.08565

Authors: Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: typical context lengths, MLP block accounts, motivating sparse alternatives, transformer compute budget, feed-forward MLP block

Comments:

Abstract:At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
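
One way to picture hard hierarchical routing without a separate router is a binary tree in which each unit's own pre-activation decides which child is evaluated next. The sketch below follows that idea in the style of fast feed-forward layers. It is an assumption about the general mechanism, not the paper's layer: the ReLU nonlinearity, the routing rule, the random weights, and all names are hypothetical.

```python
import random


def make_tree(depth, dim, seed=0):
    """Random weights for a complete binary tree with 2**depth - 1 units."""
    rng = random.Random(seed)
    n_units = 2 ** depth - 1
    w_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_units)]
    w_out = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_units)]
    return w_in, w_out


def tree_ffn(x, w_in, w_out, depth):
    """Evaluate only the units on one root-to-leaf path for input x."""
    y = [0.0] * len(x)
    unit = 0  # root, using heap indexing: children of i are 2i+1 and 2i+2
    visited = []
    for _ in range(depth):
        visited.append(unit)
        pre = sum(w * xi for w, xi in zip(w_in[unit], x))
        act = max(pre, 0.0)  # ReLU, an asymmetric nonlinearity
        for j, w in enumerate(w_out[unit]):
            y[j] += act * w
        # Hard routing: the sign of the pre-activation picks the child.
        unit = 2 * unit + (2 if pre > 0 else 1)
    return y, visited


depth, dim = 8, 4
w_in, w_out = make_tree(depth, dim)
y, visited = tree_ffn([1.0, -0.5, 0.25, 2.0], w_in, w_out, depth)
active_fraction = len(visited) / len(w_in)
```

With depth 8, a token touches 8 of 255 units, about 3%, which shows how a tree layout can keep the active share under the 5% figure quoted in the abstract.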

75. 【2604.08564】Attention-Based Sampler for Diffusion Language Models

Link: https://arxiv.org/abs/2604.08564

Authors: Yuyan Zhou, Kai Syun Hou, Weiyu Chen, James Kwok

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Auto-regressive models, established a dominant, language modeling, dominant paradigm, Auto-regressive

Comments:

Abstract:Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLM decoding strategies rely primarily on token-level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the decoding parallelism.
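
The ordering rule stated in this abstract, decoding positions in descending order of their attention-matrix column sums, can be sketched in a few lines. This is an illustration of the ranking step only, not Attn-Sampler itself: which layers and heads supply the matrix, how they are combined, and the toy values are assumptions, and `decoding_order` is a hypothetical name.

```python
def decoding_order(attn, masked_positions):
    """Rank still-masked positions by attention column sum, largest first.

    attn: square matrix as a list of rows, where attn[i][j] is the attention
    weight from query position i to key position j (assumed to be already
    aggregated over layers and heads).
    masked_positions: indices of tokens that have not been decoded yet.
    """
    column_sum = {j: sum(row[j] for row in attn) for j in masked_positions}
    # sorted() is stable, so ties keep their original left-to-right order.
    return sorted(masked_positions, key=lambda j: -column_sum[j])


# Toy 3x3 attention matrix; each row sums to 1.
toy_attn = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]
order = decoding_order(toy_attn, [0, 1, 2])
```

In this toy case position 1 receives the most attention overall, so it is decoded first, followed by position 2 and then position 0.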

76. 【2604.08563】Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

Link: https://arxiv.org/abs/2604.08563

Authors: Mousa Salah, Amgad Muneer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Model, Large Language, Language Model, enabling explicit test-time, explicit test-time computation

Comments: 3 Figures, 2 Tables

Abstract:Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.

77. 【2604.08562】Neural networks for Text-to-Speech evaluation

Link: https://arxiv.org/abs/2604.08562

Authors: Ilya Trofimenko, David Kocharyan, Aleksandr Zaitsev, Pavel Repnikov, Mark Levin, Nikita Shevtsov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: systems deliver human-perceived, modern speech technologies, deliver human-perceived quality, systems deliver, speech technologies

Comments:

Abstract:Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy on the SOMOS dataset. For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.

78. 【2604.08561】A Representation-Level Assessment of Bias Mitigation in Foundation Models

Link: https://arxiv.org/abs/2604.08561

Authors: Svetoslav Nizhnichenkov, Rahul Nair, Elizabeth Daly, Brian Mac Namee

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: successful bias mitigation, bias mitigation reshapes, investigate how successful, bias mitigation, successful bias

Comments: Accepted at ECML-PKDD 2025 (5th Workshop on Bias and Fairness in AI)

Abstract:We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (this https URL)

79. 【2604.08560】Uncertainty Estimation for the Open-Set Text Classification systems

Link: https://arxiv.org/abs/2604.08560

Authors: Leonid Erlygin, Alexey Zaytsev

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Accurate uncertainty estimation, Accurate uncertainty, uncertainty estimation, Holistic Uncertainty Estimation, essential for building

Comments:

Abstract:Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open-set text classification (OSTC) task and uncertainty estimation for it. In OSTC, a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill-formulated queries and gallery uncertainty that is related to the ambiguity of the data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, using authorship attribution, intent classification, and topic classification datasets. HolUE achieves 40-365% improvement in Prediction Rejection Ratio (PRR) over the quality-based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs 0.52). We make our code and protocols public at this https URL.
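
The percentages quoted in this abstract are consistent with reading each one as a relative improvement in PRR, (HolUE - SCF) / SCF. The short check below recomputes them from the reported score pairs; the relative-improvement formula is an assumption about how the figures were derived, and the dictionary keys are just labels.

```python
# Reported (HolUE PRR, SCF baseline PRR) pairs taken from the abstract.
reported = {
    "Yahoo Answers": (0.79, 0.17),
    "DBPedia": (0.85, 0.19),
    "PAN": (0.51, 0.15),
    "CLINC150": (0.73, 0.52),
}


def relative_improvement(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


for name, (holue, scf) in reported.items():
    print(f"{name}: {relative_improvement(holue, scf):.0f}%")
```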

80. 【2604.08559】Medical Reasoning with Large Language Models: A Survey and MR-Bench

Link: https://arxiv.org/abs/2604.08559

Authors: Xiaohan Ren, Chenxiao Fan, Wenyin Ma, Hongliang He, Chongming Gao, Xiaoyan Zhao, Fuli Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, motivating growing interest, Large language, achieved strong performance, motivating growing

Comments:

Abstract:Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.

81. 【2604.08558】WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

Link: https://arxiv.org/abs/2604.08558

Authors: Hanna Lee, Tan Dat Nguyen, Jaehoon Kang, Kyuhong Shim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Recent decoder-only autoregressive, compute costs scale, costs scale quadratically, sequence length due, Recent decoder-only

Comments: Submitted to Interspeech 2026

Abstract:Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.
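
A minimal sketch of the attention pattern the abstract describes: every position attends to all conditioning tokens plus a sliding window of recent tokens. The function name, the `window` parameter, and the causal handling inside the conditioning prefix are assumptions, not details taken from the paper.

```python
from typing import List


def wand_style_mask(n_cond: int, n_gen: int, window: int) -> List[List[bool]]:
    """Boolean attention mask; mask[i][j] is True if query i may attend to key j.

    Positions 0..n_cond-1 are conditioning tokens, the rest are generated.
    Assumed rules (illustrative only):
      * attention is causal (j <= i);
      * conditioning keys are always visible (persistent global attention);
      * other keys are visible only within the last `window` positions.
    """
    total = n_cond + n_gen
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: keys up to and including i
            is_cond_key = j < n_cond
            in_window = i - j < window
            mask[i][j] = is_cond_key or in_window
    return mask


mask = wand_style_mask(n_cond=2, n_gen=4, window=2)
# Keys visible to the last generated position (index 5).
visible = [j for j, ok in enumerate(mask[5]) if ok]
print(visible)
```

With a fixed window, the number of keys each step attends to is bounded by `n_cond + window`, which is the property that keeps per-step cost independent of how many tokens have been generated.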

82. 【2604.08557】Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

Link: https://arxiv.org/abs/2604.08557

Authors: Arth Singh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion-based language models, Diffusion-based language, iteratively denoising masked, masked token sequences, language models

Comments: 11 pages, 1 figure, 6 tables

Abstract:Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.

83. 【2604.08556】EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

Link: https://arxiv.org/abs/2604.08556

Authors: Arth Singh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: simple temporal averaging, efficient sequence models, sequence models gain, efficient sequence, gain over simple

Comments: 10 pages, 1 figure, 7 tables

Abstract:What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
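
To make "fixed-coefficient accumulation" concrete, here is a minimal sketch of multi-timescale EMA traces over a sequence of vectors. The update rule h_t = alpha * h_{t-1} + (1 - alpha) * x_t is the standard EMA form; the specific decay values and the function name are assumptions, and the paper's architecture may parameterize its traces differently.

```python
from typing import List, Sequence


def ema_traces(xs: Sequence[Sequence[float]], alphas: Sequence[float]) -> List[List[float]]:
    """Return the final EMA trace for each decay rate in `alphas`.

    Each trace follows h_t = alpha * h_{t-1} + (1 - alpha) * x_t with h_0 = 0.
    The coefficients are fixed: they do not depend on the input, so there is
    no gating and no content-based retrieval.
    """
    dim = len(xs[0])
    traces = [[0.0] * dim for _ in alphas]
    for x in xs:
        for k, alpha in enumerate(alphas):
            traces[k] = [alpha * h + (1.0 - alpha) * v for h, v in zip(traces[k], x)]
    return traces


# One-hot "tokens" over a 3-symbol vocabulary: the sequence a, b, c.
seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fast, slow = ema_traces(seq, alphas=[0.5, 0.9])
print(fast)  # the most recent token dominates
print(slow)  # older tokens retain relatively more weight
```

Both traces blend all token identities into a single weighted sum, which illustrates the kind of lossy, data-independent compression the abstract refers to.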

84. 【2604.08555】SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

Link: https://arxiv.org/abs/2604.08555

Authors: Beny Rubinstein, Sergio Matos

Categories: Computation and Language (cs.CL)

Keywords: Physician-physician discussions, patient cases represent, Large Language Models, discussions of patient, represent a rich

Comments:

Abstract:Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors' and patients' privacy. These results place SynDocDis as a promising framework for advancing medical AI research ethically and responsibly through privacy-compliant synthetic physician dialogue generation with direct applications in medical education and clinical decision support.

85. 【2604.08554】Drift and selection in LLM text ecosystems

Link: https://arxiv.org/abs/2604.08554

Authors: Søren Riis

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: public text record, increasingly shaped, public text, Generated text enters, compresses public text

Comments:

Abstract:The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality, correctness or novelty -- deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.

86. 【2604.08553】GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback

Link: https://arxiv.org/abs/2604.08553

Authors: Ruiyao Xu, Kaize Ding

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, shown strong performance, superior semantic understanding

Comments: ICLR 2026

Abstract:Large Language Models (LLMs) have shown strong performance on text-attributed graphs (TAGs) due to their superior semantic understanding ability on textual node features. However, their effectiveness as predictors in the low-resource setting, where labeled nodes are severely limited and scarce, remains constrained since fine-tuning LLMs usually requires sufficient labeled data, especially when the TAG shows complex structural patterns. In essence, this paper targets two key challenges: (i) the difficulty of generating and selecting reliable pseudo labels on TAGs for LLMs, and (ii) the need to mitigate potential label noise when fine-tuning LLMs with pseudo labels. To counter the challenges, we propose a new framework, GNN-as-Judge, which can unleash the power of LLMs for few-shot semi-supervised learning on TAGs by incorporating the structural inductive bias of Graph Neural Networks (GNNs). Specifically, GNN-as-Judge introduces a collaborative pseudo-labeling strategy that first identifies the most influenced unlabeled nodes from labeled nodes, then exploits both the agreement and disagreement patterns between LLMs and GNNs to generate reliable labels. Furthermore, we develop a weakly-supervised LLM fine-tuning algorithm that can distill the knowledge from informative pseudo labels while mitigating the potential label noise. Experiments on multiple TAG datasets demonstrate that GNN-as-Judge significantly outperforms existing methods, particularly in low-resource regimes where labeled data are scarce.

87. 【2604.08549】VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

Link: https://arxiv.org/abs/2604.08549

Authors: Miloš Košprdić, Adela Ljajić, Bojana Bašaragin, Darija Medvecki, Lorenzo Cassano, Nikola Milošević

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: integrates retrieval-augmented generation, biomedical question answering, open-source expert system, claim verification mechanism, post-hoc claim verification

Comments:

Abstract:We introduce VerifAI, an open-source expert system for biomedical question answering that integrates retrieval-augmented generation (RAG) with a novel post-hoc claim verification mechanism. Unlike standard RAG systems, VerifAI ensures factual consistency by decomposing generated answers into atomic claims and validating them against retrieved evidence using a fine-tuned natural language inference (NLI) engine. The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module optimized for biomedical queries (MAP@10 of 42.7%), (2) a citation-aware Generative Component fine-tuned on a custom dataset to produce referenced answers, and (3) a Verification Component that detects hallucinations with state-of-the-art accuracy, outperforming GPT-4 on the HealthVer benchmark. Evaluations demonstrate that VerifAI significantly reduces hallucinated citations compared to zero-shot baselines and provides a transparent, verifiable lineage for every claim. The full pipeline, including code, models, and datasets, is open-sourced to facilitate reliable AI deployment in high-stakes domains.

88. 【2604.08477】SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Link: https://arxiv.org/abs/2604.08477

Authors: Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reinforcement Learning, significantly improved large, improved large language, Verifiable Rewards, RLVR

Comments: 23 Pages, 4 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at this https URL.

89. 【2604.08362】Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Link: https://arxiv.org/abs/2604.08362

Authors: Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, general-purpose user simulator, emergence of Large, Language Models

Comments:

Abstract:The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

90. 【2505.21472】Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Link: https://arxiv.org/abs/2505.21472

Authors: Mehrdad Fazli, Bowen Wei, Ahmet Sari, Ziwei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: achieve impressive performance, Large vision-language models, confidently describe objects, Large vision-language, achieve impressive

Comments:

Abstract:Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.

Information Retrieval

1. 【2604.09541】Trans-RAG: Query-Centric Vector Transformation for Secure Cross-Organizational Retrieval

Link: https://arxiv.org/abs/2604.09541

Authors: Yu Liu, Kun Peng, Wenxiao Zhang, Fangfang Yuan, Cong Cao, Wenxuan Lu, Yanbing Liu

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: Retrieval Augmented Generation, Augmented Generation, organizational boundaries face, boundaries face fundamental, face fundamental tensions

Comments: Accepted by DASFAA 2026

Abstract:Retrieval Augmented Generation (RAG) systems deployed across organizational boundaries face fundamental tensions between security, accuracy, and efficiency. Current encryption methods expose plaintext during decryption, while federated architectures prevent resource integration and incur substantial overhead. We introduce Trans-RAG, implementing a novel vector space language paradigm where each organization's knowledge exists in a mathematically isolated semantic space. At the core lies vector2Trans, a multi-stage transformation technique that enables queries to dynamically "speak" each organization's vector space "language" through query-centric transformations, eliminating decryption overhead while maintaining native retrieval efficiency. Security evaluations demonstrate near-orthogonal vector spaces with 89.90° angular separation and 99.81% isolation rates. Experiments across 8 retrievers, 3 datasets, and 3 LLMs show minimal accuracy degradation (3.5% decrease in nDCG@10) and significant efficiency improvements over homomorphic encryption.

2. 【2604.09537】Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

Link: https://arxiv.org/abs/2604.09537

Authors: Soroosh Tayebi Arasteh, Mehdi Joodaki, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Evidence-grounded reasoning requires, attaching retrieved text, Evidence-grounded reasoning, evidence, provided evidence supports

Comments:

3. 【2604.09494】RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

Link: https://arxiv.org/abs/2604.09494

Authors: Kyle Whitecross, Negin Rahimi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: language models post-trained, reasoning language models, language models, models post-trained, In-context retrieval

Comments: Code, data, and models available at [this https URL](https://github.com/kswhitecross/RecaLLM)

4. 【2604.09492】Dynamic Ranked List Truncation for Reranking Pipelines via LLM-generated Reference-Documents

Link: https://arxiv.org/abs/2604.09492

Authors: Nilanjan Sinhababu, Soumedhik Bharati, Debasis Ganguly, Pabitra Mitra

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, Language Models, Large Language, ranked list, Models

Comments:

Abstract:Large Language Models (LLMs) have been widely used in reranking. Computational overhead and large context lengths remain a challenging issue for LLM rerankers. Efficient reranking usually involves selecting a subset of the ranked list from the first stage, known as ranked list truncation (RLT). The truncated list is processed further by a reranker. For LLM rerankers, the ranked list is often partitioned and processed sequentially in batches to reduce the context length. Both these steps involve hyperparameters and topic-agnostic heuristics. Recently, LLMs have been shown to be effective for relevance judgment. Equivalently, we propose that LLMs can be used to generate reference documents that can act as a pivot between relevant and non-relevant documents in a ranked list. We propose methods to use these generated reference documents for RLT as well as for efficient listwise reranking. While reranking, we process the ranked list in either parallel batches of non-overlapping windows or overlapping windows with adaptive strides, improving the existing fixed stride setup. The generated reference documents are also shown to improve existing efficient listwise reranking frameworks. Experiments on TREC Deep Learning benchmarks show that our approach outperforms existing RLT-based approaches. In-domain and out-of-domain benchmarks demonstrate that our proposed methods accelerate LLM-based listwise reranking by up to 66% compared to existing approaches. This work not only establishes a practical paradigm for efficient LLM-based reranking but also provides insight into the capability of LLMs to generate semantically controlled documents using relevance signals.
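
One way to picture the "pivot" idea from the abstract: score a generated reference document with the same first-stage scorer as the candidates, then keep only the prefix of the ranked list that scores at least as high. This is a simplified reading for illustration; the function name and the thresholding rule are assumptions, and the paper's actual truncation method may differ.

```python
from typing import List, Tuple


def truncate_at_pivot(ranked: List[Tuple[str, float]], pivot_score: float) -> List[str]:
    """Keep the prefix of a ranked list that scores at least `pivot_score`.

    `ranked` holds (doc_id, score) pairs in rank order. `pivot_score` stands in
    for the score of an LLM-generated reference document (hypothetical setup).
    """
    kept = []
    for doc_id, score in ranked:
        if score < pivot_score:
            break  # everything from here on is treated as below the pivot
        kept.append(doc_id)
    return kept


ranked = [("a", 0.9), ("b", 0.7), ("c", 0.4), ("d", 0.3)]
print(truncate_at_pivot(ranked, pivot_score=0.5))
```

The truncation depth then varies per query with the pivot score rather than being a fixed, topic-agnostic cutoff.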

5. 【2604.09439】TME-PSR: Time-aware, Multi-interest, and Explanation Personalization for Sequential Recommendation

Link: https://arxiv.org/abs/2604.09439

Authors: Qingzhuo Wang, Leilei Wen, Juntao Chen, Kunyu Peng, Ruiyang Qin, Zhihua Wei, Wen Shen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: integrates Time-aware personalization, Personalized Sequential Recommendation, Multi-interest personalization, sequential recommendation model, Time-aware personalization

Comments:

Abstract:In this paper, we propose a sequential recommendation model that integrates Time-aware personalization, Multi-interest personalization, and Explanation personalization for Personalized Sequential Recommendation (TME-PSR). That is, we consider the differences across different users in temporal rhythm preference, multiple fine-grained latent interests, and the personalized semantic alignment between recommendations and explanations. Specifically, the proposed TME-PSR model employs a dual-view gated time encoder to capture personalized temporal rhythms, a lightweight multihead Linear Recurrent Unit architecture that enables fine-grained sub-interest modeling with improved efficiency, and a dynamic dual-branch mutual information weighting mechanism to achieve personalized alignment between recommendations and explanations. Extensive experiments on real-world datasets demonstrate that our method consistently improves recommendation accuracy and explanation quality, at a lower computational cost.

6. 【2604.09430】On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

Link: https://arxiv.org/abs/2604.09430

Authors: Dario Maio

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Retrieval-Augmented Generation, Text embeddings

Comments: 44 pages, 6 figures

Click to view abstract

Abstract:Text embeddings are central to modern information retrieval and Retrieval-Augmented Generation (RAG). While dense models derived from Large Language Models (LLMs) dominate current practice, recent work has explored quantum-inspired alternatives motivated by the geometric properties of Hilbert-like spaces and their potential to encode richer semantic structure. This paper presents an experimental framework for constructing quantum-inspired 1024-dimensional document embeddings based on overlapping windows and multi-scale aggregation. The pipeline combines semantic projections (e.g., EigAngle), circuit-inspired feature mappings, and optional teacher-student distillation, together with a fingerprinting mechanism for reproducibility and controlled evaluation. We introduce a set of diagnostic tools for hybrid retrieval, including static and dynamic interpolation between BM25 and embedding-based scores, candidate union strategies, and a conceptual alpha-oracle that provides an upper bound for score-level fusion. Experiments on controlled corpora of Italian and English documents across technical, narrative, and legal domains, using synthetic queries, show that BM25 remains a strong baseline, teacher embeddings provide stable semantic structure, and standalone quantum-inspired embeddings exhibit weak and unstable ranking signals. Distillation yields mixed effects, improving alignment in some cases but not consistently enhancing retrieval performance, while hybrid retrieval can recover competitive results when lexical and embedding-based signals are combined. Overall, the results highlight structural limitations in the geometry of quantum-inspired embeddings, including distance compression and ranking instability, and clarify their role as auxiliary components rather than standalone retrieval representations.
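
As a rough illustration of the score-level fusion mentioned in the abstract, the sketch below interpolates min-max-normalised BM25 and embedding scores over the candidate union and computes a per-query alpha-oracle on a small grid. The function names, the min-max normalisation, and the use of reciprocal rank as the metric are assumptions made for this example; the paper's exact definitions may differ.

```python
from typing import Dict, Iterable, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score map becomes all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25: Dict[str, float], emb: Dict[str, float], alpha: float) -> List[str]:
    """Rank the candidate union by alpha * BM25 + (1 - alpha) * embedding."""
    b, e = min_max(bm25), min_max(emb)
    docs = set(b) | set(e)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * e.get(d, 0.0) for d in docs}
    # Sort by descending fused score; break ties by document id.
    return sorted(docs, key=lambda d: (-fused[d], d))


def reciprocal_rank(ranking: List[str], relevant: str) -> float:
    """1 / rank of the relevant document, or 0.0 if it is absent."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == relevant:
            return 1.0 / rank
    return 0.0


def alpha_oracle(bm25, emb, relevant: str, grid: Iterable[float]) -> float:
    """Best reciprocal rank over a grid of alphas (a per-query upper bound)."""
    return max(reciprocal_rank(fuse(bm25, emb, a), relevant) for a in grid)


# Toy scores for one query; "d3" is seen only by BM25, "d4" only by the embedding.
bm25_scores = {"d1": 10.0, "d2": 2.0, "d3": 0.0}
emb_scores = {"d1": 0.1, "d2": 0.9, "d4": 0.5}
```

The oracle picks alpha with knowledge of the relevant document, so it is a diagnostic ceiling for static interpolation rather than a usable retrieval strategy.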

Computer Vision

1. 【2604.09547】Tango: Taming Visual Signals for Efficient Video Large Language Models

Link: https://arxiv.org/abs/2604.09547

Authors: Shukang Yin,Sirui Zhao,Hanchao Wang,Baozhi Jia,Xianquan Wang,Chaoyou Fu,Enhong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, efficient Video Large

Comments: Code is available at [this https URL](https://github.com/xjtupanda/Tango)

Click to view abstract

Abstract:Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.

2. 【2604.09535】EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

Link: https://arxiv.org/abs/2604.09535

Authors: Lulin Liu,Dayou Li,Yiqing Liang,Sicong Jiang,Hitesh Vijay,Hezhen Hu,Xuhai Xu,Zirui Liu,Srinivas Shakkottai,Manling Li,Zhiwen Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant advances, Large foundation models, Large foundation, embodied intelligence, made significant

Comments: [this https URL](https://ego-tl.github.io/)

Click to view abstract

Abstract:Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.

3. 【2604.09532】Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Link: https://arxiv.org/abs/2604.09532

Authors: Zibin Geng,Xuefeng Jiang,Jia Li,Zheng Li,Tian Wen,Lvhua Wu,Sheng Sun,Yuwei Wang,Min Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: label noise, vision-language models, parameter-efficient approach, approach for vision-language, Prompt

Comments:

Click to view abstract

Abstract:Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at this https URL.

4. 【2604.09531】VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Link: https://arxiv.org/abs/2604.09531

Authors: Guanyu Zhou,Yida Yin,Wenhao Chai,Shengbang Tong,Xingyu Fu,Zhuang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: viewpoint recognition, Vision-language models, spatial understanding, understanding and viewpoint, Vision-language

Comments: Project Page: [this https URL](https://zlab-princeton.github.io/VisionFoundry/)

Click to view abstract

5. 【2604.09529】VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Link: https://arxiv.org/abs/2604.09529

Authors: Wenyi Xiao,Xinchi Xu,Leilei Gan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Vision Language, Large Vision, achieve strong multimodal, Vision Language Models, Vision Language

Comments: 24 pages, ACL 2026 Main. Repository: [this https URL](https://github.com/Mr-Loevan/VL-Calibration)

Click to view abstract

6. 【2604.09527】Envisioning the Future, One Step at a Time

Link: https://arxiv.org/abs/2604.09527

Authors: Stefan Andreas Baumann,Jannik Wiese,Tommaso Martorella,Mahdi M. Kalayeh,Björn Ommer

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: extended interaction chains, Accurately anticipating, evolve requires models, anticipating how complex, simulate along extended

Comments: CVPR 2026. For code and models, see [this http URL](http://compvis.github.io/myriad)

Click to view abstract

Abstract:Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: this http URL.

7. 【2604.09511】RIRF: Reasoning Image Restoration Framework

Link: https://arxiv.org/abs/2604.09511

Authors: Wending Yan,Rongkai Zhang,Kaihua Tang,Yu Cheng,Qiankun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Universal image restoration, recover clean images, Universal image, aims to recover, recover clean

Comments:

Click to view abstract

Abstract:Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.

8. 【2604.09508】VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Link: https://arxiv.org/abs/2604.09508

Authors: Yucheng Shen,Jiulong Wu,Jizhou Huang,Dawei Yin,Lingyong Yan,Min Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: empowers Vision-Language Models, visually rich documents, Vision-Language Models, Models to retrieve, Visual Retrieval-Augmented Generation

Comments:

Click to view abstract

Abstract:Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.

9. 【2604.09480】Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Link: https://arxiv.org/abs/2604.09480

Authors: Shunkai Zhou,Zike Yan,Fei Xue,Dong Wu,Yuchen Deng,Hongbin Zha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: effectively resolving inconsistency, resolving inconsistency issues, sequential reconstruction framework, effectively resolving, inconsistency issues

Comments:

Click to view abstract

Abstract:We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: this https URL

10. 【2604.09478】Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer

Link: https://arxiv.org/abs/2604.09478

Authors: Muhammad Affan,Ville Lehtola,George Vosselman

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: complex indoor environments, scans remains challenging, parameters produce holes, point cloud sparsity, fixed fusion parameters

Comments: 8 pages, 5 figures, 2 tables. Accepted in ISPRS Archives 2026

Click to view abstract

Abstract:Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments -- such as cultural buildings -- where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.

11. 【2604.09473】Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

Link: https://arxiv.org/abs/2604.09473

Authors: Zhengxian Yang,Shengqi Wang,Shi Pan,Hongshuai Li,Haoxiang Wang,Lin Li,Guanjun Li,Zhengqi Wen,Borong Lin,Jianhua Tao,Tao Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fully immersive experiences, Fully immersive, tightly integrate, visual and auditory, augmented reality

Comments: Journal extension of CVPR 2025. See also [arXiv:2503.14359](https://arxiv.org/abs/2503.14359). Project page and code: [this https URL](https://github.com/Metaverse-AI-Lab-THU/ImViD)

Click to view abstract

Abstract:Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground--background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.

12. 【2604.09445】AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

Link: https://arxiv.org/abs/2604.09445

Authors: Mohammad Omama,Gabriele Berton,Eric Foxlin,Yelin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: resource-constrained edge devices, Precise and real-time, real-time visual localization, smart glasses, primary concerns

Comments:

Click to view abstract

Abstract:Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.
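
To make "parameter-less nearest-neighbor matching" concrete, here is a minimal sketch of mutual nearest-neighbor matching between two descriptor sets using cosine similarity. It is a generic illustration and not AsymLoc's code: the function names are hypothetical, descriptors are assumed to be non-zero vectors of equal length, and the mutual-consistency check is one common choice that the abstract does not confirm.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mutual_nearest_neighbors(
    query_desc: List[Sequence[float]],
    db_desc: List[Sequence[float]],
) -> List[Tuple[int, int]]:
    """Return (query_index, db_index) pairs that pick each other as best match."""
    sim = [[cosine(q, d) for d in db_desc] for q in query_desc]
    # Best database descriptor for every query descriptor.
    best_db = [max(range(len(db_desc)), key=lambda j: row[j]) for row in sim]
    # Best query descriptor for every database descriptor.
    best_q = [
        max(range(len(query_desc)), key=lambda i: sim[i][j])
        for j in range(len(db_desc))
    ]
    return [(i, j) for i, j in enumerate(best_db) if best_q[j] == i]


# Toy descriptors: in the asymmetric setting the query side would come from
# the Student model and the database side from the Teacher model.
student_query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_db = [[0.0, 2.0], [3.0, 0.0]]
matches = mutual_nearest_neighbors(student_query, teacher_db)
```

The matcher has no learned parameters, so it only works well when both models embed features into a shared descriptor space, which is what the distillation objectives in the abstract aim for.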

13. 【2604.09436】SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images

Link: https://arxiv.org/abs/2604.09436

Authors: Yuta Matsuzaki,Seiichi Uchida,Shumpei Takezaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion models trained, significantly degrading generation, degrading generation quality, high-frequency training artifacts, Diffusion models

Comments: Accepted at IJCNN2026

Click to view abstract

Abstract:Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.
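
The frequency-cutoff step can be illustrated with a toy 1-D low-pass filter built on a naive discrete Fourier transform. This is a simplified sketch only: SCoRe operates on 2-D images, and neither the SDEdit regeneration nor the RAPSD-based mapping from cutoff frequency to timestep is shown. The names `dft`, `idft`, and `low_pass` are illustrative.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def low_pass(x: List[float], cutoff: int) -> List[float]:
    """Zero every DFT bin whose frequency index exceeds `cutoff`."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins k and n - k carry the same absolute frequency.
        if min(k, n - k) > cutoff:
            spectrum[k] = 0
    return idft(spectrum)


# Toy signal: a constant level of 1.0 plus a highest-frequency ripple of +/- 0.5.
noisy = [1.0 + 0.5 * (-1) ** t for t in range(8)]
smoothed = low_pass(noisy, cutoff=1)
```

Here the cutoff removes the ripple and leaves the constant level; in the method described above, the discarded high-frequency band would then be regenerated by the diffusion model rather than left empty.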

14. 【2604.09429】Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Link: https://arxiv.org/abs/2604.09429

Authors: Wonbong Jang,Shikun Liu,Soubhik Sanyal,Juan Camilo Perez,Kam Woh Ng,Sanskar Agrawal,Juan-Manuel Perez-Rua,Yiannis Douratsos,Tao Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Recovering camera parameters, Recovering camera, vision and graphics, rendering scenes, viewpoints have long

Comments: 9 pages, 6 figures, 4 tables. Project page: [this https URL](https://wbjang.github.io/raysaspixels/)

Click to view abstract

Abstract:Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation; even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.

15. 【2604.09425】Do Vision Language Models Need to Process Image Tokens?

Link: https://arxiv.org/abs/2604.09425

Authors: Sambit Ghosh,R. Venkatesh Babu,Chirag Agarwal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Models, large language models, Language Models, Vision Language, achieved remarkable success

Comments: Accepted (Oral) at TRUE-V Workshop CVPR 2026

Click to view abstract

Abstract:Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or whether visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, i.e., their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation requires sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings question the assumption that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.

16. 【2604.09415】PhysInOne: Visual Physics Learning and Reasoning in One Suite

Link: https://arxiv.org/abs/2604.09415

Authors: Siyuan Zhou,Hejun Wang,Hu Cheng,Jinxi Li,Dongsheng Wang,Junwei Jiang,Yixiao Jin,Jiayue Huang,Shiwei Mao,Shangjia Liu,Yafei Yang,Hongkang Song,Shenxing Wei,Zihui Zhang,Peng Huang,Shijie Liu,Zhengli Hao,Hao Li,Yitian Li,Wenqi Zhou,Zhihan Zhao,Zongqi He,Hongtao Wen,Shouwang Huang,Peng Yun,Bowen Cheng,Pok Kazaf Fu,Wai Kit Lai,Jiahao Chen,Kaiyuan Wang,Zhixuan Sun,Ziqi Li,Haochen Hu,Di Zhang,Chun Ho Yuen,Bing Wang,Zhihua Wang,Chuhang Zou,Bo Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: physically-grounded training data, large-scale synthetic dataset, synthetic dataset addressing, large-scale synthetic, scarcity of physically-grounded

Comments: CVPR 2026. Siyuan, Hejun, Hu, Jinxi, Dongsheng, Junwei, Yixiao, Jiayue, and Shiwei are co-first authors. Project page: [this https URL](https://vlar-group.github.io/PhysInOne.html)

Click to view abstract

Abstract:We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.

17. 【2604.09411】SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data

Link: https://arxiv.org/abs/2604.09411

Authors: Qingwen Zhang,Xiaomeng Zhu,Chenhan Jiang,Patric Jensfelt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic perception requires, high-quality motion annotations, perception requires models, dynamic perception, predefined categories

Notes:

Abstract:Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates a large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences (~940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More detail can be found at this https URL.

18. 【2604.09405】EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure

Link: https://arxiv.org/abs/2604.09405

Authors: Junyeong Ahn,Seojin Yoon,Sungyong Baik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: grow increasingly prevalent, specific concepts-mostly explicit, concepts-mostly explicit content, remove specific concepts-mostly, models grow increasingly

Notes:

Abstract:As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts (mostly explicit content and many copyrighted characters or styles) has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degradation of unrelated concept fidelity, or depend on indirect inference-time adjustments that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by redirecting noisy latents during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Combined with previous approaches that either require erroneous modified model weights or provide weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.

19. 【2604.09391】Efficient Unlearning through Maximizing Relearning Convergence Delay

Link: https://arxiv.org/abs/2604.09391

Authors: Khoa Tran,Simon S. Woo

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Machine unlearning poses, unlearning poses challenges, Machine unlearning, removing mislabeled, poses challenges

Notes:

Abstract:Machine unlearning poses challenges in removing mislabeled, contaminated, or problematic data from a pretrained model. Current unlearning approaches and evaluation metrics are solely focused on model predictions, which limits insight into the model's true underlying data characteristics. To address this issue, we introduce a new metric called relearning convergence delay, which captures both changes in weight space and prediction space, providing a more comprehensive assessment of the model's understanding of the forgotten dataset. This metric can be used to assess the risk of forgotten data being recovered from the unlearned model. Based on this, we propose the Influence Eliminating Unlearning framework, which removes the influence of the forgetting set by degrading its performance and incorporates weight decay and noise injection into the model's weights, while maintaining accuracy on the retaining set. Extensive experiments show that our method outperforms existing methods on both existing metrics and our proposed relearning convergence delay metric, approaching ideal unlearning performance. We provide theoretical guarantees, including exponential convergence and upper bounds, as well as empirical evidence of strong retention and resistance to relearning in both classification and generative unlearning tasks.

20. 【2604.09386】Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

Link: https://arxiv.org/abs/2604.09386

Authors: Zhuohan Ouyang,Zhe Qian,Wenhuo Cui,Chaoqun Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires balancing target, balancing target modification, Instruction-guided image editing, Instruction-guided image, editing requires balancing

Notes:

Abstract:Instruction-guided image editing requires balancing target modification with non-target preservation. Recently, flow-based models have emerged as a strong and increasingly adopted backbone for instruction-guided image editing, thanks to their high fidelity and efficient deterministic ODE sampling. Building on this foundation, GRPO-based reward-driven post-training has been explored to directly optimize editing-specific rewards, improving instruction following and editing consistency. However, existing methods often suffer from noisy credit assignment: global exploration also perturbs non-target regions, inflating within-group reward variance and yielding noisy GRPO advantages. To address this, we propose RC-GRPO-Editing, a region-constrained GRPO post-training framework for flow-based image editing under deterministic ODE sampling. It suppresses background-induced nuisance variance to enable cleaner localized credit assignment, improving editing region instruction adherence while preserving non-target content. Concretely, we localize exploration via region-decoupled initial noise perturbations to reduce background-induced reward variance and stabilize GRPO advantages, and introduce an attention concentration reward that aligns cross-attention with the intended editing region throughout the rollout, reducing unintended changes in non-target regions. Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation.
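
The "noisy GRPO advantages" problem can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization (each rollout's reward minus the group mean, divided by the group standard deviation); the abstract does not give the paper's exact formula, and the reward values, `edit_quality`, and `background_noise` below are hypothetical numbers chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ranking(values):
    """Indices of the rollouts ordered from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical group of four rollouts for one editing instruction.
edit_quality = [0.2, 0.4, 0.6, 0.8]          # reward earned inside the edit region
background_noise = [0.3, -0.3, -0.3, 0.3]    # reward drift from non-target regions
noisy_rewards = [e + b for e, b in zip(edit_quality, background_noise)]

clean_adv = group_relative_advantages(edit_quality)
noisy_adv = group_relative_advantages(noisy_rewards)

# With background drift, the rollout ordering no longer follows edit quality.
print(ranking(clean_adv))
print(ranking(noisy_adv))
```

In this toy group the worst edit overtakes two better ones once background drift is added to the reward, which is the kind of credit-assignment noise that localized exploration is meant to suppress.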

21. 【2604.09368】Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Link: https://arxiv.org/abs/2604.09368

Authors: Lingfeng Huang,Huizhong Guo,Tianjun Wei,Yingpeng Du,Zhu Sun

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language model, recommender system evaluation, Large language, agents are increasingly, system evaluation

Notes:

Abstract:Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse. This is a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.

22. 【2604.09367】EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

Link: https://arxiv.org/abs/2604.09367

Authors: Shipeng Zhu,Ang Chen,Na Nie,Pengfei Fang,Min-Ling Zhang,Hui Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ancient inscriptions, suffered from centuries, centuries of environmental, environmental and human-induced, human-induced degradation

Notes: Accepted by CVPR 2026

Abstract:Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at this https URL.

23. 【2604.09366】Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

Link: https://arxiv.org/abs/2604.09366

Authors: Ying Zang,Yidong Han,Chaotao Ding,Yuanqi Hu,Deyi Ji,Qi Zhu,Xuanfu Li,Jin Ma,Lingyun Sun,Tianrun Chen,Lanyun Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenging task, Reconstructing dynamic, important yet challenging, Driven Geometry Purification, Reconstructing

Notes:

Abstract:Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.
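
The abstract does not spell out the heteroscedastic maximum-likelihood objective. As a hedged illustration only, the standard Gaussian form of such an objective is given below. The symbols are notation introduced here, not taken from the paper: $r_i$ is assumed to be a scalar cross-view projection residual for point $i$, and $\sigma_i^2$ its per-point variance, assumed to shrink as depth confidence grows.

```latex
\mathcal{L} \;=\; \sum_{i} \left( \frac{r_i^{2}}{2\sigma_i^{2}} \;+\; \frac{1}{2}\log \sigma_i^{2} \right)
```

Under this form, $1/\sigma_i^{2}$ acts as the probabilistic weight on each residual, while the $\log \sigma_i^{2}$ term keeps the model from inflating every variance to make the first term vanish.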

24. 【2604.09364】Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

Link: https://arxiv.org/abs/2604.09364

Authors: Farhad Nooralahzadeh,Omid Rohanian,Yi Zhang,Jonathan Fürst,Kurt Stockinger

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Arbitration Crossover, blue banana, problem of perception, Vision-Language Model, Arbitration Crossover

Notes:

25. 【2604.09352】LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

Link: https://arxiv.org/abs/2604.09352

Authors: Aytaç Sekmen,Fatih Emre Gunes,Furkan Horoz,Hüseyin Umut Işık,Mehmet Alp Ozaydin,Onur Altay Topaloglu,Şahin Umutcan Üstündaş,Yurdasen Alp Yeni,Halil Ersin Soken,Erol Sahin,Ramazan Gokberk Cinbis,Sinan Kalkan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular Depth Estimation, autonomous lunar rover, lunar rover navigation, Monocular Depth, Depth Estimation

Notes: This paper will be published in CVPRW 2026

Abstract:Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation using electro-optical cameras. However, deploying terrestrial MDE networks to the Moon brings a severe domain gap due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on analogs that fail to replicate these conditions and lack actual metric ground truth. To address this, we present LuMon, a comprehensive benchmarking framework to evaluate MDE methods for lunar exploration. We introduce novel datasets featuring high-quality stereo ground truth depth from the real Chang'e-3 mission and the CHERI dark analog dataset. Utilizing this framework, we conduct a systematic zero-shot evaluation of state-of-the-art architectures across synthetic, analog, and real datasets. We rigorously assess performance against mission-critical challenges like craters, rocks, extreme shading, and varying depth ranges. Furthermore, we establish a sim-to-real domain adaptation baseline by fine-tuning a foundation model on synthetic data. While this adaptation yields drastic in-domain performance gains, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Our extensive analysis reveals the inherent limitations of current networks and sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation.

26. 【2604.09349】Visually-Guided Policy Optimization for Multimodal Reasoning

Link: https://arxiv.org/abs/2604.09349

Authors: Zengbin Wang,Feng Xiong,Liang Lin,Xuecai Hu,Yong Wang,Yanlin Wang,Man Zhang,Xiangxiang Chu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Reinforcement learning, verifiable rewards, vision-language models, visual, learning with verifiable

Notes: ACL 2026

27. 【2604.09330】VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

Link: https://arxiv.org/abs/2604.09330

Authors: Xiaolei Lang,Yang Wang,Yukun Zhou,Chaojun Ni,Kerui Li,Jiagang Zhu,Tianze Liu,Jiajun Lv,Xingxing Zuo,Yun Ye,Guan Huang,Xiaofeng Wang,Zheng Zhu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale human teleoperation, perform increasingly complex, Recent advances, robot foundation models, foundation models trained

Notes:

Abstract:Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.

28. 【2604.09327】From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

Link: https://arxiv.org/abs/2604.09327

Authors: Narges Rashvand,Shanle Yao,Armin Danesh Pazho,Babak Rahimi Ardabili,Hamed Tabkhi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pose-based Video Anomaly, Video Anomaly Detection, gained significant attention, Pose-based Video, Video Anomaly

Notes:

Abstract:Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at this https URL.
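
The gap between frame-level and event-level scores can be made concrete with a small sketch of tIoU-based event matching. This is not the paper's evaluation code: the helper names (`frames_to_events`, `tiou`, `event_f1`), the greedy one-to-one matching rule, and the toy label sequences are all assumptions made for illustration.

```python
def frames_to_events(labels):
    """Convert a per-frame 0/1 sequence into (start, end) events, end exclusive."""
    events, start = [], None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(labels)))
    return events


def tiou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def event_f1(pred_events, gt_events, threshold):
    """Greedy one-to-one matching: a prediction is a true positive if its
    best still-unmatched ground-truth event reaches the tIoU threshold."""
    matched, true_pos = set(), 0
    for pred in pred_events:
        best_score, best_idx = 0.0, None
        for idx, gt in enumerate(gt_events):
            score = tiou(pred, gt)
            if idx not in matched and score > best_score:
                best_score, best_idx = score, idx
        if best_idx is not None and best_score >= threshold:
            matched.add(best_idx)
            true_pos += 1
    precision = true_pos / len(pred_events) if pred_events else 0.0
    recall = true_pos / len(gt_events) if gt_events else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example: one true 4-frame event, and a flickering prediction.
gt_frames = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
pred_frames = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
gt_events = frames_to_events(gt_frames)
pred_events = frames_to_events(pred_frames)

for thr in (0.2, 0.5):
    print(thr, event_f1(pred_events, gt_events, thr))
```

The flickering prediction agrees with the ground truth on half of the frames, yet it breaks into four one-frame events, none of which overlaps the true event by more than a tIoU of 0.25.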

29. 【2604.09326】Multimodal Anomaly Detection for Human-Robot Interaction

Link: https://arxiv.org/abs/2604.09326

Authors: Guilherme Ribeiro,Iordanis Antypas,Leonardo Bizzaro,João Bimbo,Nuno Cruz Garcia

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ensuring safety, requires the timely, unsafe behaviours, safety and reliability, unexpected events

Notes:

Abstract:Ensuring safety and reliability in human-robot interaction (HRI) requires the timely detection of unexpected events that could lead to system failures or unsafe behaviours. Anomaly detection thus plays a critical role in enabling robots to recognize and respond to deviations from normal operation during collaborative tasks. While reconstruction models have been actively explored in HRI, approaches that operate directly on feature vectors remain largely unexplored. In this work, we propose MADRI, a framework that first transforms video streams into semantically meaningful feature vectors before performing reconstruction-based anomaly detection. Additionally, we augment these visual feature vectors with the robot's internal sensors' readings and a Scene Graph, enabling the model to capture both external anomalies in the visual environment and internal failures within the robot itself. To evaluate our approach, we collected a custom dataset consisting of a simple pick-and-place robotic task under normal and anomalous conditions. Experimental results demonstrate that reconstruction on vision-based feature vectors alone is effective for detecting anomalies, while incorporating other modalities further improves detection performance, highlighting the benefits of multimodal feature reconstruction for robust anomaly detection in human-robot collaboration.

30. 【2604.09324】Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction

Link: https://arxiv.org/abs/2604.09324

Authors: Yuze Su,Hongsong Wang,Jie Gui,Liang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: topology-aware human avatars, monocular videos remains, vision and graphics, photorealistic and topology-aware, remains a significant

Notes: The code is on GitHub: [this https URL](https://github.com/Su245811YZ/SFGS)

Abstract:Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. SFGS uses both a spatial-only triplane and a time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware Gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is on GitHub: this https URL

31. 【2604.09305】VAGNet: Vision-based accident anticipation with global features

Link: https://arxiv.org/abs/2604.09305

Authors: Vipooshan Vipulananthan,Charith D. Chitraranjan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fatalities and injuries, Abstract, globe, driving, accidents

Notes:

Abstract:Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.

32. 【2604.09304】GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

Link: https://arxiv.org/abs/2604.09304

Authors: Jiayuan Lu,Rengan Xie,Xuancheng Jin,Zhizhen Wu,Qi Ye,Tian Xie,Hujun Bao,Rui Wang,Yuchi Huo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesizing photorealistic images, DTV Field, PRR, PBR, foundation of synthesizing

Notes:

Abstract:For decades, Physically-Based Rendering (PBR) has been the foundation of synthesizing photorealistic images, and is therefore sometimes loosely referred to as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism has additional reliance on the realistic digital model of geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: the explicit simulation of PRR is encumbered by unreachable realistic digital models for real-world existence, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach of mitigating the P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes like G-buffers with text prompts, and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.

33. 【2604.09282】Characterizing Lidar Range-Measurement Ambiguity due to Multiple Returns

Link: https://arxiv.org/abs/2604.09282

Authors: Jason H. Rife,Yifan Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly automated vehicles, Reliable position, position and attitude, attitude sensing, sensing is critical

Notes: Proceedings of the 38th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2025), Baltimore, Maryland, September 2025, pp. 1949-1963

Abstract:Reliable position and attitude sensing is critical for highly automated vehicles that operate on conventional roadways. Lidar sensors are increasingly incorporated into pose-estimation systems. Despite its great utility, lidar is a complex sensor, and its performance in roadway environments is not yet well understood. For instance, it is often assumed in lidar-localization algorithms that a lidar will always identify a unique surface along a given raypath. However, this assumption is not always true, as ample prior evidence exists to suggest that lidar units may generate measurements probabilistically when more than one scattering surface appears within the lidar's conical beam. In this paper, we analyze lidar datasets to characterize cases with probabilistic returns along particular raypaths. Our contribution is to present representative cumulative distribution functions (CDFs) for raypaths observed by two different mechanically rotating lidar units with stationary bases. In subsequent discussion, we outline a qualitative methodology to assess the effect of probabilistic multi-return cases on lidar-based localization.

34. 【2604.09260】Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

Link: https://arxiv.org/abs/2604.09260

Authors: Maciej Janicki,Aleksander Plocharski,Przemyslaw Musialski

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: architectural elements independently, downstream procedural reconstruction, object detectors typically, detectors typically treat, typically treat architectural

Notes: 4 pages, 4 figures, EUROGRAPHICS 2026 Short Paper

Abstract:Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.

35. 【2604.09253】Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

Link: https://arxiv.org/abs/2604.09253

Authors: Yuqin Lan,Gen Li,Yuanze Hu,Weihao Shen,Zhaoxin Fan,Faguo Wu,Xiao Zhang,Laurence T. Yang,Zhiming Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: powerful but remain, remain vulnerable, Vision-Language Models, Attack Success Rate, surrogate-target settings

Notes: 14 pages, 9 figures

Abstract:Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.

36. 【2604.09249】FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

Link: https://arxiv.org/abs/2604.09249

Authors: Kaidong Feng, Zhuoxuan Huang, Huizhong Guo, Yuting Jin, Xinyu Chen, Yue Liang, Yifei Gai, Li Zhou, Yunshan Ma, Zhu Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Fashion understanding requires, expert-level fashion understanding, requires both visual, visual perception, Fashion understanding

Notes:

37. 【2604.09244】2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

Link: https://arxiv.org/abs/2604.09244

Authors: Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li, Ziyun Zhang, Hong Gao, Yuchen Huang, Yutong Xu, Guojie Luo, Xiang Chen

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: VLA models, VLA, embodied intelligence, MVLA models, Recent VLA models

Notes:

Abstract:Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization method tailored to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead. Our code is coming soon.
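To make the idea of salience-based token pruning concrete, here is a minimal sketch, not the paper's tri-stage method. It assumes each token already has a scalar salience score and that 2D and 3D tokens get separate keep ratios. The function name `prune_tokens`, the token labels, the scores, and the ratios are all hypothetical.

```python
def prune_tokens(tokens, salience, keep_ratio):
    """Keep the highest-salience tokens, preserving their original order."""
    if len(tokens) != len(salience):
        raise ValueError("tokens and salience must have the same length")
    # Always keep at least one token so the sequence is never empty.
    keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the `keep` largest salience scores.
    top = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


# Hypothetical 2D and 3D token streams with made-up salience scores.
tokens_2d = ["img0", "img1", "img2", "img3"]
tokens_3d = ["pts0", "pts1", "pts2", "pts3"]

# Assumption for illustration: a different keep ratio per modality.
kept = (
    prune_tokens(tokens_2d, [0.9, 0.1, 0.4, 0.7], keep_ratio=0.5)
    + prune_tokens(tokens_3d, [0.2, 0.8, 0.3, 0.1], keep_ratio=0.25)
)
print(kept)
```

Fewer tokens reach the later layers, which is where the speedup comes from. How salience is measured at each stage is the paper's contribution and is not modeled here.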

38. 【2604.09232】Neural Distribution Prior for LiDAR Out-of-Distribution Detection

Link: https://arxiv.org/abs/2604.09232

Authors: Zizhao Li, Zhengkang Xiang, Jiayang Ao, Feng Liu, Joseph West, Kourosh Khoshelham

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous driving due, OOD, visibility conditions, critical for autonomous, autonomous driving

Notes: CVPR 2026

Abstract:LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31% on the STU test set, which is more than 10× higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.

39. 【2604.09231】Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

Link: https://arxiv.org/abs/2604.09231

Authors: Huiang He, Shengchu Zhao, Jianwen Huang, Jie Li, Jiaqi Wu, Hu Zhang, Pei Tang, Heliang Zheng, Yukun Li, Rongfei Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: incomplete texture coverage, texture generation, texture, generation, recent advances

Notes: 13 pages

Abstract:Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.

40. 【2604.09220】TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference

Link: https://arxiv.org/abs/2604.09220

Authors: Muhammad Hannan Akhtar, Ihab Amer, Tamer Shanableh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Implicit neural video, enable constant time, constant time frame, representations encode entire, neural video representations

Notes: Submitted to "Computers and Electrical Engineering", Elsevier

Abstract:Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant-time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate or high capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of low-precision inference is examined through both post-training quantization and quantization-aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality-efficiency trade-offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV-style models in resource-constrained and real-time environments. The official implementation is available at https://github.com/HannanAkhtar/TinyNeRV-Implementation.
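As a rough illustration of what post-training quantization does to a weight tensor, the sketch below applies symmetric per-tensor int8 quantization to a short list of floats. This is a generic textbook scheme chosen for illustration, not the configuration used in the paper. The helper names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Made-up weights standing in for one layer of a small network.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest level bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Quantization-aware training differs in that this rounding is simulated during training, so the network can adapt to it. That part is not shown here.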

41. 【2604.09213】SHIFT: Steering Hidden Intermediates in Flow Transformers

Link: https://arxiv.org/abs/2604.09213

Authors: Nina Konovalova, Andrey Kuznetsov, Aibek Alanov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion models, leading approaches, approaches for high-fidelity, DiT-based diffusion models, Recent DiT-based diffusion

Notes:

Abstract:Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired style domain or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.
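For readers unfamiliar with activation steering, the following sketch shows one common form of it: subtracting the component of a hidden activation that lies along a concept direction. It is an assumption that this projection-removal form resembles what SHIFT applies. The abstract does not specify the update rule, and the vectors and the `steer` function here are hypothetical.

```python
import math


def steer(hidden, direction, strength=1.0):
    """Subtract `strength` times the component of `hidden` along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        return list(hidden)
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the concept direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - strength * coeff * u for h, u in zip(hidden, unit)]


# Made-up activation and concept direction.
hidden = [3.0, 4.0, 1.0]
concept = [0.0, 2.0, 0.0]

suppressed = steer(hidden, concept, strength=1.0)  # concept component removed
softened = steer(hidden, concept, strength=0.5)    # concept component halved
print(suppressed, softened)
```

A negative `strength` would amplify the direction instead of suppressing it, which is one way the same mechanism could push generations toward a target concept.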

42. 【2604.09210】Adding Another Dimension to Image-based Animal Detection

Link: https://arxiv.org/abs/2604.09210

Authors: Vandita Shukla, Fabio Remondino, Benjamin Risse

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: animals inherently reduces, inherently reduces, bounding boxes, Skinned Multi Animal, Multi Animal Linear

Notes: CV4Animals Workshop 2025

Abstract:Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about the animal's orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.
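The step of projecting a 3D bounding box into 2D image space can be sketched with a plain pinhole camera model. This is a simplified illustration that assumes an axis-aligned cuboid already expressed in camera coordinates and a single focal length. The paper's pose refinement and face visibility metrics are not modeled, and all names and numbers below are hypothetical.

```python
from itertools import product


def cuboid_corners(center, size):
    """Return the 8 corners of an axis-aligned cuboid in camera coordinates."""
    cx, cy, cz = center
    w, h, d = size
    return [
        (cx + sx * w / 2, cy + sy * h / 2, cz + sz * d / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def project(point, focal, principal):
    """Pinhole projection of a 3D point onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + principal[0], focal * y / z + principal[1])


def bbox_2d(corners, focal, principal):
    """Tightest 2D box (u_min, v_min, u_max, v_max) around the projected corners."""
    pts = [project(p, focal, principal) for p in corners]
    us = [u for u, _ in pts]
    vs = [v for _, v in pts]
    return (min(us), min(vs), max(us), max(vs))


# Made-up cuboid 5 units in front of a camera with a 640x480 image.
corners = cuboid_corners(center=(0.0, 0.0, 5.0), size=(2.0, 1.0, 2.0))
box = bbox_2d(corners, focal=100.0, principal=(320.0, 240.0))
print(box)
```

The 2D box is set by the cuboid face nearest the camera, since closer points project farther from the principal point.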

43. 【2604.09206】Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

Link: https://arxiv.org/abs/2604.09206

Authors: Jiahao Wang, Zikun Xu, Yuner Zhang, Zhongwei Jiang, Chenyang Lu, Shuocheng Yang, Yuxuan Wang, Jiaru Zhong, Chuang Zhang, Shaobing Xu, Jianqiang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering extended sensing, enhancing autonomous driving, extended sensing horizons, autonomous driving, offering extended

Notes: Accepted by CVPR 2026

Abstract:Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.

44. 【2604.09201】CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Link: https://arxiv.org/abs/2604.09201

Authors: Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: plausible camera movements, physically plausible camera, aims to synthesize, flexible and physically, physically plausible

Notes:

Abstract:Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.

45. 【2604.09199】Globally Optimal Pose from Orthographic Silhouettes

Link: https://arxiv.org/abs/2604.09199

Authors: Agniva Sengupta, Dilara Kuş, Jianning Li, Stefan Zachow

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: solve the problem, problem of determining, Pattern Recognition, Vision and Pattern, Computer Vision

Notes:

Abstract:We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. Code, data, and supplementary in: this https URL

Journal reference: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. Denver, Colorado

46. 【2604.09197】Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma

Link: https://arxiv.org/abs/2604.09197

Authors: Francesca Fati, Felipe Coutinho, Marika Reinius, Marina Rosanu, Gabriel Funingana, Luigi De Vitis, Gabriella Schivardi, Hannah Clayton, Alice Traversa, Zeyu Gao, Guilherme Penteado, Shangqi Gao, Francesco Pastori, Ramona Woitek, Maria Cristina Ghioni, Giovanni Damiano Aletti, Mercedes Jimenez-Linan, Sarah Burge, Nicoletta Colombo, Evis Sala, Maria Francesca Spadea, Timothy L. Kline, James D. Brenton, Jaime Cardoso, Francesco Multinu, Elena De Momi, Mireia Crispin-Ortuzar, Ines P. Machado

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Purpose, Chemotherapy Response Score, CRS, predict CRS, Response

Notes:

Abstract:Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.

47. 【2604.09181】MixFlow: Mixed Source Distributions Improve Rectified Flows

Link: https://arxiv.org/abs/2604.09181

Authors: Nazir Nayal, Christopher Wewer, Jan Eric Lenssen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: highly curved generative, slow iterative sampling, iterative sampling caused, Diffusion models, generate diverse

Notes:

Abstract:Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation with two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing $\kappa\texttt{-FC}$, a general formulation that conditions the source distribution on an arbitrary signal $\kappa$ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a $\kappa\texttt{-FC}$-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with fewer required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under a fixed sampling budget. Code available at: this https URL
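The link between path curvature and sampling cost can be seen in a small sketch of the standard rectified flow setup, which is general background rather than MixFlow itself. A training pair is a straight-line interpolation between a source sample and a data sample, and the regression target is their difference. The mixed source distribution and $\kappa$-conditioning are not modeled, and the sample values are made up.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from source x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the flow model along that straight line."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up source and data samples.
x0 = [0.0, 2.0]
x1 = [1.0, 0.0]
midpoint = interpolate(x0, x1, 0.5)
v = target_velocity(x0, x1)

# If the learned path were perfectly straight, the velocity would be constant
# and a single Euler step would land exactly on the data sample.
one_step = euler_sample(x0, lambda x, t: v, steps=1)
print(midpoint, v, one_step)
```

Curved paths break this property: the velocity changes along the trajectory, so coarse Euler steps accumulate error and more steps are needed.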

48. 【2604.09169】UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation

Link: https://arxiv.org/abs/2604.09169

Authors: Le-Van Thai, Tien Dat Nguyen, Hoai Nhan Pham, Lan Anh Dinh Thi, Duy-Dong Nguyen, Ngoc Lam Quang Bui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computational pathology remains, pathology remains challenging, remains challenging due, scarce pixel-level annotations, computational pathology

Notes: Accepted at CVPR 2026 Workshop. 11 pages, 5 figures, 4 tables

Abstract:Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: this https URL

49. 【2604.09168】ELT: Elastic Looped Transformers for Visual Generation

Link: https://arxiv.org/abs/2604.09168

Authors: Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: introduce Elastic Looped, Elastic Looped Transformers, highly parameter-efficient class, recurrent transformer architecture, Looped Transformers

Notes:

Abstract:We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.
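The weight-sharing idea can be illustrated with a toy model in which one block is applied repeatedly. This is a deliberately tiny stand-in, assuming a scalar affine map followed by tanh in place of a transformer block. The class `LoopedModel` and its parameters are hypothetical, and the ILSD training objective is not shown.

```python
import math


class LoopedModel:
    """Toy looped network: one shared block applied a variable number of times."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def block(self, x):
        # Stand-in for a transformer block; the same parameters are used every loop.
        return [math.tanh(self.weight * v + self.bias) for v in x]

    def forward(self, x, loops):
        for _ in range(loops):
            x = self.block(x)
        return x

    def num_parameters(self):
        # Independent of how many loops are run at inference time.
        return 2


model = LoopedModel(weight=0.5, bias=0.1)
x = [1.0, -1.0]

fast = model.forward(x, loops=1)  # cheaper, lower-depth configuration
full = model.forward(x, loops=4)  # full-depth configuration, same parameters
print(fast, full, model.num_parameters())
```

Compute scales with the loop count while the parameter count stays fixed, which is the trade-off the abstract describes as Any-Time inference.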

50. 【2604.09167】MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Link: https://arxiv.org/abs/2604.09167

Authors: Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: Vision-language models, scenes remains underexplored, remains underexplored, multimodal understanding, reasoning

Notes:

Abstract:Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.

51. 【2604.09164】Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

Link: https://arxiv.org/abs/2604.09164

Authors: Yicheng Qiu, Keiji Yanai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localize action segments, aims to identify, identify and localize, segments within untrimmed, pivotal task

Notes: ICME 2026

Abstract:Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM (TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.

52. 【2604.09151】Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery

Link: https://arxiv.org/abs/2604.09151

Authors: Sara Ameli

Categories: Computer Vision and Pattern Recognition (cs.CV); Pattern Formation and Solitons (nlin.PS)

Keywords: context-aware computer-assisted interventions, enabling context-aware computer-assisted, Accurate segmentation, workflow analysis, computer-assisted interventions

Notes:

Abstract:Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures (UNet, UNet++, DeepLabV3, Attention UNet, and SegFormer) on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.
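The compound loss mentioned in the abstract can be sketched in a few lines. This is a minimal version over a flat list of pixels, assuming per-pixel class probabilities, integer labels, a soft Dice term averaged over classes, and equal weighting of the two terms. The weighting and smoothing constant are assumptions, since the abstract does not give them.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-probability of the true class (needs nonzero probability)."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def dice_loss(probs, targets, num_classes, eps=1e-6):
    """One minus the soft Dice score, averaged over classes."""
    total = 0.0
    for c in range(num_classes):
        # Overlap between predicted probability mass and the true mask of class c.
        inter = sum(p[c] for p, t in zip(probs, targets) if t == c)
        denom = sum(p[c] for p in probs) + sum(1 for t in targets if t == c)
        total += (2 * inter + eps) / (denom + eps)
    return 1 - total / num_classes


def compound_loss(probs, targets, num_classes, dice_weight=1.0):
    """Cross entropy plus weighted Dice loss (weight of 1.0 is an assumption)."""
    return cross_entropy(probs, targets) + dice_weight * dice_loss(probs, targets, num_classes)


# Two made-up pixels with two classes each.
targets = [0, 1]
perfect = compound_loss([[1.0, 0.0], [0.0, 1.0]], targets, num_classes=2)
uncertain = compound_loss([[0.6, 0.4], [0.3, 0.7]], targets, num_classes=2)
print(perfect, uncertain)
```

Because the Dice term is averaged per class rather than per pixel, rare classes such as thin instruments contribute as much as large background regions, which is why it helps with class imbalance.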

53. 【2604.09145】Deep Light Pollution Removal in Night Cityscape Photographs

Link: https://arxiv.org/abs/2604.09145

Authors: Hao Wang, Xiaolin Wu, Xi Zhang, Baoqing Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pervasive artificial lighting, light pollution induced, urban environments, photography is severely, severely degraded

Notes: 17 pages, supplementary material included

Abstract:Nighttime photography is severely degraded by light pollution induced by pervasive artificial lighting in urban environments. After long-range scattering and spatial diffusion, unwanted artificial light overwhelms natural night luminance, generates skyglow that washes out the view of stars and celestial objects and produces halos and glow artifacts around light sources. Unlike nighttime dehazing, which aims to improve detail legibility through thick air, the objective of light pollution removal is to restore the pristine night appearance by neutralizing the radiative footprint of ground lighting. In this paper, we introduce a physically-based degradation model that adds to the previous ones for nighttime dehazing two critical aspects: (i) anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. In addition, we construct a training strategy that leverages a large generative model and synthetic-real coupling to compensate for the scarcity of paired real data and enhance generalization. Extensive experiments demonstrate that the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery than prior nighttime restoration methods.

54. 【2604.09142】Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

Link: https://arxiv.org/abs/2604.09142

Authors: Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image-driven stereo matching, past decade, open challenge, remarkable advances, advances in image-driven

Notes:

Abstract:Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.

55. 【2604.09132】Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

链接：https://arxiv.org/abs/2604.09132

作者：Rui Xu,Dafei Qin,Kaichun Qiao,Qiujie Dong,Huaijin Pi,Qixuan Zhang,Longwen Zhang,Lan Xu,Jingyi Yu,Wenping Wang,Taku Komura

类目：Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR)

关键词：demonstrated remarkable potential, Recent advancements, generating artist-quality meshes, advancements in autoregressive, autoregressive transformers

备注：

点击查看摘要

Abstract:Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.

56. 【2604.09127】FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

链接：https://arxiv.org/abs/2604.09127

作者：Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：met alongside reliable, alongside reliable accuracy, increasingly important, important for deployment, deployment on edge

备注：

点击查看摘要

Abstract:Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global--local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at this https URL.

57. 【2604.09125】Few-Shot Personalized Age Estimation

链接：https://arxiv.org/abs/2604.09125

作者：Jakub Paplhám,Vojtěch Franc,Artem Moroz

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：independent sample, learning a global, global mapping, mapping from appearance, estimation methods treat

备注：

点击查看摘要

Abstract:Existing age estimation methods treat each face as an independent sample, learning a global mapping from appearance to age. This ignores a well-documented phenomenon: individuals age at different rates due to genetics, lifestyle, and health, making the mapping from face to age identity-dependent. When reference images of the same person with known ages are available, we can exploit this context to personalize the estimate. The only existing benchmark for this task (NIST FRVT) is closed-source and limited to a single reference image. In this work, we introduce OpenPAE, the first open benchmark for $N$-shot personalized age estimation with strict evaluation protocols. We establish a hierarchy of increasingly sophisticated baselines: from arithmetic offset, through closed-form Bayesian linear regression, to a conditional attentive neural process. Our experiments show that personalization consistently improves performance, that the gains are not merely domain adaptation, and that nonlinear methods significantly outperform simpler alternatives. We release all models, code, protocols, and evaluation splits.
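
As a rough illustration of the simplest baseline named above, the arithmetic offset can be read as shifting a global model's prediction by the average error that model makes on the person's reference images. The sketch below assumes that reading; the function name `personalized_age` and its arguments are hypothetical and are not taken from the OpenPAE release.

```python
from statistics import mean


def personalized_age(query_pred, ref_preds, ref_ages):
    """Shift a global age prediction by the mean error seen on reference images.

    query_pred: age predicted by a global (non-personalized) model for the query face.
    ref_preds:  the same model's predictions on N reference images of that person.
    ref_ages:   the known true ages for those reference images.
    """
    if len(ref_preds) != len(ref_ages):
        raise ValueError("ref_preds and ref_ages must have the same length")
    if not ref_preds:
        # Zero-shot case: no references, so fall back to the global prediction.
        return query_pred
    # Assumed definition of the offset: mean signed residual (true - predicted).
    offset = mean([age - pred for age, pred in zip(ref_ages, ref_preds)])
    return query_pred + offset
```

If the global model overestimates this person by three years on both references, the query estimate is pulled down by the same amount.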

58. 【2604.09114】FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

链接：https://arxiv.org/abs/2604.09114

作者：François Gardères,Camille-Sovanneary Gauthier,Jean Ponce,Shizhe Chen

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Composed image retrieval, Composed image, aims to retrieve, textual description, retrieve a target

备注：

点击查看摘要

Abstract:Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.

59. 【2604.09106】Detecting Diffusion-generated Images via Dynamic Assembly Forests

链接：https://arxiv.org/abs/2604.09106

作者：Mengxin Fu,Yuezun Li

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：generating high-quality images, Diffusion models, causing serious security, security concerns, generating high-quality

备注：

点击查看摘要

Abstract:Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we take a fresh look at such alternatives and propose a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at this https URL.

60. 【2604.09101】CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion

链接：https://arxiv.org/abs/2604.09101

作者：Akshit Jindal,Saket Anand,Chetan Arora,Vikram Goyal

类目：Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Machine Learning, computational resources increasingly, resources increasingly outsource, Organisations with limited, increasingly outsource model

备注： 17 pages (8 main + 2 references + 7 supplementary), Accepted to CVPR Findings 2026

点击查看摘要

Abstract:Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference also fail to answer the critical question, "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI's reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.

61. 【2604.09100】Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

链接：https://arxiv.org/abs/2604.09100

作者：Gabriele Mario Caddeo,Pasquale Marra,Lorenzo Natale

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：metric-scale amodal object, physically grounded approach, propose a multimodal, grounded approach, approach for metric-scale

备注： 27 pages, 10 figures, under review

点击查看摘要

Abstract:We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand--object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.

62. 【2604.09096】Off-the-shelf Vision Models Benefit Image Manipulation Localization

链接：https://arxiv.org/abs/2604.09096

作者：Zhengxuan Zhang,Keji Song,Junmin Hu,Ao Luo,Yuezun Li

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

关键词：Image manipulation localization, research directions due, separate research directions, general vision tasks, manipulation localization

备注：

点击查看摘要

Abstract:Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.

63. 【2604.09088】Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

链接：https://arxiv.org/abs/2604.09088

作者：Yutong Zhang,Jiaxin Chen,Honglin Chen,Kaiqi Zheng,Shengcai Liao,Hanwen Zhong,Weixin Li,Yunhong Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Memory-efficient transfer learning, recently achieved promising, adapting pre-trained models, achieved promising performance, Memory-efficient transfer

备注： CVPR2026 Accepted

点击查看摘要

Abstract:Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing both the number of trainable parameters and the memory consumed during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances performance by mutually distilling the frozen backbones and learnable side networks during fine-tuning, and discards the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also markedly improves accuracy compared to SOTA approaches. The source code is available at this https URL.

64. 【2604.09076】Cross-Modal Knowledge Distillation from Spatial Transcriptomics to Histology

链接：https://arxiv.org/abs/2604.09076

作者：Arbel Hizmi,Artemii Bakulin,Shai Bagon,Nir Yosef

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：molecularly rich description, enabling unsupervised discovery, spatially coherent regions, Spatial transcriptomics, paired spatial transcriptomics

备注： Accepted to the CVMI Workshop at CVPR 2026. Project page: https://cross-modal-distillation.github.io/

点击查看摘要

Abstract:Spatial transcriptomics provides a molecularly rich description of tissue organization, enabling unsupervised discovery of tissue niches -- spatially coherent regions of distinct cell-type composition and function that are relevant to both biological research and clinical interpretation. However, spatial transcriptomics remains costly and scarce, while H&E histology is abundant but carries a less granular signal. We propose to leverage paired spatial transcriptomics and H&E data to transfer transcriptomics-derived niche structure to a histology-only model via cross-modal distillation. Across multiple tissue types and disease contexts, the distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines trained on identical image features, and recovers biologically meaningful neighborhood composition as confirmed by cell-type analysis. The resulting framework leverages paired spatial transcriptomic and H&E data during training, and can then be applied to held-out tissue regions using histology alone, without any transcriptomic input at inference time.

65. 【2604.09063】Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

链接：https://arxiv.org/abs/2604.09063

作者：Yuxi Zhou,Zhengbo Zhang,Jingyu Pan,Zhiyu Lin,Zhigang Tu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Human action recognition, Human action, Skeleton Action Recognition, computer vision, human-robot interaction

备注：

点击查看摘要

Abstract:Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at this https URL. Project homepage: this https URL

66. 【2604.09062】Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening

链接：https://arxiv.org/abs/2604.09062

作者：Rimsa Goperma,Rojan Basnet,Liang Zhao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Valid segmentation, glaucoma screening, Polar Shape Network, fundus photographs, photographs is essential

备注：

点击查看摘要

Abstract:Valid segmentation of the optic disc (OD) and optic cup (OC) from fundus photographs is essential for glaucoma screening. Unfortunately, existing deep learning methods do not guarantee clinical validity, including star-convexity and the nested structure of OD and OC, which corrupts diagnostic metrics, especially under cross-dataset domain shift. To address this issue, this paper proposes NPS-Net (Nested Polar Shape Network), the first framework that formulates OD/OC segmentation as nested radially monotone polar occupancy estimation. This output representation guarantees the aforementioned clinical validity while achieving high accuracy. Evaluated across seven public datasets, NPS-Net shows strong zero-shot generalization. On RIM-ONE, it maintains 100% anatomical validity and improves Cup Dice by 12.8% absolute over the best baseline, reducing vCDR MAE by over 56%. On PAPILA, it achieves Disc Dice of 0.9438 and Disc HD95 of 2.78 px, an 83% reduction over the best competing method.
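
To make the two shape constraints in this abstract concrete: in polar coordinates around a centre point, star-convexity means occupancy never increases as one moves outward along a ray, and nesting means the cup's occupancy never exceeds the disc's. The sketch below is not the NPS-Net architecture. It is a simple post-hoc projection, assumed here for illustration, that enforces both properties on per-ray occupancy probabilities; all function names are hypothetical.

```python
def running_min(values):
    """Return the running minimum of a sequence, which is non-increasing by construction."""
    out, current = [], float("inf")
    for v in values:
        current = min(current, v)
        out.append(current)
    return out


def project_nested_monotone(disc, cup):
    """Enforce radial monotonicity and cup-inside-disc nesting.

    disc, cup: lists of rays; each ray holds occupancy probabilities sampled
    at increasing radii from a shared centre.
    """
    disc_m = [running_min(ray) for ray in disc]
    # Clamp the monotone cup by the monotone disc; the elementwise minimum of
    # two non-increasing sequences is itself non-increasing.
    cup_m = [
        [min(c, d) for c, d in zip(running_min(c_ray), d_ray)]
        for c_ray, d_ray in zip(cup, disc_m)
    ]
    return disc_m, cup_m


def boundary_index(ray, threshold=0.5):
    """Count radial samples above the threshold; for a monotone ray this is the boundary position."""
    return sum(1 for p in ray if p > threshold)
```

After projection, every ray crosses the threshold at most once, so each angle yields a single boundary radius and the cup boundary cannot lie outside the disc boundary.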

67. 【2604.09059】Learning Vision-Language-Action World Models for Autonomous Driving

链接：https://arxiv.org/abs/2604.09059

作者：Guoqing Wang,Pin Tang,Xiangxuan Ren,Guodongfang Zhao,Bailan Feng,Chao Ma

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：unified multimodal framework, recently achieved notable, achieved notable progress, integrating perception, multimodal framework

备注： Accepted by CVPR2026 findings

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: this https URL

68. 【2604.09057】Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

链接：https://arxiv.org/abs/2604.09057

作者：Junchao Liao,Zhenghao Zhang,Xiangyu Meng,Litao Li,Ziying Zhang,Siyu Zhu,Long Qin,Weizhi Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

关键词：relations remains challenging, recently made strong, made strong progress, plausible motion-sound relations, motion-sound relations remains

备注：

点击查看摘要

Abstract:Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.

69. 【2604.09051】Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

链接：https://arxiv.org/abs/2604.09051

作者：Jiaheng Dai,Huanrong Liu,Tailai Zhou,Tongyu Jia,Qin Liu,Yutong Ban,Zeju Li,Yu Gao,Xin Ma,Qingbiao Li

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：Fine-grained action segmentation, substantial class imbalance, robot-assisted partial nephrectomy, partial nephrectomy requires, visually similar suturing

备注：

点击查看摘要

Abstract:Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
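
Of the metrics listed in this abstract, segmental F1 at an overlap threshold is the least obvious from its name. The sketch below follows the commonly used definition from the temporal action segmentation literature: a predicted segment counts as a true positive if its IoU with an unmatched ground-truth segment of the same label reaches the threshold. This is an assumed, generic implementation, not the SIA-RAPN benchmark code, and the thresholds 10, 25, and 50 are passed as the fractions 0.10, 0.25, and 0.50.

```python
def segments(labels):
    """Collapse frame-level labels into (label, start, end) runs; end is exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def segmental_f1(pred, gt, overlap):
    """Segmental F1 between two frame-level label sequences at an IoU threshold."""
    p_segs, g_segs = segments(pred), segments(gt)
    matched = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment may be claimed by at most one prediction.
        if best_j >= 0 and best_iou >= overlap and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike frame-wise accuracy, this score penalizes over-segmentation: a prediction that flickers between labels produces extra short segments, each of which counts as a false positive.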

70. 【2604.09047】Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design

链接：https://arxiv.org/abs/2604.09047

作者：Mianjie Zheng,Xinquan Yang,Xuefen Liu,Xuguang Li,Kun Tang,He Meng,Linlin Shen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：design relies heavily, prosthetic crown, geometric and biomechanical, biomechanical interface, relies heavily

备注：

点击查看摘要

Abstract:Dental implant abutments serve as the geometric and biomechanical interface between the implant fixture and the prosthetic crown, yet their design relies heavily on manual effort and is time-consuming. Although deep neural networks have been proposed to assist dentists in designing abutments, most existing approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios. To address these limitations, we propose TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design. This framework integrates implant site localization and implant-system-compatible abutment parameter regression into a unified pipeline. Specifically, we introduce an Implant Site Identification Network (ISIN) to automatically localize implant sites and provide this information to the subsequent multi-abutment regression network. We further design a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module, which adaptively calibrates mesh representations using tooth embeddings to enable position-specific feature modulation. Additionally, a System-Prompted Mixture-of-Experts (SPMoE) mechanism leverages implant system prompts to guide expert selection, ensuring system-aware regression. Extensive experiments on a large-scale abutment design dataset show that TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings, validating its effectiveness for fully automated dental implant planning.
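
For readers unfamiliar with feature-wise linear modulation (FiLM), the general operation scales and shifts each feature channel with parameters derived from a conditioning input. The sketch below shows only that generic operation. How TC-FiLM derives its parameters from tooth embeddings is not described in the abstract, so the lookup table `TOOTH_FILM`, its values, and the tooth identifiers are made-up placeholders.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: y[i][c] = gamma[c] * x[i][c] + beta[c]."""
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in features]


# Hypothetical per-tooth (gamma, beta) pairs. In a learned model these would be
# predicted from a tooth embedding rather than stored in a fixed table.
TOOTH_FILM = {
    11: ([1.0, 2.0], [0.0, -1.0]),
    36: ([0.5, 1.0], [1.0, 0.0]),
}


def modulate_for_tooth(features, tooth_id):
    """Apply the tooth-specific scale and shift to every feature row."""
    gamma, beta = TOOTH_FILM[tooth_id]
    return film(features, gamma, beta)
```

The same input features are therefore transformed differently depending on the tooth position, which is the sense in which the modulation is position-specific.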

71. 【2604.09045】Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

链接：https://arxiv.org/abs/2604.09045

作者：Tsuheng Hsu,Guiyu Liu,Juho Kannala,Janne Heikkilä

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent works, visual foundation models, supervise radiance fields, enabling instance-level, radiance fields

备注：

点击查看摘要

Abstract:Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.

72. 【2604.09038】Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments

链接：https://arxiv.org/abs/2604.09038

作者：Xingyu Shao,Zhiqiang Yan,Liangzheng Sun,Mengfan He,Chao Chen,Jinhui Zhang,Chunyu Li,Ziyang Meng

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：changing environmental conditions, Robust geo-localization, geo-localization in changing, changing environmental, environmental conditions

备注：

点击查看摘要

Abstract:Robust geo-localization in changing environmental conditions is critical for long-term aerial autonomy. While visual place recognition (VPR) models perform well when airborne views match the training domain, adapting them to shifting distributions during sequential missions triggers catastrophic forgetting. Existing continual learning (CL) methods often fail here because geographic features exhibit severe intra-class variations. In this work, we formulate aerial VPR as a mission-based domain-incremental learning (DIL) problem and propose a novel heterogeneous memory framework. To respect strict onboard storage constraints, our "Learn-and-Dispose" pipeline decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features). We introduce a spatially-constrained allocation strategy that optimizes buffer selection based on sample difficulty or feature space diversity. To facilitate systematic assessment, we provide three evaluation criteria and a comprehensive benchmark derived from 21 diverse mission sequences. Extensive experiments demonstrate that our architecture significantly boosts spatial generalization; our diversity-driven buffer selection outperforms the random baseline by 7.8% in knowledge retention. Unlike class-mean preservation methods that fail in unstructured environments, maximizing structural diversity achieves a superior plasticity-stability balance and ensures order-agnostic robustness across randomized sequences. These results prove that maintaining structural feature coverage is more critical than sample difficulty for resolving catastrophic forgetting in lifelong aerial autonomy.
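
The abstract does not specify how diversity-driven buffer selection is computed. One standard way to pick a fixed-size subset that covers feature space is greedy farthest-point sampling, sketched below as a stand-in technique rather than the authors' method. The spatial constraint mentioned in the abstract is not modelled, and the name `select_diverse` is hypothetical.

```python
import math


def select_diverse(features, budget):
    """Greedily choose up to `budget` indices whose feature vectors are mutually far apart."""
    if budget <= 0 or not features:
        return []
    chosen = [0]  # arbitrary seed: start from the first sample
    # Distance from every sample to its nearest already-chosen sample.
    nearest = [math.dist(f, features[0]) for f in features]
    while len(chosen) < min(budget, len(features)):
        idx = max(range(len(features)), key=nearest.__getitem__)
        if nearest[idx] == 0.0:
            break  # only duplicates of chosen samples remain
        chosen.append(idx)
        nearest = [
            min(d, math.dist(f, features[idx])) for d, f in zip(nearest, features)
        ]
    return chosen
```

With a small replay budget, this kind of selection skips near-duplicate samples and keeps ones that extend coverage, in contrast to a uniformly random buffer.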

73. 【2604.09037】SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

Link: https://arxiv.org/abs/2604.09037

Authors: Xiyang Huang,Jiawei Lin,Keying Wu,Jiaxin Huang,Kailai Yang,Renxiong Wei,Cheng zeng,Jiayi Xiang,Ziyan Kuang,Min Peng,Qianqian Xie,Sophia Ananiadou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: large language models, multimodal large language, harder capability required, language models, focus on event

Comments:

Click to view abstract

74. 【2604.09030】NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Multi-Exposure Image Fusion in Dynamic Scenes (Track 2)

Link: https://arxiv.org/abs/2604.09030

Authors: Lishen Qu,Yao Liu,Jie Liang,Hui Zeng,Wen Dai,Guanyi Qin,Ya-nan Guan,Shihao Zhou,Jufeng Yang,Lei Zhang,Radu Timofte,Xiyuan Yuan,Wanjie Sun,Shihang Li,Bo Zhang,Bin Chen,Jiannan Lin,Yuxu Chen,Qinquan Gao,Tong Tong,Song Gao,Jiacong Tang,Tao Hu,Xiaowen Ma,Qingsen Yan,Sunhan Xu,Juan Wang,Xinyu Sun,Lei Qi,He Xu,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper presents NTIRE, Image Model, presents NTIRE, Restore Any Image, multi-exposure image fusion

Comments: Accepted by CVPRW 2026

Click to view abstract

Abstract:This paper presents NTIRE 2026, the 3rd Restore Any Image Model (RAIM) challenge on multi-exposure image fusion in dynamic scenes. We introduce a benchmark that targets a practical yet difficult HDR imaging setting, where exposure bracketing must be fused under scene motion, illumination variation, and handheld camera jitter. The challenge data contains 100 training sequences with 7 exposure levels and 100 test sequences with 5 exposure levels, reflecting real-world scenarios that frequently cause misalignment and ghosting artefacts. We evaluate submissions with a leaderboard score derived from PSNR, SSIM, and LPIPS, while also considering perceptual quality, efficiency, and reproducibility during the final review. This track attracted 114 participating teams and received 987 submissions. The winning methods significantly improved the ability to remove artifacts from multi-exposure fusion and recover fine details. The dataset and the code of each team can be found at the repository: this https URL.

75. 【2604.09025】Skill-Conditioned Visual Geolocation for Vision-Language

Link: https://arxiv.org/abs/2604.09025

Authors: Chenjie Yang,Yutian Jiang,Chenyu Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: lack structured geographic, Vision-language models, ability in image, lack structured, Vision-language

Comments:

Click to view abstract

Abstract:Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.

76. 【2604.09024】Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Link: https://arxiv.org/abs/2604.09024

Authors: Zedian Shao,Hongbin Liu,Yuepeng Hu,Neil Zhenqiang Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: Multi-modal large language, Internet-scale image data, offering significant benefits, analyzing Internet-scale image, raising critical safety

Comments: Appeared in ACL 2026 main conference

Click to view abstract

Abstract:Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.

77. 【2604.09023】CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection

Link: https://arxiv.org/abs/2604.09023

Authors: Jiahua Pang,Ying Li,Dongpu Cao,Jingcai Luo,Yanuo Zheng,Bao Yunfan,Yujie Lei,Rui Yuan,Yuxi Tian,Guojin Yuan,Hongchang Chen,Zhi Zheng,Yongchun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: manufacturing quality assessment, Multi-task visual anomaly, visual anomaly detection, car-related manufacturing quality, car-related multi-task visual

Comments:

Click to view abstract

Abstract:Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill this gap, we present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100 images spanning 7 vehicle domains and 3 tasks, giving models a comprehensive view of car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning (MTL), and it incorporates synthetic data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show that MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.

78. 【2604.09022】BlendFusion -- Scalable Synthetic Data Generation for Diffusion Model Training

Link: https://arxiv.org/abs/2604.09022

Authors: Thejas Venkatesh,Suguna Varshini Velury

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Model Autophagy Disorder, synthetic data generation, diffusion models, rapid adoption, promising approach

Comments:

Click to view abstract

Abstract:With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.

79. 【2604.09018】Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion

Link: https://arxiv.org/abs/2604.09018

Authors: Seungjin Jung,Yonghyun Jeong,Minha Kim,Jimin Min,Youngjoon Yoo,Jongwon Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited dataset diversity, handle unseen visual, secure face recognition, face recognition systems, Generative Adversarial Network

Comments: The published version is available at DOI: [this https URL](https://doi.org/10.1016/j.patcog.2026.113640)

Click to view abstract

Abstract:Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing it to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting to facial features. Our extensive experiments validate PCGAN's effectiveness in domain generalization and in detecting partial attacks, yielding a substantial improvement in facial recognition security.

80. 【2604.09009】Robust by Design: A Continuous Monitoring and Data Integration Framework for Medical AI

Link: https://arxiv.org/abs/2604.09009

Authors: Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Chandra Mohan,Hien Van Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic clinical environments, clinical environments due, Adaptive medical, face performance drops, drops in dynamic

Comments: Accepted at IEEE ISBI 2026. Chandra Mohan and Hien Van Nguyen jointly supervised this work

Click to view abstract

Abstract:Adaptive medical AI models often face performance drops in dynamic clinical environments due to data drift. We propose an autonomous continuous monitoring and data integration framework that maintains robust performance over time. Focusing on glomerular pathology image classification (proliferative vs. non-proliferative lupus nephritis), our three-stage method uses multi-metric feature analysis and Monte Carlo dropout-based uncertainty gating to decide when to retrain on new data. Only images that are statistically similar to the training distribution (under Euclidean, cosine, and Mahalanobis metrics) and have low predictive entropy are integrated. The model is then incrementally retrained with these images under strict performance safeguards (no metric degradation greater than 5%). In experiments with a ResNet18 ensemble on a multi-center dataset, the framework prevents performance degradation: new images were added without significant change in AUC (~0.92) or accuracy (~89%). This approach addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI.
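
A minimal sketch of the gating idea described above: an image is integrated only if its feature lies close to the training distribution and the Monte Carlo dropout predictions have low entropy. The thresholds, the use of a single training centroid, and all function names are assumptions for illustration; the Mahalanobis check is omitted to keep the sketch dependency-free.

```python
import math

def predictive_entropy(mc_probs):
    """Entropy (in nats) of the class distribution averaged over stochastic passes."""
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / len(mc_probs) for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def accept_for_retraining(feature, mc_probs, centroid,
                          max_euclid=1.0, max_cosine=0.1, max_entropy=0.3):
    """Hypothetical gate: near the training centroid AND low predictive entropy."""
    close = (math.dist(feature, centroid) <= max_euclid
             and cosine_distance(feature, centroid) <= max_cosine)
    return close and predictive_entropy(mc_probs) <= max_entropy

centroid = (1.0, 1.0)                                   # toy training-feature mean
confident = [(0.95, 0.05), (0.97, 0.03), (0.96, 0.04)]  # three dropout passes
uncertain = [(0.55, 0.45), (0.40, 0.60), (0.50, 0.50)]

print(accept_for_retraining((1.1, 0.9), confident, centroid))  # similar, confident
print(accept_for_retraining((1.1, 0.9), uncertain, centroid))  # similar, uncertain
print(accept_for_retraining((5.0, 0.0), confident, centroid))  # far from training data
```

Requiring both conditions is the conservative choice: a confident prediction on an out-of-distribution image is rejected just like an uncertain prediction on a familiar one.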

81. 【2604.09000】StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

Link: https://arxiv.org/abs/2604.09000

Authors: Junxi Wang,Te Sun,Jiayi Zhu,Junxian Li,Haowen Xu,Zichen Wen,Xuming Hu,Zhiyu Li,Linfeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision agent memory, shown remarkable effectiveness, streaming video understanding, Vision agent, Stream Agent Memory

Comments: ACL 2026 Findings

Click to view abstract

Abstract:Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for isolated nodes and edge-aware weight pruning for connected nodes, evicting redundant memory nodes while maintaining accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87× speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at this https URL.
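
The abstract does not specify how the time-decay retrieval is scored. One common way to realize such a mechanism, shown here purely as an assumed sketch, is to multiply a similarity score by an exponential decay in the age of each memory node; the `decay` rate, the tuple layout, and the function name are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(memory, query, now, decay=0.01, top_k=2):
    """memory: list of (timestamp, embedding, text); returns the top_k texts.

    Score = cosine similarity * exp(-decay * age). This is an assumed form,
    not the scoring rule used by StreamMeCo.
    """
    scored = []
    for timestamp, embedding, text in memory:
        score = cosine(query, embedding) * math.exp(-decay * (now - timestamp))
        scored.append((score, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:top_k]]

memory = [
    (0.0, (1.0, 0.0), "old match"),
    (90.0, (1.0, 0.0), "recent match"),
    (95.0, (0.0, 1.0), "recent mismatch"),
]
results = retrieve(memory, query=(1.0, 0.0), now=100.0)
print(results)
```

In this toy run the two matching memories outrank the mismatch, and the more recent of the two is ranked first.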

82. 【2604.08995】Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Link: https://arxiv.org/abs/2604.08995

Authors: Zile Wang,Zexiang Liu,Jaixing Li,Kaichen Huang,Baixin Xu,Fei Kang,Mengyin An,Peiyu Wang,Biao Jiang,Yichen Wei,Yidan Xietian,Jiangbo Pei,Liang Hu,Boyi Jiang,Hua Xue,Zidong Wang,Haofeng Sun,Wei Li,Wanli Ouyang,Xianglong He,Yang Liu,Yangguang Li,Yahui Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly demonstrated, demonstrated their potential, model, generation, interactive video generation

Comments: Project page: [this https URL](https://matrix-game-v3.github.io/)

Click to view abstract

Abstract:With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.

83. 【2604.08991】PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Link: https://arxiv.org/abs/2604.08991

Authors: Zhiyu Zhou,Peilin Liu,Ruoxuan Zhang,Luyang Zhang,Cheng Zhang,Hongxia Xie,Wen-Huang Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal large language, Small object-centric spatial, large language models, object-centric spatial understanding, indoor videos remains

Comments:

Click to view abstract

Abstract:Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at this https URL.

84. 【2604.08990】ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

Link: https://arxiv.org/abs/2604.08990

Authors: Shifeng Liu,Zhengye Zhang,Sirui Zhao,Xinglong Mao,Zhehan Kan,Zhixiang Wei,Shiwei Wu,Chaoyou Fu,Tong Xu,Enhong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, reasoning-based affect understanding

Comments: 10 pages, 7 figures

Click to view abstract

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.

85. 【2604.08966】How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

Link: https://arxiv.org/abs/2604.08966

Authors: Shengji Jin,Yuanhao Zou,Victor Zhu,Zhengping Ji,Chen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, advanced Video Temporal, Multimodal Large, Large Language

Comments: CVPR 2026 Workshop Paper

Click to view abstract

Abstract:While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
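
To make the difference between the first two output paradigms concrete, here is a rough sketch of how a time span might be serialized and parsed under each. The string formats, the `<T_k>` token naming, and the bin count are assumptions rather than the formats used in the study; the third paradigm (continuous decoding) would replace both with a regression head and is not shown.

```python
import re

NUM_BINS = 100  # hypothetical vocabulary of temporal tokens <T_0> .. <T_99>

def encode_text_numeral(start, end):
    """Text Numeral Generation: timestamps are written out as ordinary digits."""
    return f"{start:.1f} - {end:.1f} seconds"

def decode_text_numeral(text):
    start, end = (float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)[:2])
    return start, end

def encode_temporal_tokens(start, end, duration):
    """Temporal Token Generation: each timestamp becomes one discrete token."""
    def to_bin(t):
        return min(NUM_BINS - 1, int(t / duration * NUM_BINS))
    return f"<T_{to_bin(start)}><T_{to_bin(end)}>"

def decode_temporal_tokens(tokens, duration):
    bins = [int(b) for b in re.findall(r"<T_(\d+)>", tokens)]
    # Map each bin index back to the centre time of that bin.
    return tuple((b + 0.5) / NUM_BINS * duration for b in bins[:2])

text_out = encode_text_numeral(12.34, 20.0)
token_out = encode_temporal_tokens(12.34, 20.0, duration=60.0)
print(text_out, decode_text_numeral(text_out))
print(token_out, decode_temporal_tokens(token_out, duration=60.0))
```

The text form costs several generated tokens per timestamp, while the token form costs one token per timestamp but quantizes time to the bin width (0.6 s for a 60 s video here).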

86. 【2604.08965】Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation

Link: https://arxiv.org/abs/2604.08965

Authors: Gadi Hemanth Kumar,Athira Nambiar,Pankaj Bodani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: satellite imagery plays, environmental monitoring, Semantic segmentation, imagery plays, plays a vital

Comments:

Click to view abstract

Abstract:Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large-scale, high-resolution satellite datasets is costly and time-consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by intelligently selecting the most informative samples for annotation with the help of a human in the loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large-scale or resource-constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, leading to bias in the system. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active Learning (DCAU-AL), which prioritizes sample selection based on real-time class-wise performance gaps, thereby mitigating the class-imbalance issue. The proposed DCAU-AL mechanism continuously tracks per-class segmentation performance and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.
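
The class-aware acquisition idea can be sketched as follows. The weighting rule (weights proportional to `1 - IoU`) and the per-image score are assumptions chosen for illustration, since the abstract does not give the exact DCAU-AL formula; all names and numbers are hypothetical.

```python
def class_weights(per_class_iou, eps=1e-6):
    """Give more weight to classes with a larger performance gap (1 - IoU)."""
    gaps = [1.0 - iou + eps for iou in per_class_iou]
    total = sum(gaps)
    return [g / total for g in gaps]

def acquisition_score(class_uncertainty, weights):
    """class_uncertainty[c]: mean uncertainty over pixels predicted as class c."""
    return sum(w * u for w, u in zip(weights, class_uncertainty))

def select_for_annotation(pool, per_class_iou, budget):
    """Rank unlabeled tiles by class-weighted uncertainty and keep the top ones."""
    weights = class_weights(per_class_iou)
    ranked = sorted(pool, key=lambda name: acquisition_score(pool[name], weights),
                    reverse=True)
    return ranked[:budget]

per_class_iou = [0.9, 0.8, 0.2]  # class 2 is currently underperforming
pool = {
    "tile_a": [0.6, 0.1, 0.1],   # uncertain mostly on the well-learned class 0
    "tile_b": [0.1, 0.1, 0.6],   # uncertain mostly on the weak class 2
    "tile_c": [0.2, 0.2, 0.2],
}
chosen = select_for_annotation(pool, per_class_iou, budget=2)
print(chosen)
```

Recomputing `per_class_iou` after every training round is what would make the weights dynamic: as a weak class improves, its weight shrinks and the selection shifts elsewhere.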

87. 【2604.08956】Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

Link: https://arxiv.org/abs/2604.08956

Authors: Harshith Kethavath,Weiming Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: image pretraining corpora, remote sensing imagery, sensing imagery presents, Adapting vision-language models, fundamental challenge

Comments: 10 pages, 6 figures, to be published in EarthVision @ CVPR 2026

Click to view abstract

Abstract:Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP's natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.
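
For readers unfamiliar with the metric quoted throughout this abstract, mean intersection-over-union (mIoU) can be computed as in the short sketch below. It operates on flat lists of class labels and skips classes absent from both prediction and target; that skipping convention and the toy labels are assumptions, and benchmark implementations may differ in such details.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never appear
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]    # toy per-pixel predictions
target = [0, 1, 1, 1, 2, 0]  # toy per-pixel ground truth
score = mean_iou(pred, target, num_classes=3)
print(score)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Because the score is an average over classes, a drop on one spectrally ambiguous class can be hidden by gains on the others, which is the masking effect the abstract warns about.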

88. 【2604.08945】TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches

Link: https://arxiv.org/abs/2604.08945

Authors: Langzhe Gu,Hung-Jui Huang,Mohamad Qadri,Michael Kaess,Wenzhen Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: including robotic manipulation, downstream tasks, including robotic, estimation is essential, robotic manipulation

Comments: Project Page: [this https URL](https://grange007.github.io/touchanything)

Click to view abstract

Abstract:Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is this https URL.

89. 【2604.08943】MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video

Link: https://arxiv.org/abs/2604.08943

Authors: Haoyu Zhu,Yi Zhang,Lei Yao,Lap-pui Chau,Yi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: egocentric monocular videos, monocular videos remains, Reconstructing high-fidelity, hand-object interactions, egocentric monocular

Comments: This paper has been accepted to CVM 2026 Journal Track and is under consideration for publication in IEEE TVCG

Click to view abstract

Abstract:Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. First, we introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion, which initializes high-resolution 2D Gaussian surfels from coarse parametric hand meshes and provides a surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.
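
As background for the mesh-to-surfel step, the Steiner inellipse of a triangle is the unique ellipse that touches all three sides at their midpoints and is centred at the centroid. The sketch below computes it for one mesh triangle using the standard parametrization; how MASS maps the ellipse to Gaussian surfel parameters is not stated in the abstract and is not attempted here. Function names are hypothetical.

```python
import math

def steiner_inellipse(a, b, c):
    """Return (center, u, v) such that center + u*cos(t) + v*sin(t) traces the
    Steiner inellipse of triangle abc (vertices given as equal-length tuples)."""
    center = tuple((x + y + z) / 3.0 for x, y, z in zip(a, b, c))
    u = tuple((x - g) / 2.0 for x, g in zip(a, center))
    v = tuple((y - z) / (2.0 * math.sqrt(3.0)) for y, z in zip(b, c))
    return center, u, v

def point_on_ellipse(center, u, v, t):
    return tuple(g + p * math.cos(t) + q * math.sin(t)
                 for g, p, q in zip(center, u, v))

# One triangle of a toy mesh, embedded in 3D.
a, b, c = (0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 3.0, 0.0)
center, u, v = steiner_inellipse(a, b, c)

# The ellipse passes through the three edge midpoints.
mid_bc = point_on_ellipse(center, u, v, math.pi)
mid_ab = point_on_ellipse(center, u, v, math.pi / 3.0)
mid_ac = point_on_ellipse(center, u, v, -math.pi / 3.0)
print(center, mid_bc, mid_ab, mid_ac)
```

The vectors `u` and `v` are conjugate semi-diameters rather than principal axes, so a surfel representation that needs an orthogonal frame would require an extra decomposition step.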

90. 【2604.08936】M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

Link: https://arxiv.org/abs/2604.08936

Authors: Yihang Liu,Ying Wen,Jiaxiong Yang,Longzhen Yang,Lianghua He,Heng Tao Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: downstream clinical tasks, generalize effectively, learn universal representations, multimodal medical images, clinical tasks

Comments:

Click to view abstract

Abstract:Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity: they blend multimodal representations in a single embedding space, which degrades modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised MFM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representations into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.

91. 【2604.08924】Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

Link: https://arxiv.org/abs/2604.08924

Authors: Zengyi Yang, Yu Liu, Juan Cheng, Zhiqin Zhu, Yafei Zhang, Huafeng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared-visible image fusion, robust visual understanding, integrate complementary information, existing fusion methods, fusion methods struggle

Notes: This paper has been accepted by CVPR 2026

Abstract:Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at this https URL.

92. 【2604.08922】Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios

Link: https://arxiv.org/abs/2604.08922

Authors: Yu Shi, Yu Liu, Zhong-Cheng Wu, Juan Cheng, Huafeng Li, Xun Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real world image, limiting the performance, low resolution, resolution are typical, real world

Notes: Accepted by CVPR 2026

Abstract:Complex degradations like noise, blur, and low resolution are typical challenges in real-world image fusion tasks, limiting the performance and practicality of existing methods. End-to-end neural-network-based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion-based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.

93. 【2604.08921】TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

Link: https://arxiv.org/abs/2604.08921

Authors: Ao Li, Yonggen Ling, Yiyang Lin, Yuji Wang, Yong Deng, Yansong Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: technology enabling robots, critical technology enabling, safe physical interaction, technology enabling, safe physical

Notes:

Abstract:Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts in the egocentric camera's 3D coordinate frame. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users' motion commands and directing the robot's attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localizes the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapts to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: this https URL.
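
To make "quantizing 3D keypoints into a finite interaction space" concrete, here is a minimal sketch of uniform coordinate quantization, where each axis is clamped to a box and mapped to an integer bin that a language model could emit as a token. The bounds, bin count, and function names are illustrative assumptions; the abstract does not describe the actual tokenization scheme.

```python
def quantize(value, lo, hi, bins):
    """Map a coordinate to an integer bin index in [0, bins - 1].

    Values outside [lo, hi] are clamped to the nearest boundary first.
    """
    clamped = min(max(value, lo), hi)
    idx = int((clamped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # value == hi would otherwise land in bin `bins`


def dequantize(idx, lo, hi, bins):
    """Return the center of bin `idx` as a metric coordinate."""
    width = (hi - lo) / bins
    return lo + (idx + 0.5) * width


def keypoint_to_tokens(xyz, bounds, bins):
    """Quantize an (x, y, z) keypoint; `bounds` holds one (lo, hi) pair per axis."""
    return [quantize(v, lo, hi, bins) for v, (lo, hi) in zip(xyz, bounds)]


def tokens_to_keypoint(tokens, bounds, bins):
    """Recover approximate metric coordinates from the bin indices."""
    return [dequantize(i, lo, hi, bins) for i, (lo, hi) in zip(tokens, bounds)]
```

For in-range inputs the round-trip error per axis is at most half a bin width, so the bin count trades vocabulary size against localization precision.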

94. 【2604.08916】MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

Link: https://arxiv.org/abs/2604.08916

Authors: Yibo Zhao, Yigong Zhang, Jin Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: instance segmentation, annotations for supervised, supervised training, instance segmentation methods, limits their scalability

Notes:

Abstract:Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.

95. 【2604.08915】Large-Scale Universal Defect Generation: Foundation Models and Datasets

Link: https://arxiv.org/abs/2604.08915

Authors: Yuanting Fan, Jun Liu, Bin-Bin Gao, Xiaochen Chen, Yuhuan Lin, Zhewei Dai, Jiawei Zhan, Chengjie Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: specific defect categories, defect categories due, defect editing data, paired defect editing, Existing defect

Notes: 25 pages, 13 figures, preprint

Abstract:Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and poor category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and a structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at this https URL.

96. 【2604.08903】Fast Model-guided Instance-wise Adaptation Framework for Real-world Pansharpening with Fidelity Constraints

Link: https://arxiv.org/abs/2604.08903

Authors: Zhiqi Yang, Jin-Liang Xiao, Shan Yin, Liang-Jian Deng, Gemine Vivone

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate high-resolution multispectral, fusing low-resolution multispectral, high-resolution multispectral, low-resolution multispectral, high-resolution panchromatic

Notes:

Abstract:Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images while preserving both spectral and spatial information. Although deep learning (DL)-based pansharpening methods achieve impressive performance, they incur high training costs, require large datasets, and often degrade when the test distribution differs from the training distribution, limiting generalization. Recent zero-shot methods, trained on a single PAN/LRMS pair, offer strong generalization but suffer from limited fusion quality, high computational overhead, and slow convergence. To address these issues, we propose FMG-Pan, a fast and generalizable model-guided instance-wise adaptation framework for real-world pansharpening, achieving both cross-sensor generality and rapid training and inference. The framework leverages a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints. We further design a novel physical fidelity term to enhance spatial detail preservation. Extensive experiments on real-world datasets under both intra- and cross-sensor settings demonstrate state-of-the-art performance. On the WorldView-3 dataset, FMG-Pan completes training and inference for a 512x512x8 image within 3 seconds on an RTX 3090 GPU, significantly faster than existing zero-shot methods, making it suitable for practical deployment.

97. 【2604.08896】GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

Link: https://arxiv.org/abs/2604.08896

Authors: Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen, Naoto Yokoya

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: heterogeneous sensor modalities, wide-ranging disciplinary knowledge, Recent advances, large language models, multimodal large language

Notes: CVPR 2026 Highlight paper

Abstract:Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning--capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.

98. 【2604.08894】Ge$^\text{2}$mS-T: Multi-Dimensional Grouping for Ultra-High Energy Efficiency in Spiking Transformer

Link: https://arxiv.org/abs/2604.08894

Authors: Zecheng Hao, Shenghao Xie, Kang Chen, Wenxuan Liu, Zhaofei Yu, Tiejun Huang

Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Artificial Neural Networks, Spiking Neural Networks, Artificial Neural, Neural Networks, Spiking Vision Transformers

Notes:

Abstract:Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, they encounter significant deficiencies in training and inference metrics when applied to Spiking Vision Transformers (S-ViTs). Existing paradigms, including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP), suffer from inherent limitations, precluding concurrent optimization of memory, accuracy and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture implementing grouped computation across temporal, spatial and network structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, enabling lossless conversion with constant training overhead and precise regulation of spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA) to reduce computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method can achieve superior performance with ultra-high energy efficiency on challenging benchmarks. To the best of our knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability and energy budget in S-ViTs.

99. 【2604.08893】Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)

Link: https://arxiv.org/abs/2604.08893

Authors: Mohsen Yaghoubi Suraki

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multiscale Spatial Attention, requires early detection, Multiscale Spatial, Attention, tumor

Notes:

Abstract:Glioma is a harmful brain tumor that requires early detection to ensure better health outcomes. Early detection of this tumor is key for effective treatment and requires an automated segmentation process. However, finding tumors is challenging because of tumor characteristics such as location and size. Deep learning models, which have shown promising results over the last few years, offer a reliable way to accurately separate tumor zones from healthy tissue. In this research, an Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) is introduced. This model is an innovative combination of adaptive dual residual networks, attention mechanisms, and multiscale spatial attention. The dual adaptive residual network architecture captures high-level semantic and intricate low-level details from brain images, ensuring precise segmentation of different tumor parts, types, and hard regions. The attention gates use gating and input signals to compute attention coefficients for the input features, and multiscale spatial attention generates scaled attention maps and combines these features to hold the most significant information about the brain tumor. We trained the model for 200 epochs using the ReLU activation function on the BraTS 2020 and BraTS 2019 datasets. These improvements resulted in high accuracy for tumor detection and segmentation on BraTS 2020, achieving Dice scores of 0.9229 for the whole tumor, 0.8432 for the tumor core, and 0.8004 for the enhancing tumor.
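
For readers unfamiliar with the metric reported above, the Dice score between a predicted mask and a ground-truth mask is twice the overlap divided by the total number of foreground voxels in both masks. The sketch below is a minimal reference computation on flattened binary masks; the function name and the empty-mask convention are assumptions, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks are foreground.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground, so the reported 0.9229 for the whole tumor indicates a large overlap with the reference annotation.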

100. 【2604.08884】HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

Link: https://arxiv.org/abs/2604.08884

Authors: Xinyu Zhang, Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen, Yuhang Chen, Xiaoya Fan, Chan Tsz Ho, Bi Tianyuan, Haoyuan Liang, Ruifeng Su, Zihao Qian, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal large language, large language models, remains underexplored, Hyperspectral Multimodal Benchmark, made significant strides

Notes:

Abstract:While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB images. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of how different representations affect model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at this https URL.

101. 【2604.08881】Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

Link: https://arxiv.org/abs/2604.08881

Authors: Enyi Shi, Fei Shen, Shuyi Miao, Linxia Zhu, Pengyang Shao, Jinhui Tang, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-Language Large Models, exposing structural blind, easily bypass defenses, bypass defenses designed, structural blind spots

Notes:

Abstract:In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
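
The two stages described above (select a small set of units by contrasting activations, then update only those units) can be sketched in a few lines. This is a toy illustration under stated assumptions: the selection rule (largest gap in mean activation), the function names, and the plain SGD update are hypothetical stand-ins, since the abstract does not give the paper's exact criterion or optimizer.

```python
def select_neurons(harmful_acts, benign_acts, k):
    """Return the indices of the k units whose mean activation differs most.

    harmful_acts and benign_acts are lists of activation vectors (one per input).
    """
    n = len(harmful_acts[0])

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    gaps = [abs(mean(harmful_acts, j) - mean(benign_acts, j)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: gaps[j], reverse=True)
    return set(ranked[:k])


def masked_sgd_step(params, grads, selected, lr):
    """Apply an SGD update only at the selected indices (gradient masking)."""
    return [
        p - lr * g if j in selected else p
        for j, (p, g) in enumerate(zip(params, grads))
    ]
```

Because every unselected parameter is returned unchanged, the update is confined to the chosen subset, which is the property the abstract relies on to preserve the model's other capabilities.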

102. 【2604.08877】Harnessing Weak Pair Uncertainty for Text-based Person Search

Link: https://arxiv.org/abs/2604.08877

Authors: Jintao Sun, Zhedong Zheng, Gangyi Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: text-based person search, natural language description, study the text-based, interest via natural, natural language

Notes: 39 pages, 15 tables, 7 figures

Abstract:In this paper, we study text-based person search, which aims to retrieve the person of interest via a natural language description. Prevailing methods, such as contrastive learning, usually focus on strict one-to-one pair matching between the visual and textual modalities. However, such a paradigm unintentionally disregards the weak positive image-text pairs, which depict the same person but whose text descriptions are annotated from different views (cameras). To make full use of weak positives, we introduce an uncertainty-aware method to explicitly estimate image-text pair uncertainty, and incorporate the uncertainty into the optimization procedure in a smooth manner. Specifically, our method contains two modules: uncertainty estimation and uncertainty regularization. (1) Uncertainty estimation obtains the relative confidence of the given positive pairs; (2) based on the predicted uncertainty, uncertainty regularization adaptively adjusts the loss weight. Additionally, we introduce a group-wise image-text matching loss to further facilitate the representation space among the weak pairs. Compared with existing methods, the proposed method explicitly prevents the model from pushing away potentially weak positive candidates. Extensive experiments on three widely used datasets, i.e., CUHK-PEDES, RSTPReid and ICFG-PEDES, verify mAP improvements of +3.06%, +3.55% and +6.94%, respectively, over existing competitive methods.

103. 【2604.08858】BIAS: A Biologically Inspired Algorithm for Video Saliency Detection

Link: https://arxiv.org/abs/2604.08858

Authors: Zhao-ji Zhang, Ya-tang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: biologically inspired model, continuous video streams, biologically inspired, visual saliency detection, dynamic visual saliency

Notes:

Abstract:We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti--Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.

104. 【2604.08847】DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization

Link: https://arxiv.org/abs/2604.08847

Authors: Xiangyu Li, Yujing Sun, Yuhang Zheng, Yuexin Ma, Kwok-Yan Lam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern media forensics, media forensics, fundamental component, component of modern, modern media

Notes:

Abstract:Deepfake detection has become a fundamental component of modern media forensics. Despite significant progress in detection accuracy, most existing methods remain computationally intensive and parameter-heavy, limiting their deployment on resource-constrained edge devices that require real-time, on-site inference. This limitation is particularly critical in an era where mobile devices are extensively used for media-centric applications, including online payments, virtual meetings, and social networking. Meanwhile, because deepfake detection must capture extremely subtle forgery artifacts, state-of-the-art quantization techniques usually underperform on this challenging task. These fine-grained cues are highly sensitive to model compression and can be easily degraded during quantization, leading to noticeable performance drops. This challenge highlights the need for quantization strategies specifically designed to preserve the discriminative features essential for reliable deepfake detection. To address this gap, we propose DeFakeQ, the first quantization framework tailored for deepfake detectors, enabling real-time deployment on edge devices. Our approach introduces a novel adaptive bidirectional compression strategy that simultaneously leverages feature correlations and eliminates redundancy, achieving an effective balance between model compactness and detection performance. Extensive experiments across five benchmark datasets and eleven state-of-the-art backbone detectors demonstrate that DeFakeQ consistently surpasses existing quantization and model compression baselines. Furthermore, we deploy DeFakeQ on mobile devices in real-world scenarios, demonstrating its capability for real-time deepfake detection and its practical applicability in edge environments.

105. 【2604.08846】Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Link: https://arxiv.org/abs/2604.08846

Authors: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, elicit unsafe responses, Language Models

Notes: Accepted in CVPR 2026. Project page: [this https URL](https://peterljq.github.io/project/daco)

106. 【2604.08836】CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation

Link: https://arxiv.org/abs/2604.08836

Authors: Sanyam Jain, Pragya Kandari, Manit Singhal, He Zhang, Soo Ye Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: seamlessly insert objects, shown remarkable ability, insert objects, shown remarkable, remarkable ability

Notes: CVPR 2026 HiGen Workshop. Project page, [this https URL](https://catalogstitch.github.io)

Abstract:Generative object compositing methods have shown remarkable ability to seamlessly insert objects into scenes. However, when applied to real-world catalog image generation, these methods require tedious manual intervention: users must carefully adjust masks when product dimensions differ, and painstakingly restore occluded elements post-generation. We present CatalogStitch, a set of model-agnostic techniques that automate these corrections, enabling user-friendly content creation. Our dimension-aware mask computation algorithm automatically adapts the target region to accommodate products with different dimensions; users simply provide a product image and background, without manual mask adjustments. Our occlusion-aware hybrid restoration method guarantees pixel-perfect preservation of occluding elements, eliminating post-editing workflows. We additionally introduce CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios, together with supplementary PDF and HTML viewers. We evaluate our techniques with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), demonstrating consistent improvements across diverse catalog scenarios. By reducing manual intervention and automating tedious corrections, our approach transforms generative compositing into a practical, human-friendly tool for production catalog workflows.

107. 【2604.08828】Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning

Link: https://arxiv.org/abs/2604.08828

Authors: Chia-Hong Hsu, Randall Balestriero

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: practitioners trade-off fidelity, Diffusion Models, diversity in Diffusion, Classifier-free Guidance, practitioners trade-off

Notes:

Abstract:Classifier-free Guidance (CFG) lets practitioners trade off fidelity against diversity in Diffusion Models (DMs). The practicality of CFG is, however, hindered by DMs' sampling cost. On the other hand, Consistency Models (CMs) generate images in one or a few steps, but existing guidance methods require knowledge distillation from a separate DM teacher, limiting CFG to Consistency Distillation (CD) methods. We propose Joint Flow Distribution Learning (JFDL), a lightweight alignment method enabling guidance in a pre-trained CM. With a pre-trained CM as an ordinary differential equation (ODE) solver, we verify with normality tests that the variance-exploding noise implied by the velocity fields from unconditional and conditional distributions is Gaussian. In practice, JFDL equips CMs with the familiar adjustable guidance knob, yielding guided images with similar characteristics to CFG. Applied to an original Consistency Trained (CT) CM that could only do conditional sampling, JFDL unlocks guided generation and reduces FID on both CIFAR-10 and ImageNet 64x64 datasets. This is the first time that CMs are able to receive effective guidance post-hoc without a DM teacher, thus bridging a key gap in current methods for CMs.
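
As background for the "guidance knob" mentioned above, standard classifier-free guidance in diffusion models combines an unconditional and a conditional prediction with a scalar weight. The sketch below shows one common parameterization on plain lists; it illustrates ordinary CFG only, not JFDL, and the function name and the choice of weight convention are assumptions (some papers write the same rule with 1 + w in place of w).

```python
def cfg_combine(uncond, cond, w):
    """Blend unconditional and conditional predictions with guidance weight w.

    w = 0 returns the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates past the conditional prediction.
    """
    return [u + w * (c - u) for u, c in zip(uncond, cond)]
```

Raising w pushes samples toward the condition, which typically improves fidelity at the expense of diversity; that is the trade-off the abstract refers to.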

108. 【2604.08819】SenBen: Sensitive Scene Graphs for Explainable Content Moderation

Link: https://arxiv.org/abs/2604.08819

Authors: Fatih Cagatay Akyon, Alptekin Temizel

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: moderation systems classify, systems classify images, lack spatial grounding, Content moderation systems, Visual Genome-style scene

Notes: Accepted at CVPRW 2026

Abstract:Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.

109. 【2604.08815】Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

Link: https://arxiv.org/abs/2604.08815

Authors: Sumra Khan, Sagar Chhabriya, Aizan Zafar, Sheeraz Arif, Amgad Muneer, Anas Zafar, Shaina Raza, Rizwan Qureshi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: weakly grounded conclusions, grounded conclusions due, show strong performance, show strong, radiology tasks

Notes:

Abstract:Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.

110. 【2604.08810】R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII

Link: https://arxiv.org/abs/2604.08810

Authors: Zewei Zhou, Jiajun Zou, Jiajia Zhang, Ao Yang, Ruichao He, Haozheng Zhou, Ao Liu, Jiawei Liu, Leilei Jin, Shan Shen, Daying Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Graph neural networks, controlled evaluation protocols, physical design tasks, Graph neural, inconsistent circuit representations

Comments: Accepted as a poster by CVPR2026

Abstract:Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to $10^6$ nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R$^2$ varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3--4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R$^2 > 0.99$). Code and datasets are available at this https URL.

111. 【2604.08799】MeshOn: Intersection-Free Mesh-to-Mesh Composition

Link: https://arxiv.org/abs/2604.08799

Authors: Hyunwoo Kim, Itai Lang, Hadar Averbuch-Elor, Silvia Sellán, Rana Hanocka

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically realistic compositions, finds physically, physically and semantically, semantically realistic, realistic compositions

Comments: Project page: https://threedle.github.io/MeshOn/

Abstract:We propose MeshOn, a method that finds physically and semantically realistic compositions of two input meshes. Given an accessory, a base mesh with a user-defined target region, and optional text strings for both meshes, MeshOn uses a multi-step optimization framework to realistically fit the meshes onto each other while preventing intersections. We initialize the shapes' rigid configuration via a structured alignment scheme using Vision-to-Language Models, which we then optimize using a combination of attractive geometric losses, and a physics-inspired barrier loss that prevents surface intersections. We then obtain a final deformation of the object, assisted by a diffusion prior. Our method successfully fits accessories of various materials over a breadth of target regions, and is designed to fit directly into existing digital artist workflows. We demonstrate the robustness and accuracy of our pipeline by comparing it with generative approaches and traditional registration algorithms.

112. 【2604.08762】InstrAct: Towards Action-Centric Understanding in Instructional Videos

Link: https://arxiv.org/abs/2604.08762

Authors: Zhuoyi Yang, Jiapeng Yu, Reuben Tan, Boyang Li, Huijuan Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: current Video Foundation, Video Foundation Models, Foundation Models, videos requires recognizing, requires recognizing fine-grained

Abstract:Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos' action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
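
The abstract names Dynamic Time Warping as the basis of its DTW-Align objective. As background only, here is a minimal sketch of classic DTW on two numeric sequences. It is not the paper's video-text alignment loss; the function name `dtw_distance` and the absolute-difference cost are illustrative assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences.

    D[i][j] holds the cheapest cost of aligning a[:i] with b[:j], where each
    step may advance one sequence, the other, or both (monotonic alignment).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # assumed local cost
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# A repeated element in the second sequence is absorbed by the warping path.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # 2.0
```

The property that matters for sequential steps in instructional videos is that the alignment preserves order while tolerating different pacing.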

113. 【2604.08761】State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition

Link: https://arxiv.org/abs/2604.08761

Authors: Bryan Cheng, Austin Jin, Jasper Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: language recognition suffers, achieving high accuracy, small vocabularies collapse, Sign language recognition, catastrophic scaling failure

Comments: 8 pages, 3 figures. Accepted to workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems at ICLR 2026

Abstract:Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages, which are systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.

114. 【2604.08760】SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation

Link: https://arxiv.org/abs/2604.08760

Authors: Ming He, Zhixiang Chen, Steve Maddock

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent progress, input by leveraging, object generation enables, enables the synthesis, synthesis of detailed

Abstract:Recent progress in text-to-3D object generation enables the synthesis of detailed geometry from text input by leveraging 2D diffusion models and differentiable 3D representations. However, the approaches often suffer from limited controllability and texture ambiguity due to the limitation of the text modality. To address this, we present SIC3D, a controllable image-conditioned text-to-3D generation pipeline with 3D Gaussian Splatting (3DGS). There are two stages in SIC3D. The first stage generates the 3D object content from text with a text-to-3DGS generation model. The second stage transfers style from a reference image to the 3DGS. Within this stylization stage, we introduce a novel Variational Stylized Score Distillation (VSSD) loss to effectively capture both global and local texture patterns while mitigating conflicts between geometry and appearance. A scaling regularization is further applied to prevent the emergence of artifacts and preserve the pattern from the style image. Extensive experiments demonstrate that SIC3D enhances geometric fidelity and style adherence, outperforming prior approaches in both qualitative and quantitative evaluations.

115. 【2604.08746】AniGen: Unified $S^3$ Fields for Animatable 3D Asset Generation

Link: https://arxiv.org/abs/2604.08746

Authors: Yi-Hua Huang, Zi-Xin Zou, Yuting He, Chirui Chang, Cheng-Feng Pu, Ziyi Yang, Yuan-Chen Guo, Yan-Pei Cao, Xiaojuan Qi

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied agents, interactive graphics, fundamental to interactive, Animatable, animation production

Comments: 16 pages, 12 figures

Abstract:Animatable 3D assets, defined as geometry equipped with an articulated skeleton and skinning weights, are fundamental to interactive graphics, embodied agents, and animation production. While recent 3D generative models can synthesize visually plausible shapes from images, the results are typically static. Obtaining usable rigs via post-hoc auto-rigging is brittle and often produces skeletons that are topologically inconsistent with the generated geometry. We present AniGen, a unified framework that directly generates animate-ready 3D assets conditioned on a single image. Our key insight is to represent shape, skeleton, and skinning as mutually consistent $S^3$ Fields (Shape, Skeleton, Skin) defined over a shared spatial domain. To enable the robust learning of these fields, we introduce two technical innovations: (i) a confidence-decaying skeleton field that explicitly handles the geometric ambiguity of bone prediction at Voronoi boundaries, and (ii) a dual skin feature field that decouples skinning weights from specific joint counts, allowing a fixed-architecture network to predict rigs of arbitrary complexity. Built upon a two-stage flow-matching pipeline, AniGen first synthesizes a sparse structural scaffold and then generates dense geometry and articulation in a structured latent space. Extensive experiments demonstrate that AniGen substantially outperforms state-of-the-art sequential baselines in rig validity and animation quality, generalizing effectively to in-the-wild images across diverse categories including animals, humanoids, and machinery. Homepage: this https URL

116. 【2604.08741】LPLCv2: An Expanded Dataset for Fine-Grained License Plate Legibility Classification

Link: https://arxiv.org/abs/2604.08741

Authors: Lucas Wojcik, Eduardo A. F. Machoski, Eduil Nascimento Jr., Rayson Laroca, David Menotti

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern Automatic License, License Plate Recognition, Automatic License Plate, Modern Automatic, Plate Recognition

Abstract:Modern Automatic License Plate Recognition (ALPR) systems achieve outstanding performance in controlled, well-defined scenarios. However, large-scale real-world usage remains challenging due to low-quality imaging devices, compression artifacts, and suboptimal camera installation. Identifying illegible license plates (LPs) has recently become feasible through a dedicated benchmark; however, its impact has been limited by its small size and annotation errors. In this work, we expand the original benchmark to over three times the size with two extra capture days, revise its annotations and introduce novel labels. LP-level annotations include bounding boxes, text, and legibility level, while vehicle-level annotations comprise make, model, type, and color. Image-level annotations feature camera identity, capture conditions (e.g., rain and faulty cameras), acquisition time, and day ID. We present a novel training procedure featuring an Exponential Moving Average-based loss function and a refined learning rate scheduler, addressing common mistakes in testing. These improvements enable a baseline model to achieve an 89.5% F1-score on the test set, considerably surpassing the previous state of the art. We further introduce a novel protocol that explicitly addresses camera contamination between training and evaluation splits, where results show a small impact. Dataset and code are publicly available at this https URL.

117. 【2604.08722】AI Driven Soccer Analysis Using Computer Vision

Link: https://arxiv.org/abs/2604.08722

Authors: Adrian Manchado, Tanner Cellio, Jonathan Keane, Yiyang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: inform coaching decisions, enhance team strategies, Sport analysis, coaching decisions, inform coaching

Abstract:Sport analysis is crucial for team performance since it provides actionable data that can inform coaching decisions, improve player performance, and enhance team strategies. To analyze more complex features from game footage, a computer vision model can be used to identify and track key entities from the field. We propose the use of an object detection and tracking system to predict player positioning throughout the game. To translate this to positioning in relation to the field dimensions, we use a point prediction model to identify key points on the field and combine these with known field dimensions to extract actual distances. For the player-identification model, object detection models like YOLO and Faster R-CNN are evaluated on the accuracy of our custom video footage using multiple different evaluation metrics. The goal is to identify the best model for object identification to obtain the most accurate results when paired with SAM2 (Segment Anything Model 2) for segmentation and tracking. For the key point detection model, we use a CNN model to find consistent locations in the soccer field. Through homography, the positions of points and objects in the camera perspective will be transformed to a real-ground perspective. The segmented player masks from SAM2 are transformed from camera perspective to real-world field coordinates through homography, regardless of camera angle or movement. The transformed real-world coordinates can be used to calculate valuable tactical insights including player speed, distance covered, positioning heatmaps, and more complex team statistics, providing coaches and players with actionable performance data previously unavailable from standard video analysis.
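
The homography step described in the abstract can be sketched as follows: a 3x3 matrix maps a pixel position to field coordinates, and player speed follows from two mapped positions. This is a sketch only. The matrix `H`, the pixel positions, the frame gap, and the function names are all hypothetical; in practice `H` would be estimated from detected field key points and known field dimensions.

```python
import math


def apply_homography(H, x, y):
    """Map a pixel (x, y) to field coordinates using a 3x3 homography H."""
    xw = H[0][0] * x + H[0][1] * y + H[0][2]
    yw = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xw / w, yw / w  # divide by the projective scale


def speed_mps(p0, p1, dt):
    """Average speed in metres per second between two field positions."""
    return math.hypot(p1[0] - p0[0], p1[1] - p0[1]) / dt


# Hypothetical homography (not taken from the paper)
H = [[0.05, 0.0, 0.0],
     [0.0, 0.05, 0.0],
     [0.0, 0.0005, 1.0]]

# The same player detected in two frames assumed to be 0.5 s apart
pos_a = apply_homography(H, 400.0, 600.0)
pos_b = apply_homography(H, 440.0, 600.0)
print(pos_a, pos_b, speed_mps(pos_a, pos_b, dt=0.5))
```

Distance covered and heatmaps follow the same pattern: accumulate or bin the mapped positions over time.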

118. 【2604.08719】LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

Link: https://arxiv.org/abs/2604.08719

Authors: Hao Shao, Letian Wang, Yang Zhou, Yuxuan Hu, Zhuofan Zong, Steven L. Waslander, Wei Zhan, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Recent years, open-world scenarios remains, remarkable progress, generalization to long-tail, long-tail and open-world

Abstract:Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.

119. 【2604.08718】Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring

Link: https://arxiv.org/abs/2604.08718

Authors: Xinmiao Xiong, Bangya Liu, Hao Wang, Dayou Li, Nuo Chen, Andrew Feng, Mingyu Ding, Suman Banerjee, Yang Zhou, Zhiwen Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Geometric Foundation Models, recently advanced monocular, advanced monocular SLAM, Foundation Models, providing robust

Abstract:Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines.

120. 【2604.08716】What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction

Link: https://arxiv.org/abs/2604.08716

Authors: Loc-Phat Truong, Meysam Madadi, Sergio Escalera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generative fashion tasks, Latent Diffusion Models, Virtual Try-On, rapid advancements, fashion tasks

Abstract:Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF), which aims to reconstruct the canonical garment from a draped-on image, remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a drop of 9.5% on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.

121. 【2604.08711】Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup

Link: https://arxiv.org/abs/2604.08711

Authors: Vrushank Ahire, Vivek Kurumanghat, Mudasir Ganaie, Lipika Kabiraj

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: involves highly transient, droplets involves highly, highly transient, multi-scale dynamics, involves highly

Abstract:The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying children droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.

122. 【2604.08704】RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data

Link: https://arxiv.org/abs/2604.08704

Authors: Tamir Shor, George Leifman, Genady Beryozkin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracting increasing research, increasing research interest, research interest due, attracting increasing, increasing research

Abstract:Object counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches to counting novel objects that were not seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC, the first Open Vocabulary Counting (OVC) model for remote-sensing and aerial imagery. We show that our model is capable of accurately counting novel object classes that were unseen during training, based solely on textual and/or visual conditioning.

123. 【2604.08701】Unified Multimodal Uncertain Inference

Link: https://arxiv.org/abs/2604.08701

Authors: Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme, Reno Kriz

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Unified Multimodal Uncertain, introduce Unified Multimodal, Multimodal Uncertain Inference, multimodal inference task, Unified Multimodal

Abstract:We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.

124. 【2604.08694】EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition

Link: https://arxiv.org/abs/2604.08694

Authors: Rishabh Gupta, Shravya R. Nalla

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: sign language recognizer, Indian Sign Language, sign language, language recognizer, Sign Language alphabets

Comments: Submitted to IEEE Transactions on Human-Machine Systems

Abstract:How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model that takes EfficientNet-B0 and adds two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves an accuracy of 99.94% (+/-0.05%), which matches ResNet18's 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0's pooling layer) into classical classifiers. SVM achieved an accuracy of 99.63%, Logistic Regression achieved 99.03%, and KNN achieved 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that an attention-enhanced learning model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines.
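
As a quick arithmetic check on the parameter comparison quoted in the abstract (4.2M vs 11.2M), the relative reduction can be computed directly. The helper name `relative_reduction` is just for illustration.

```python
def relative_reduction(baseline, compact):
    """Fraction by which `compact` is smaller than `baseline`."""
    return (baseline - compact) / baseline


# Parameter counts in millions, as quoted in the abstract
reduction = relative_reduction(11.2, 4.2)
print(f"{reduction:.1%}")  # 62.5%
```

The result, 62.5%, is consistent with the "62% fewer parameters" figure in the abstract.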

125. 【2604.08646】InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

Link: https://arxiv.org/abs/2604.08646

Authors: Zhefan Rao, Bin Zou, Haoxuan Che, Xuanhua He, Chong Hou Choi, Yanheng Li, Rui Liu, Qifeng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: control video content, video editing data, video, editing, video editing

Comments: 13 pages, 10 figures

Abstract:Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.

126. 【2604.08645】3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Link: https://arxiv.org/abs/2604.08645

Authors: Makanjuola Ogunleye, Eman Abdelrahman, Ismini Lourentzou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Large multimodal models, Large multimodal, embodied agents operating, ungrounded decisions, multimodal models

Comments: 8 pages, 6 figures, Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
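
The contrast between original and distorted contexts can be illustrated with the logit combination commonly used in visual contrastive decoding, `(1 + alpha) * original - alpha * distorted`. Whether 3D-VCD uses exactly this form is an assumption; the abstract does not give the formula. The vocabulary, logits, and `alpha` below are invented for illustration, and the plausibility filtering often paired with this rule is omitted.

```python
def contrastive_logits(logits_orig, logits_distorted, alpha=1.0):
    """Amplify tokens whose score depends on the undistorted scene evidence.

    Assumed form: (1 + alpha) * original - alpha * distorted.
    """
    return [(1.0 + alpha) * o - alpha * d
            for o, d in zip(logits_orig, logits_distorted)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


vocab = ["chair", "table", "sofa"]        # hypothetical answer tokens
logits_orig = [2.0, 2.2, 0.1]             # model output with the real scene graph
logits_distorted = [0.5, 2.2, 0.1]        # output after perturbing the scene graph

# "table" scores the same with or without scene evidence, which suggests it is
# driven by language priors; "chair" drops once the scene is distorted.
print(vocab[argmax(logits_orig)])                                        # table
print(vocab[argmax(contrastive_logits(logits_orig, logits_distorted))])  # chair
```

Because this only recombines logits at inference time, no retraining is needed, which matches the abstract's framing.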

127. 【2604.08641】On Semiotic-Grounded Interpretive Evaluation of Generative Art

Link: https://arxiv.org/abs/2604.08641

Authors: Ruixiang Jiang, Changwen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)

Keywords: current Generative Art, audiences communicate, Generative Art, essential to deciphering, deciphering the language

Abstract:Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human-GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes - iconic, symbolic, and indexical - yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning-making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation-intensive fine-art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of "pretty" images toward a medium capable of expressing complex human experience. Project page: this https URL.

128. 【2604.08639】VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning

Link: https://arxiv.org/abs/2604.08639

Authors: Rahul D Ray, Utkarsh Srivastava

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: safety critical applications, Uncertainty quantification, deploying deep learning, deep learning models, critical applications

Comments:

Abstract:Uncertainty quantification (UQ) is essential for deploying deep learning models in safety critical applications, yet no consensus exists on which UQ method performs best across different data modalities and distribution shifts. This paper presents a comprehensive benchmark of ten widely used UQ baselines including MC Dropout, SWAG, ensemble methods, temperature scaling, energy based OOD, Mahalanobis, hyperbolic classifiers, ENN, Taylor Sensus, and split conformal prediction against a simplified yet highly effective variant of VOLTA that retains only a deep encoder, learnable prototypes, cross entropy loss, and post hoc temperature scaling. We evaluate all methods on CIFAR 10 (in distribution), CIFAR 100, SVHN, uniform noise (out of distribution), CIFAR 10 C (corruptions), and Tiny ImageNet features (tabular). VOLTA achieves competitive or superior accuracy (up to 0.864 on CIFAR 10), significantly lower expected calibration error (0.010 vs. 0.044 to 0.102 for baselines), and strong OOD detection (AUROC 0.802). Statistical testing over three random seeds shows that VOLTA matches or outperforms most baselines, with ablation studies confirming the importance of adaptive temperature and deep encoders. Our results establish VOLTA as a lightweight, deterministic, and well calibrated alternative to more complex UQ approaches.
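
For readers unfamiliar with the expected calibration error (ECE) figures quoted above, here is a minimal sketch of the usual equal-width-bin estimate. The bin count and the toy predictions are assumptions for illustration; the abstract does not state which binning the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy predictions: four at 95% confidence (three correct), two at 55% (one correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
```

Here the high-confidence bin is overconfident by 0.20 and the other bin by 0.05, so the weighted average is 0.15.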

129. 【2604.08626】WildDet3D: Scaling Promptable 3D Detection in the Wild

Link: https://arxiv.org/abs/2604.08626

Authors: Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding objects, spatial intelligence, cornerstone of spatial, input RGB image, Understanding

Comments:

Abstract:Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).

130. 【2604.08617】From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

Link: https://arxiv.org/abs/2604.08617

Authors: Zhuang Qi, Ying-Peng Tang, Lei Meng, Guoqing Chao, Lei Wu, Han Yu, Xiangxu Meng

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: mitigating catastrophic forgetting, retaining representative samples, federated continual learning, effective strategy, strategy for mitigating

Comments: CVPR 2026 accepted

Abstract:Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a Federated gEometry-Aware correcTion method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model's robustness under class-imbalanced distributions.

131. 【2604.08615】MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

Link: https://arxiv.org/abs/2604.08615

Authors: Xingming Liao, Ning Chen, Muying Shu, Yunpeng Yin, Peijian Zeng, Zhuowei Wang, Nankai Lin, Lianglun Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain under-explored due, environments remain under-explored, remain under-explored, under-explored due, real-world open-water environments

Comments:

Abstract:Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream multimodal large language models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap in realistic, cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at this https URL.

132. 【2604.08613】ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

Link: https://arxiv.org/abs/2604.08613

Authors: Kun Wang, Yupeng Hu, Zhiran Li, Hao Liu, Qianlong Xiang, Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: conjunction with CVPR, Video Saliency, Saliency Prediction held, Video Saliency Prediction, Adaptive Gated Experts

Comments:

Abstract:In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at this https URL.

133. 【2604.08610】A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures

Link: https://arxiv.org/abs/2604.08610

Authors: Riccardo Pallotto, Pierluigi Feliciati, Tiberio Uricchio

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Depth Range Ratio, transforming two-dimensional miniatures, three-dimensional digital models, digital models suitable, extended reality

Comments:

Abstract:This paper presents a semi-automated framework for transforming two-dimensional miniatures from medieval manuscripts into three-dimensional digital models suitable for extended reality (XR), tactile 3D printing, and web-based visualization. We evaluate seven image-to-3D methods (TripoSR, SF3D, SPAR3D, TRELLIS, Wonder3D, SAM 3D, Hi3DGen) on 69 manuscript figures from two collections using rendering-based metrics (Silhouette IoU, LPIPS, CLIP Score) and volumetric measures (Depth Range Ratio, watertight percentage), revealing a trade-off between volumetric expansion and geometric fidelity. Hi3DGen balances topological quality with rich surface detail through its normal bridging approach, making it a good starting point for expert refinement. Our pipeline combines SAM segmentation, Hi3DGen mesh generation, expert refinement in ZBrush, and AI-assisted texturing. Two case studies on Gothic illuminations from the Decretum Gratiani (Vatican Library) and Renaissance miniatures by Giulio Clovio demonstrate applicability across artistic traditions. The resulting models can support WebXR visualization, AR overlay on physical manuscripts, and tactile 3D prints for visually impaired users.

134. 【2604.08609】Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach

Link: https://arxiv.org/abs/2604.08609

Authors: Ponkoj Chandra Shill

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Digital forensic investigations, investigations increasingly rely, forensic investigations increasingly, scanned documents, Digital forensic

Comments: 8 pages, 4 figures

Abstract:Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.

135. 【2604.08598】Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search

Link: https://arxiv.org/abs/2604.08598

Authors: Jiahao Zhang, Shaofei Huang, Yaxiong Wang, Zhedong Zheng

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces inherent limitations, inherent limitations due, stringent privacy constraints, Text-based person search, Text-based person

Comments: Accepted to ACM SIGIR 2026

136. 【2604.08573】Silhouette Loss: Differentiable Global Structure Learning for Deep Representations

Link: https://arxiv.org/abs/2604.08573

Authors: Matheus Vinícius Todescato, Joel Luís Carbonera

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Soft Silhouette Loss, Silhouette Loss, Soft Silhouette, central goal, Learning

Comments:

Abstract:Learning discriminative representations is a central goal of supervised deep learning. While cross-entropy (CE) remains the dominant objective for classification, it does not explicitly enforce desirable geometric properties in the embedding space, such as intra-class compactness and inter-class separation. Existing metric learning approaches, including supervised contrastive learning (SupCon) and proxy-based methods, address this limitation by operating on pairwise or proxy-based relationships, but often increase computational cost and complexity. In this work, we introduce Soft Silhouette Loss, a novel differentiable objective inspired by the classical silhouette coefficient from clustering analysis. Unlike pairwise objectives, our formulation evaluates each sample against all classes in the batch, providing a batch-level notion of global structure. The proposed loss directly encourages samples to be closer to their own class than to competing classes, while remaining lightweight. Soft Silhouette Loss can be seamlessly combined with cross-entropy, and is also complementary to supervised contrastive learning. We propose a hybrid objective that integrates them, jointly optimizing local pairwise consistency and global cluster structure. Extensive experiments on seven diverse datasets demonstrate that: (i) augmenting CE with Soft Silhouette Loss consistently improves over CE and other metric learning baselines; (ii) the hybrid formulation outperforms SupCon alone; and (iii) the combined method achieves the best performance, improving average top-1 accuracy from 36.71% (CE) and 37.85% (SupCon2) to 39.08%, while incurring substantially lower computational overhead. These results suggest that classical clustering principles can be reinterpreted as differentiable objectives for deep learning, enabling efficient optimization of both local and global structure in representation spaces.
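
As background for the abstract above, the classical silhouette coefficient that inspires the loss can be computed as below, treating class labels as clusters. This is the textbook (non-differentiable) score, not the paper's Soft Silhouette Loss; how the authors soften it is not described in the abstract. The function name and toy points are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Classical per-sample silhouette, using class labels as cluster ids."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from sample i to every other sample, grouped by class.
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        own = dists.pop(label, [])
        if not own or not dists:
            # Convention: singleton classes or single-class batches score 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                             # mean intra-class distance
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
well_separated = silhouette_scores(points, [0, 0, 1, 1])
mixed_up = silhouette_scores(points, [0, 1, 0, 1])
```

Scores near 1 mean a sample sits much closer to its own class than to any other; the second labelling, which pairs distant points, yields negative scores.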

137. 【2604.08572】Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection

Link: https://arxiv.org/abs/2604.08572

Authors: Gianluca Guglielmo, Marc Masana

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: detection methods rely, layer activation editing, rely on intermediate, intermediate layer activation, intermediate layer

Comments: Code is available at [https://github.com/gigug/RAS](https://github.com/gigug/RAS)

Abstract:State-of-the-art post-hoc out-of-distribution detection methods rely on intermediate layer activation editing. However, they exhibit inconsistent performance across datasets and models. We show that this instability is driven by differences in the activation distributions, and identify a failure mode of scaling-based methods that arises when penultimate layer activations are not rectified. Motivated by this analysis, we propose Ranked Activation Shift (RAS), a hyperparameter-free post-hoc method that replaces sorted activation magnitudes with a fixed in-distribution reference profile. Our simple plug-and-play method shows strong and consistent performance across datasets and architectures without assumptions on the penultimate layer activation function, and without requiring any hyperparameter tuning, while preserving in-distribution classification accuracy by construction. We further analyze what drives the improvement, showing that both inhibiting and exciting activation shifts independently contribute to better out-of-distribution discrimination.

138. 【2505.21472】Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Link: https://arxiv.org/abs/2505.21472

Authors: Mehrdad Fazli, Bowen Wei, Ahmet Sari, Ziwei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: achieve impressive performance, Large vision-language models, confidently describe objects, Large vision-language, achieve impressive

Comments:

139. 【2604.09468】DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification

Link: https://arxiv.org/abs/2604.09468

Authors: Muazzem Hussain Khan, Tasdid Hasnain, Md. Jamil khan, Ruhul Amin, Md. Shamim Reza, Md. Al Mehedi Hasan, Md Ashad Alam

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep Swin-Vision Transformer-based, Swin-Vision Transformer-based transfer, Swin-Vision Transformer-based, Transformer-based transfer learning, Swin Transformer models

Comments: 25 pages, 9 figures

Abstract:In this study, we propose a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classification. The proposed framework integrates a hierarchical Swin Transformer with ResNet50-based convolutional feature extraction, enabling the model to capture both long-range contextual dependencies and fine-grained local morphological patterns within histopathological images. To validate the proposed architecture, extensive experiments were run on a comprehensive multi-cancer dataset covering Breast Cancer, Oral Cancer, Lung and Colon Cancer, Kidney Cancer, and Acute Lymphocytic Leukemia (ALL); both original and segmented images were analyzed to assess model robustness across heterogeneous clinical imaging conditions. Our approach is benchmarked against several state-of-the-art CNN and transfer models, including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer models. All models were trained and validated using a unified pipeline, incorporating balanced data preprocessing, transfer learning, and fine-tuning strategies. The experimental results show that the proposed architecture consistently achieved superior performance, reaching 100% test accuracy on the lung-colon cancer and segmented leukemia datasets, and up to 99.23% accuracy for breast cancer classification. The model also achieved near-perfect precision, recall, and F1 score, indicating highly stable results across diverse cancer types. Overall, the proposed model establishes a highly accurate, interpretable, and robust multi-cancer classification system, providing a strong benchmark for future research and a unified comparative assessment useful for designing reliable AI-assisted histopathological diagnosis and clinical decision-making.

140. 【2604.09421】Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application

Link: https://arxiv.org/abs/2604.09421

Authors: Junqi Liu, Yun Zhang, Xiaoxia Huang, Long Xu, Weisi Lin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Recognizable Difference, visibility threshold modeling, Feature Extraction Module, boosts coding efficiency, Feature Fusion Module

Comments: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model's capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.

141. 【2604.09370】Cluster-First Labelling: An Automated Pipeline for Segmentation and Morphological Clustering in Histology Whole Slide Images

Link: https://arxiv.org/abs/2604.09370

Authors: Muhammad Haseeb Ahmad, Sharmila Rajendran, Damion Young, Jon Mason

Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring manual boundary, manual boundary delineation, slide images, prohibitively labour-intensive, delineation and classification

Comments: 7 pages, 4 figures

Abstract:Labelling tissue components in histology whole slide images (WSIs) is prohibitively labour-intensive: a single slide may contain tens of thousands of structures--cells, nuclei, and other morphologically distinct objects--each requiring manual boundary delineation and classification. We present a cloud-native, end-to-end pipeline that automates this process through a cluster-first paradigm. Our system tiles WSIs, filters out tiles deemed unlikely to contain valuable information, segments tissue components with Cellpose-SAM (including cells, nuclei, and other morphologically similar structures), extracts neural embeddings via a pretrained ResNet-50, reduces dimensionality with UMAP, and groups morphologically similar objects using DBSCAN clustering. Under this paradigm, a human annotator labels representative clusters rather than individual objects, reducing annotation effort by orders of magnitude. We evaluate the pipeline on 3,696 tissue components across 13 diverse tissue types from three species (human, rat, rabbit), measuring how well unsupervised clusters align with independent human labels via per-tile Hungarian-algorithm matching. Our system achieves a weighted cluster-label alignment accuracy of 96.8%, with 7 of 13 tissue types reaching perfect agreement. The pipeline, a companion labelling web application, and all evaluation code are released as open-source software.

142. 【2604.09321】UHD Low-Light Image Enhancement via Real-Time Enhancement Methods with Clifford Information Fusion

Link: https://arxiv.org/abs/2604.09321

Authors: Xiaohan Wang, Chen Wu, Dawei Zhao, Guangwei Gao, Dianjie Lu, Guijuan Zhang, Linwei Fan, Xu Lu, Shuai Wu, Hang Wei, Zhuoran Zheng

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extremely challenging, UHD low-light enhancement, real-time UHD low-light, Clifford algebra, UHD low-light

Comments:

Abstract:Considering efficiency, ultra-high-definition (UHD) low-light image restoration is extremely challenging. Existing methods based on Transformer architectures or high-dimensional complex convolutional neural networks often suffer from the "memory wall" bottleneck, failing to achieve millisecond-level inference on edge devices. To address this issue, we propose a novel real-time UHD low-light enhancement network based on geometric feature fusion using Clifford algebra in 2D Euclidean space. First, we construct a four-layer feature pyramid with gradually increasing resolution, which decomposes input images into low-frequency and high-frequency structural components via a Gaussian blur kernel, and adopts a lightweight U-Net based on depthwise separable convolution for dual-branch feature extraction. Second, to resolve structural information loss and artifacts from traditional high-low frequency feature fusion, we introduce spatially aware Clifford algebra, which maps feature tensors to a multivector space (scalars, vectors, bivectors) and uses Clifford similarity to aggregate features while suppressing noise and preserving textures. In the reconstruction stage, the network outputs adaptive Gamma and Gain maps, which perform physically constrained non-linear brightness adjustment via Retinex theory. Integrated with FP16 mixed-precision computation and dynamic operator fusion, our method achieves millisecond-level inference for 4K/8K images on a single consumer-grade device, while outperforming state-of-the-art (SOTA) models on several restoration metrics.

143. 【2604.09313】Compositional-Degradation UAV Image Restoration: Conditional Decoupled MoE Network and A Benchmark

Link: https://arxiv.org/abs/2604.09313

Authors: Jinquan Yan, Zhicheng Zhao, Zhengzheng Tu, Chenglong Li, Jin Tang, Bin Luo

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: UAV image restoration, infrastructure inspection, compositional UAV image, large-area mapping, emergency response

Comments:

Abstract:UAV images are critical for applications such as large-area mapping, infrastructure inspection, and emergency response. However, in real-world flight environments, a single image is often affected by multiple degradation factors, including rain, haze, and noise, undermining downstream task performance. Current unified restoration approaches typically rely on implicit degradation representations that entangle multiple factors into a single condition, causing mutual interference among heterogeneous corrections. To this end, we propose DAME-Net, a Degradation-Aware Mixture-of-Experts Network that decouples explicit degradation perception from degradation-conditioned reconstruction for compositional UAV image restoration. Specifically, we design a Factor-wise Degradation Perception module (FDPM) to provide explicit per-factor degradation cues for the restoration stage through multi-label prediction with label-similarity-guided soft alignment, replacing implicit entangled conditions with interpretable and generalizable degradation descriptions. Moreover, we develop a Conditioned Decoupled MoE module (CDMM) that leverages these cues for stage-wise conditioning, spatial-frequency hybrid processing, and mask-constrained decoupled expert routing, enabling selective factor-specific correction while suppressing irrelevant interference. In addition, we construct the Multi-Degradation UAV Restoration benchmark (MDUR), the first large-scale UAV benchmark for compositional UAV image restoration, with 43 degradation configurations from single degradations to four-factor composites and standardized seen/unseen splits. Extensive experiments on MDUR demonstrate consistent improvements over representative unified restoration methods, with greater gains on unseen and higher-order composite degradations. Downstream experiments further validate benefits for UAV object detection.

144. 【2604.09280】AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer

Link: https://arxiv.org/abs/2604.09280

Authors: Gautier Hénique, William Le, Gabriel Dayan, Coralie Brodeur, Kristoff Nelson, Apostolos Christopoulos, Edith Filion, Phuc-Felix Nguyen-Tan, Laurent Letourneau-Guillon, Houda Bahig, Samuel Kadoury

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Extranodal extension, emerging prognostic factor, HPV-positive OPC staging, human papillomavirus, oropharyngeal cancer

Comments:

Abstract:Extranodal extension (ENE) is an emerging prognostic factor in human papillomavirus (HPV)-associated oropharyngeal cancer (OPC), although it is currently omitted as a clinical staging criterion. Recent works have advocated for the inclusion of imaging-detected ENE (iENE) as a prognostic marker in HPV-positive OPC staging. However, several practical limitations continue to hinder its clinical integration, including inconsistencies in segmentation, low contrast in the periphery of metastatic lymph nodes on CT imaging, and laborious manual annotations. To address these limitations, we propose a fully automated end-to-end pipeline that uses computed tomography (CT) images with clinical data to assess the status of nodal ENE and predict treatment outcomes. Our approach includes a hierarchical 3D semi-supervised segmentation model designed to detect and delineate relevant iENE from radiotherapy planning CT scans. From these segmentations, a set of radiomics and deep features is extracted to train an iENE grading classifier. The predicted ENE status is then evaluated for its prognostic value and compared with existing staging criteria. Furthermore, we integrate these nodal features with primary tumor characteristics in a multimodal, attention-based outcome prediction model, providing a dynamic framework for outcome prediction. Our method is validated in an internal cohort of 397 HPV-positive OPC patients treated with radiation therapy or chemoradiotherapy between 2009 and 2020. For outcome prediction at the 2-year mark, our pipeline surpassed baseline models with 88.2% (4.8) in AUC for metastatic recurrence, 79.2% (7.4) for overall survival, and 78.1% (8.6) for disease-free survival. We also obtain a concordance index of 83.3% (6.5) for metastatic recurrence, 71.3% (8.9) for overall survival, and 70.0% (8.1) for disease-free survival, making it feasible for clinical decision making.

145. 【2604.09227】Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models

Link: https://arxiv.org/abs/2604.09227

Authors: Wongi Jeong, Hoigi Seo, Se Young Chun

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: yield exquisite high-resolution, exquisite high-resolution, ranging from general, professional designers, indispensable tools

Comments:

Abstract:Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3× speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.

146. 【2604.08868】MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification

Link: https://arxiv.org/abs/2604.08868

Authors: Mohammed Maaz Sibhai, Abedalrhman Alkhateeb, Saad B. Ahmed

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: safe clinical integration, dependable uncertainty quantification, Medical Vision Transformers, require dependable uncertainty, ensure safe clinical

Comments:

Click to view abstract

Abstract:To ensure safe clinical integration, deep learning models must provide more than just high accuracy; they require dependable uncertainty quantification. While current Medical Vision Transformers perform well, they frequently struggle with overconfident predictions and a lack of transparency, issues that are magnified by the noisy and imbalanced nature of clinical data. To address this, we enhance the modified Medical Transformer (MedFormer) with prototype-based learning and uncertainty-guided routing. By utilizing a Dirichlet distribution for per-token evidential uncertainty, our framework can quantify and localize ambiguity in real time. This uncertainty is not just an output but an active participant in the training process, filtering out unreliable feature updates. Furthermore, the use of class-specific prototypes ensures the embedding space remains structured, allowing for decisions based on visual similarity. Testing across four modalities (mammography, ultrasound, MRI, and histopathology) confirms that our approach significantly enhances model calibration, reducing expected calibration error (ECE) by up to 35%, and improves selective prediction, even when accuracy gains are modest.
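
As background for the ECE figure quoted above, here is a minimal sketch of how expected calibration error is commonly computed. It assumes equal-width confidence bins of the form [lower, upper) with the top bin closed at 1.0 and a default of 10 bins; the abstract does not state the binning the authors used, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin share) * |bin accuracy - bin mean confidence|.

    confidences -- predicted probability of the chosen class, each in [0, 1]
    correct     -- 1 if the prediction was right, 0 otherwise
    n_bins      -- number of equal-width bins (assumed default, not from the paper)
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have equal length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Per bin: [sample count, sum of confidences, number of correct predictions]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1 if ok else 0

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            accuracy = hits / count
            mean_conf = conf_sum / count
            ece += (count / n) * abs(accuracy - mean_conf)
    return ece
```

For example, four predictions made at 0.9 confidence with three of them correct give an ECE of about 0.15, since the bin accuracy of 0.75 falls 0.15 below the stated confidence.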

147. 【2604.08781】PSIRNet: Deep Learning-based Free-breathing Rapid Acquisition Late Enhancement Imaging

Link: https://arxiv.org/abs/2604.08781

Authors: Arda Atalik, Hui Xue, Rhodri H. Davies, Thomas A. Treibel, Daniel K. Sodickson, Michael S. Hansen, Peter Kellman

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)

Keywords: late gadolinium enhancement, phase-sensitive inversion recovery, cardiac MRI, free-breathing phase-sensitive inversion, MOCO PSIR

Comments: 25 pages, 5 figures, 4 tables

Click to view abstract

Abstract:Purpose: To develop and evaluate a deep learning (DL) method for free-breathing phase-sensitive inversion recovery (PSIR) late gadolinium enhancement (LGE) cardiac MRI that produces diagnostic-quality images from a single acquisition over two heartbeats, eliminating the need for 8 to 24 motion-corrected (MOCO) signal averages. Materials and Methods: Raw data comprising 800,653 slices from 55,917 patients, acquired on 1.5T and 3T scanners across multiple sites from 2016 to 2024, were used in this retrospective study. Data were split by patient: 640,000 slices (42,822 patients) for training and the remainder for validation and testing, without overlap. The training and testing data were from different institutions. PSIRNet, a physics-guided DL network with 845 million parameters, was trained end-to-end to reconstruct PSIR images with surface coil correction from a single interleaved IR/PD acquisition over two heartbeats. Reconstruction quality was evaluated using SSIM, PSNR, and NRMSE against MOCO PSIR references. Two expert cardiologists performed an independent qualitative assessment, scoring image quality on a 5-point Likert scale across bright blood, dark blood, and wideband LGE variants. Paired superiority and equivalence (margin = 0.25 Likert points) were tested using exact Wilcoxon signed-rank tests at a significance level of 0.05 using R version 4.5.2. Results: Both readers rated single-average PSIRNet reconstructions superior to MOCO PSIR for dark blood LGE (conservative P = .002); for bright blood and wideband, one reader rated it superior and the other confirmed equivalence (all P .001). Inference required approximately 100 msec per slice versus more than 5 sec for MOCO PSIR. Conclusion: PSIRNet produces diagnostic-quality free-breathing PSIR LGE images from a single acquisition, enabling 8- to 24-fold reduction in acquisition time.
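
Two of the reconstruction metrics named in the abstract above, PSNR and NRMSE, can be sketched as follows. This is an illustration on flat lists of pixel values, not the authors' evaluation code. NRMSE has several normalisation conventions; dividing by the root-mean-square of the reference is an assumption here, as is the `data_range` default of 1.0.

```python
import math


def mse(reference, reconstruction):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)


def psnr(reference, reconstruction, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the maximum possible pixel value."""
    err = mse(reference, reconstruction)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / err)


def nrmse(reference, reconstruction):
    """RMSE divided by the RMS of the reference (assumed normalisation)."""
    rms_ref = math.sqrt(sum(r * r for r in reference) / len(reference))
    if rms_ref == 0:
        raise ValueError("reference is all zeros; NRMSE is undefined")
    return math.sqrt(mse(reference, reconstruction)) / rms_ref
```

Higher PSNR and lower NRMSE both indicate a reconstruction closer to the reference, which in the study above is the MOCO PSIR image.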

Cite as: arXiv:2604.08781 [eess.IV] (this version: arXiv:2604.08781v1)

DOI: https://doi.org/10.48550/arXiv.2604.08781 (arXiv-issued DOI via DataCite, pending registration)

Submission history: [v1] Thu, 9 Apr 2026 21:31:48 UTC (989 KB), from Arda Atalik