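This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

639 papers were updated today, including:

Natural language processing: 103 papers
Information retrieval: 15 papers
Computer vision: 120 papers

Natural Language Processing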

1. 【2604.19716】Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

Link: https://arxiv.org/abs/2604.19716

Authors: Feihao Fang, My T. Thai, Yuanyuan Lei

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, struggle with multi-step, multi-step logical reasoning

Notes: Accepted to ACL 2026

Abstract: Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers an LLM's reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.

2. 【2604.19699】Epistemic orientation in parliamentary discourse is associated with deliberative democracy

Link: https://arxiv.org/abs/2604.19699

Authors: Segun Aroyehun, Stephan Lewandowsky, David Garcia

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: evidence-based reasoning grounded, intuition-based reasoning rooted, reflects varying epistemic, discourse reflects varying, varying epistemic orientations

Notes:

Abstract:The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence--Minus--Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.

3. 【2604.19685】An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA

Link: https://arxiv.org/abs/2604.19685

Authors: Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee

Categories: Computation and Language (cs.CL)

Keywords: questions remains challenging, requires synthesis, factual retrieval, single response, remains challenging

Notes: Paper accepted at ACL Findings 2026

Abstract:Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.

4. 【2604.19678】Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

Link: https://arxiv.org/abs/2604.19678

Authors: Nurkhan Laiyk, Gerard I. Gállego, Javier Ferrando, Fajri Koto

Categories: Computation and Language (cs.CL)

Keywords: Function vectors, in-context learning, activations during in-context, model activations, Function

Notes:

Abstract:Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English$\rightarrow$Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.

5. 【2604.19667】Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Link: https://arxiv.org/abs/2604.19667

Authors: Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: offering strong reliability, real-world industrial deployments, industrial deployments, offering strong, reliability and controllability

Notes: Work in progress

Abstract: At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at this https URL.

6. 【2604.19656】Pause or Fabricate? Training Language Models for Grounded Reasoning

Link: https://arxiv.org/abs/2604.19656

Authors: Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, Xu Tan, Yao Hu, Daoxin Zhang, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen

Categories: Computation and Language (cs.CL)

Keywords: Large language models, achieved remarkable progress, Large language, achieved remarkable, remarkable progress

Notes: Code: [this https URL](https://github.com/ZJU-REAL/GRIL)

Abstract:Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.

7. 【2604.19645】The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

Link: https://arxiv.org/abs/2604.19645

Authors: Andrew Hong, Jason Potteiger, Luis E. Zapata

Categories: Computation and Language (cs.CL)

Keywords: predicts fan-reported experience, fan-reported experience ratings, prompt predicts fan-reported, prompt, GPT

Notes: 42 pages, 7 figures, 10 tables

Abstract:An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that "prompt engineering helps a little" but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.
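
The "within +/-1 agreement" figure quoted in this abstract can be read as the share of predicted ratings that land no more than one point from the fan-reported rating. A minimal sketch under that assumption follows. The function name, the sample ratings, and the rating scale are illustrative and do not come from the paper.

```python
def within_one_agreement(predicted, reported, tolerance=1):
    """Share of predictions within `tolerance` points of the reported rating."""
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    if not predicted:
        raise ValueError("need at least one rating pair")
    # Count the pairs whose absolute difference is at most `tolerance`.
    hits = sum(1 for p, r in zip(predicted, reported) if abs(p - r) <= tolerance)
    return hits / len(predicted)


# Made-up ratings for illustration only (not data from the paper).
predicted = [8, 6, 9, 4, 7, 10]
reported = [9, 6, 6, 5, 7, 7]
agreement = within_one_agreement(predicted, reported)
```

Here four of the six pairs differ by at most one point, so `agreement` is 4/6. Setting `tolerance=0` turns the same function into exact-match accuracy.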

8. 【2604.19642】Micro Language Models Enable Instant Responses

Link: https://arxiv.org/abs/2604.19642

Authors: Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota

Categories: Computation and Language (cs.CL)

Keywords: inference introduces multi-second, introduces multi-second latencies, cloud inference introduces, Edge devices, language models due

Notes:

Abstract: Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models ($\mu$LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale with our models matching several 70M-256M-class existing models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that $\mu$LMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at this https URL.

9. 【2604.19638】SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.19638

Authors: Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, proactively address safety

Notes: Work accepted at ACL 2026 Findings

Abstract:Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under this https URL

10. 【2604.19620】The "Small World of Words" German Free-Association Norms

Link: https://arxiv.org/abs/2604.19620

Authors: Samuel Aeschbach, Rui Mata, Kaidi Lõo, Simon De Deyne, Dirk U. Wulff

Categories: Computation and Language (cs.CL)

Keywords:

Notes:

None

11. 【2604.19598】Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Link: https://arxiv.org/abs/2604.19598

Authors: Kihyuk Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords:

Notes: 24 Pages, 2 Figures, 6 Tables and 2 Supplementary Materials

None

12. 【2604.19593】RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

Link: https://arxiv.org/abs/2604.19593

Authors: Mircea Timpuriu, Dumitru-Clementin Cercel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: realistic legal data, importance of clear, clear and correct, correct text, meant to assist

Notes:

Abstract: The importance of clear and correct text in legal documents cannot be overstated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.

13. 【2604.19584】A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry

Link: https://arxiv.org/abs/2604.19584

Authors: Silvio Calderaro, Johanna Monti

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, interest of Natural, oral linguistic heritage, Language Processing, Natural Language

Notes: Accepted at the DIALRES Workshop, LREC-COLING 2026

Abstract:The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation, metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.

14. 【2604.19578】Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Link: https://arxiv.org/abs/2604.19578

Authors: Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: Large Language Models, Language Models, Large Language, faced unprecedented disruptions, advancement of Large

Notes: Scientometrics

Abstract: With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendation for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

15. 【2604.19572】A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

Link: https://arxiv.org/abs/2604.19572

Authors: Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin

Categories: Computation and Language (cs.CL)

Keywords: support future decisions, multi-turn terminal-centric agentic, terminal-centric agentic tasks, model capabilities advance, raw environment feedback

Notes: 23 pages

Abstract:As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.
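
The quadratic growth mentioned in this abstract follows from simple arithmetic: if every turn appends one observation and the agent re-reads the whole history on each turn, the total tokens processed over n turns is proportional to 1 + 2 + ... + n = n(n+1)/2. The sketch below illustrates that arithmetic only. The fixed observation size and the `keep_ratio` compression factor are simplifying assumptions for illustration, not a description of how TACO works.

```python
def cumulative_prompt_tokens(steps, tokens_per_observation, keep_ratio=1.0):
    """Total tokens read over `steps` turns when each turn re-reads all
    observations so far, each shrunk to `keep_ratio` of its raw size."""
    total = 0.0
    history = 0.0
    for _ in range(steps):
        # The new observation joins the history (after hypothetical compression).
        history += tokens_per_observation * keep_ratio
        # The whole history is fed to the model again on this turn.
        total += history
    return total


raw_10 = cumulative_prompt_tokens(10, 500)          # 500 * (10 * 11 / 2)
raw_20 = cumulative_prompt_tokens(20, 500)          # 500 * (20 * 21 / 2)
half_10 = cumulative_prompt_tokens(10, 500, 0.5)    # same shape, half the cost
```

Doubling the number of turns from 10 to 20 raises the total from 27,500 to 105,000 tokens, roughly a fourfold increase. Under this toy model a fixed compression ratio lowers the constant factor but leaves the quadratic shape in place.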

16. 【2604.19566】Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference

Link: https://arxiv.org/abs/2604.19566

Authors: François Remy

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: strong ranking performance, find systematic model, systematic model failures, clinical retrieval requires, Reliable biomedical

Notes:

Abstract: Reliable biomedical and clinical retrieval requires more than strong ranking performance: it requires a practical way to find systematic model failures and curate the training evidence needed to correct them. Late-interaction models such as ColBERT provide a first solution thanks to the interpretable token-level interaction scores they expose between document and query tokens. Yet this interpretability is shallow: it explains a particular document--query pairwise score, but does not reveal whether the model has learned a clinical concept in a stable, reusable, and context-sensitive way across diverse expressions. As a result, these scores provide limited support for diagnosing misunderstandings, identifying unreasonably distant biomedical concepts, or deciding what additional data or feedback is needed to address this. In this short position paper, we propose Diagnosable ColBERT, a framework that aligns ColBERT token embeddings to a reference latent space grounded in clinical knowledge and expert-provided conceptual similarity constraints. This alignment turns document encodings into inspectable evidence of what the model appears to understand, enabling more direct error diagnosis and more principled data curation without relying on large batteries of diagnostic queries.

17. 【2604.19565】Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

Link: https://arxiv.org/abs/2604.19565

Authors: Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Speech Large Language, pose significant risks, Language Models, Large Language

Notes: Accepted to Findings of ACL 2026

Abstract:Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.
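
The abstract names its attention-derived metrics without defining them, so the sketch below is only one plausible reading and not the paper's formulation. It assumes a ratio-style feature is the share of one head's attention mass that falls on audio positions, and an entropy-style feature is the Shannon entropy of the attention renormalized over a chosen set of positions. The function names and the toy attention row are hypothetical.

```python
import math


def audio_ratio(attention, audio_positions):
    """Share of one head's attention mass that lands on audio positions."""
    total = sum(attention)
    audio_mass = sum(attention[i] for i in audio_positions)
    return audio_mass / total


def attention_entropy(attention, positions):
    """Shannon entropy (in nats) of attention renormalized over `positions`."""
    mass = [attention[i] for i in positions]
    total = sum(mass)
    # Zero-weight positions contribute nothing to the entropy.
    probs = [m / total for m in mass if m > 0]
    return -sum(p * math.log(p) for p in probs)


# Toy attention row: positions 0-3 are audio frames, 4-5 are text tokens.
attention = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]
audio_positions = [0, 1, 2, 3]
text_positions = [4, 5]

ratio = audio_ratio(attention, audio_positions)
audio_h = attention_entropy(attention, audio_positions)
text_h = attention_entropy(attention, text_positions)
```

With uniform weights inside each group, the entropies equal log 4 and log 2 and the audio share is 0.4. Per-head values like these could serve as the input features for a lightweight classifier such as the logistic regression the abstract mentions.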

18. 【2604.19559】Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics

Link: https://arxiv.org/abs/2604.19559

Authors: Syed Sajid Ullah, Amir Khan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: intelligence remain scarce, translate real-time physiological, real-time physiological data, actionable safety intelligence, safety intelligence remain

Notes:

Abstract:Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.

19. 【2604.19548】Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Link: https://arxiv.org/abs/2604.19548

Authors: Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large Language Model, complex autonomous workflows, Large Language, static text generators, dynamic systems capable

Notes: ACL 2026 Main Conference. Project page: [this https URL](https://unikcc.github.io/ReTAS/)

Abstract:Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.

20. 【2604.19547】Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment

Link: https://arxiv.org/abs/2604.19547

Authors: Tianxiang Ma, Weijie Feng, Xinyu Wang, Zhiyong Cheng

Categories: Computation and Language (cs.CL)

Keywords: Emotion-Cause Pair Extraction, Extraction in Conversations, Pair Extraction, aims to identify, Emotion-Cause Pair

Notes:

Abstract:Emotion-Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many-to-many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion-oriented semantics from cause-oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion-side and cause-side representations, and employ optimal transport to enable many-to-many and globally consistent emotion-cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state-of-the-art performance. Our codes are released at this https URL.

21. 【2604.19508】Bangla Key2Text: Text Generation from Keywords for a Low Resource Language

Link: https://arxiv.org/abs/2604.19508

Authors: Tonmoy Talukder, G M Shahariar

Categories: Computation and Language (cs.CL)

Keywords: text pairs designed, paper introduces, designed for keyword-driven, million Bangla keyword, Bangla

Comments: 18 pages, uses lrec2026.sty

Abstract:This paper introduces Bangla Key2Text, a large-scale dataset of 2.6 million Bangla keyword-text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword-text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.

22. 【2604.19505】Enhancing Unsupervised Keyword Extraction in Academic Papers through Integrating Highlights with Abstract

Link: https://arxiv.org/abs/2604.19505

Authors: Yi Xiang, Chengzhi Zhang

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: natural language processing, Automatic keyword extraction, Automatic keyword, area of interest, interest in natural

Comments: Scientometrics

Abstract:Automatic keyword extraction from academic papers is a key area of interest in natural language processing and information retrieval. Although previous research has mainly focused on utilizing abstract and references for keyword extraction, this paper focuses on the highlights section - a summary describing the key findings and contributions, offering readers a quick overview of the research. Our observations indicate that highlights contain valuable keyword information that can effectively complement the abstract. To investigate the impact of incorporating highlights into unsupervised keyword extraction, we evaluate three input scenarios: using only the abstract, the highlights, and a combination of both. Experiments conducted with four unsupervised models on Computer Science (CS), Library and Information Science (LIS) datasets reveal that integrating the abstract with highlights significantly improves extraction performance. Furthermore, we examine the differences in keyword coverage and content between abstract and highlights, exploring how these variations influence extraction outcomes. The data and code are available at this https URL.

23. 【2604.19502】Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

Link: https://arxiv.org/abs/2604.19502

Authors: Bowen Li, Haochen Ma, Yuxin Wang, Jie Yang, Xinchi Chen, Xuanjing Huang, Yining Zheng, Xipeng Qiu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, treat reviewing primarily, Language Models, Large Language, rating prediction task

Comments: 38 pages, 8 figures, 4 tables

Abstract:The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification (its arguments, questions, and critique) rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics, particularly the recall of weakness arguments, correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.

24. 【2604.19499】Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

Link: https://arxiv.org/abs/2604.19499

Authors: Dmitry Pronin, Evgeny Kazartsev

Categories: Computation and Language (cs.CL)

Keywords: generalise Burrows classical, Burrows classical Delta, applying distance functions, distance functions designed, Delta

Comments: Under review at Digital Scholarship in the Humanities. Code available at: [this https URL](https://github.com/DDPronin/Rank-Turbulence-Delta)

Abstract:This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus.
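For readers unfamiliar with the baseline, the sketch below shows the classical Burrows's Delta (mean absolute difference of z-scored word frequencies) and a plain base-2 Jensen-Shannon divergence between two frequency distributions. It is only an illustration of the two building blocks the abstract names, not the authors' Rank-Turbulence Delta or Jensen-Shannon Delta; the function names, the use of the population standard deviation, and the choice to skip zero-variance words are assumptions made here.

```python
import math
from statistics import mean, pstdev


def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    """Classical Burrows's Delta between two documents.

    freqs_a, freqs_b: relative frequencies of the same word list in two texts.
    corpus_freqs: one such frequency vector per corpus document, used to
    estimate each word's mean and standard deviation (assumption: population
    standard deviation; some implementations use the sample version).
    """
    per_word = []
    for i in range(len(freqs_a)):
        column = [doc[i] for doc in corpus_freqs]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # assumption: a word with no variation is skipped
        z_a = (freqs_a[i] - mu) / sigma
        z_b = (freqs_b[i] - mu) / sigma
        per_word.append(abs(z_a - z_b))  # this word's contribution
    return mean(per_word)


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two probability vectors."""
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because Delta is an average of per-word terms, each term can be read as that word's contribution to the distance, which is the kind of token-level reading the abstract describes.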

25. 【2604.19485】EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Link: https://arxiv.org/abs/2604.19485

Authors: Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM post-training faces, fundamental design choice, Reinforcement learning, LLM post-training, design choice

Abstract:Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
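The gating rule described in the abstract can be sketched in a few lines. The code below uses the standard definition of explained variance, EV = 1 - Var(returns - values) / Var(returns), and switches between a critic baseline and a batch-mean baseline at the zero threshold. The function names and the handling of a zero-variance batch are assumptions made here; the paper's exact estimator may differ.

```python
from statistics import mean, pvariance


def explained_variance(returns, values):
    """Standard explained variance of critic predictions on one batch."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        # Assumption: a batch with identical returns gives the critic
        # nothing to explain, so report EV = 0.
        return 0.0
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def gated_advantages(returns, values):
    """Use the critic baseline when EV > 0, else the batch-mean baseline."""
    ev = explained_variance(returns, values)
    if ev > 0:
        advantages = [r - v for r, v in zip(returns, values)]
    else:
        batch_mean = mean(returns)
        advantages = [r - batch_mean for r in returns]
    return advantages, ev
```

With sparse 0/1 returns, a critic that tracks the outcome yields a positive EV and is used as the baseline, while a critic that is anti-correlated with the outcome yields a negative EV and the batch mean takes over.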

26. 【2604.19477】Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

Link: https://arxiv.org/abs/2604.19477

Authors: Hyunjung Joo, GyeongTaek Lee

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: discrete tonal categories, Seoul Korean, defined with discrete, discrete tonal, Seoul

Abstract:The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this end, we introduce the first large-scale benchmark dataset, consisting of 10,093 manually annotated Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.

27. 【2604.19464】LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues

Link: https://arxiv.org/abs/2604.19464

Authors: Fanyu Wang, Xiaoxi Kang, Paul Burgess, Aashish Srivastava, Chetan Arora, Adnan Trakic, Lay-Ki Soon, Md Khalid Hossain, Lizhen Qu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: global population struggles, limited legal resources, Large Language Models, Malaysian Contract Act, global population

Comments: Accepted by ACL 2026 Main Conference

Abstract:More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs' capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts, which reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neuro component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.

28. 【2604.19459】Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning

Link: https://arxiv.org/abs/2604.19459

Authors: Kyuhee Kim, Auguste Poiroux, Antoine Bosselut

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Formal verification guarantees, Formal verification, verification guarantees proof, guarantees proof validity, verification guarantees

Comments: 25 pages, 4 figures, 22 tables. Published at the VerifAI-2 Workshop, ICLR 2026 (non-archival). Code and data: [this https URL](https://github.com/koreankiwi99/formalization-gaming)

Abstract:Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at this https URL.

29. 【2604.19447】'The Order in the Horse's Heart': A Case Study in LLM-Assisted Stylometry for the Discovery of Biblical Allusion in Modern Literary Fiction

Link: https://arxiv.org/abs/2604.19447

Authors: Ewan Cameron

Categories: Computation and Language (cs.CL)

Keywords: King James Bible, Cormac McCarthy, present a dual-track, fiction and apply, James Bible

Comments: 39 pages, 1 figure

Abstract:We present a dual-track pipeline for detecting biblical allusions in literary fiction and apply it to the novels of Cormac McCarthy. A bottom-up embedding track uses inverse document frequency to identify rare vocabulary shared with the King James Bible, embeds occurrences in their local context for sense disambiguation, and passes candidate passage pairs through cascaded LLM review. A top-down register track asks an LLM to read McCarthy's prose undirected to any specific biblical passage for comparison, catching allusions not distinguished by word or phrase rarity. Both tracks are cross-validated by a long-context model that holds entire novels alongside the KJV in a single pass, and every finding is checked against published scholarship. Restricting attention to allusions that carry a textual echo--shared phrasing, reworked vocabulary, or transplanted cadence--and distinguishing literary allusions proper from signposted biblical references (similes naming biblical figures, characters overtly citing scripture), the pipeline surfaces 349 allusions across the corpus. Among a target set of 115 previously documented allusions retrieved through human review of the academic literature, the pipeline independently recovers 62 (54% recall), with recall varying by connection type from 30% (transformed imagery) to 80% (register collisions). We contextualise these results with respect to the value-add from LLMs as assistants to mechanical stylometric analyses, and their potential to facilitate the statistical study of intertextuality in massive literary corpora.

30. 【2604.19440】What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Link: https://arxiv.org/abs/2604.19440

Authors: Xinhao Zhang, Xi Chen, François Portet, Maxime Peyrard

Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: orchestrating large language, Recent work, demonstrated the promise, promise of orchestrating, agentic optimization systems

Comments: 9 pages, 8 figures, Accepted at Findings of ACL 2026

Abstract:Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.

31. 【2604.19412】VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing

Link: https://arxiv.org/abs/2604.19412

Authors: Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou, Hongfei Wang, Zhaofan Zou, Hao Sun, Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large vision-language models, Large vision-language, frequently suffer, input image, Large

Comments: ICASSP 2026

Abstract:Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model's response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model's activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate its influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model's original computational efficiency.

32. 【2604.19405】Lost in Translation: Do LVLM Judges Generalize Across Languages?

Link: https://arxiv.org/abs/2604.19405

Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang

Categories: Computation and Language (cs.CL)

Keywords: Automatic evaluators, play a central, central role, Automatic, LVLM judges

Comments: Accepted at ACL 2026 Findings

Abstract:Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

33. 【2604.19395】Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

Link: https://arxiv.org/abs/2604.19395

Authors: Sho Hoshino, Ukyo Honda, Peinan Zhang

Categories: Computation and Language (cs.CL)

Keywords: targeted evaluation grounds, symbolic reasoning, unclear due, lack of targeted, targeted evaluation

Comments: ACL 2026

Abstract:While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.
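As a reminder of what self-consistency does mechanically, the sketch below takes the final answers from several independently sampled chain-of-thought completions and returns the majority answer with its vote share. The function name and the answer normalisation (trimming whitespace, upper-casing a multiple-choice letter) are assumptions made here; sampling the completions from a model is outside the sketch.

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Majority vote over final answers from several sampled completions.

    Returns the winning answer and the fraction of samples that agree.
    Ties go to the answer that appeared first among the samples.
    """
    # Assumption: answers are short labels such as "A".."D", so trimming
    # and upper-casing is enough to make equivalent answers compare equal.
    counts = Counter(answer.strip().upper() for answer in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Five sampled completions for one multiple-choice question.
winner, agreement = self_consistency_vote(["B", "b ", "C", "B", "A"])
print(winner, agreement)
```

The vote share can double as a rough confidence signal: a question where the samples split evenly is one where the majority answer is less trustworthy.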

34. 【2604.19394】Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

Link: https://arxiv.org/abs/2604.19394

Authors: Niclas Doll, Jasper Schulze Buschhoff, Shalaka Satheesh, Hammam Abdelwahab, Héctor Allende-Cid, Katrin Klug

Categories: Computation and Language (cs.CL)

Keywords: significantly larger general-purpose, gap between small, paper narrows, continual pre-training, larger general-purpose models

Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, San Diego, California, July 2 - 7, 2026) as a main conference paper

Abstract:This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from 7B to 24B parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances 7B model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately 3.5-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized 7B models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.

35. 【2604.19351】DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

Link: https://arxiv.org/abs/2604.19351

Authors: Jinyu Guo, Zhihan Zhang, Yutong Li, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang

Categories: Computation and Language (cs.CL)

Keywords: large language models, quadratic computational complexity, quadratic computational, constitutes a fundamental, fundamental bottleneck

Comments: Accepted by ACL 2026 (Findings)

Abstract:The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at this https URL

36. 【2604.19342】Are Large Language Models Economically Viable for Industry Deployment?

Link: https://arxiv.org/abs/2604.19342

Authors: Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora, Rafiq Ali, Ebad Shabbir, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, healthcare decision support, Large Language, Generative AI-powered, AI-powered by Large

Comments: Accepted at ACL 2026 (Industry Track)

Abstract:Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics: Economic Break-Even ($N_{\mathrm{break}}$), Intelligence-Per-Watt (IPW), System Density ($\rho_{\mathrm{sys}}$), Cold-Start Tax ($C_{\mathrm{tax}}$), and Quantization Fidelity ($Q_{\mathrm{ret}}$), capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the 2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.

37. 【2604.19331】Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation

Link: https://arxiv.org/abs/2604.19331

Authors: Eoghan Cunningham, Derek Greene, James Cross, Antonio Rago

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, democratic process, debated and justified, fundamental aspect, Understanding

Comments: Accepted at KR'26 In The Wild Track. Camera ready to follow

Abstract:Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.

38. 【2604.19321】RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Link: https://arxiv.org/abs/2604.19321

Authors: Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fine-tuning Large Language, Large Language Models, Large Language, remains structurally uncertain, Fine-tuning Large

Abstract:Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
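To make the layer-selection idea concrete, the sketch below runs a textbook Ramer-Douglas-Peucker simplification over a sequence of points in any dimension and returns the indices of the points it keeps. Treating each point as one layer's hidden-state summary is an assumption made here for illustration. The classical algorithm takes a distance tolerance `epsilon`; the abstract does not say how the authors set it or how they arrive at 13 layers, so that part is not reproduced. Distances are measured to the segment between the two endpoints, a common variant of the perpendicular-distance formulation.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_sq = sum(x * x for x in ab)
    if ab_sq == 0:
        return math.dist(p, a)  # degenerate segment: a and b coincide
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / ab_sq))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(p, closest)


def rdp_indices(points, epsilon):
    """Indices of the points kept by Ramer-Douglas-Peucker simplification."""
    if len(points) < 3:
        return list(range(len(points)))
    keep = {0, len(points) - 1}
    stack = [(0, len(points) - 1)]
    while stack:
        start, end = stack.pop()
        best_dist, best_idx = 0.0, None
        for i in range(start + 1, end):
            d = _point_segment_distance(points[i], points[start], points[end])
            if d > best_dist:
                best_dist, best_idx = d, i
        # Keep the farthest interior point only if it deviates by more
        # than epsilon, then simplify both halves in turn.
        if best_idx is not None and best_dist > epsilon:
            keep.add(best_idx)
            stack.append((start, best_idx))
            stack.append((best_idx, end))
    return sorted(keep)
```

In the layer-selection setting, the returned indices would play the role of the "critical breakpoints" along the representation path, and only those layers would receive LoRA adapters.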

39. 【2604.19299】Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

Link: https://arxiv.org/abs/2604.19299

Authors: Xinlin Wang, Mats Brorsson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: privacy risks hinder, large language models, substantial computational costs, real-world applications, Small Language Models

Abstract:Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of 10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.

40. 【2604.19298】IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

Link: https://arxiv.org/abs/2604.19298

Authors: Rajveer Singh Pall

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: assessing large language, Indian financial regulatory, Indian financial, financial regulatory text, financial NLP benchmarks

Comments: 24 pages, 4 figures, 11 tables. Dataset and evaluation code at https://github.com/rajveerpall/IndiaFinBench

Abstract:We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at this https URL
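
As a reading aid for the agreement figures above: assuming the reported kappa is Cohen's kappa (the abstract does not say which variant is used), it corrects raw agreement for the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration; the function name and the toy labels are hypothetical and are not taken from the IndiaFinBench code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: 75% raw agreement, 50% expected by chance.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

This is why a 76.7% raw agreement can correspond to a noticeably lower kappa: part of that agreement would occur even if the annotators labelled at random with the same label frequencies.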

41. 【2604.19292】Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

Link: https://arxiv.org/abs/2604.19292

Authors: Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady, Sivan Eiger, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Reut Tsarfaty

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multilingual large language, Multilingual large, minimized the fluency, fluency gap, Multilingual

Comments: ACL 2026 main conference

Abstract:Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.

42. 【2604.19281】Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Link: https://arxiv.org/abs/2604.19281

Authors: Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, addressing medical questions, Language Models, increasingly prevalent

Comments: Accepted in the Ninth Annual ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2026

Abstract:The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

43. 【2604.19274】HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Link: https://arxiv.org/abs/2604.19274

Authors: Euntae Kim, Soomin Han, Buru Chang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, refine their content, begin with rough, users begin

Abstract:Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models -- filling incomplete drafts with dangerous content -- to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains -- including Explosives, Drugs, Weapons, and Cyberattacks -- and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at this https URL

44. 【2604.19262】CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Link: https://arxiv.org/abs/2604.19262

Authors: Peiqin Lin, Chenyang Lyu, Wenjiang Luo, Haotian Ye, Md Mehrab Hossain, Chunlan Ma, Shaoxiong Ji, Younes Samih, Bo Zeng, Fan Jiang, Yuanbin Cao, Dilda Duisenbek, Adrian Neo Sau Xun, Daria Pozdniakova, Liubou Misevich, Nevena Marinković, Ngoc Gia Linh Nguyen, Thi Khanh Linh Do, Sarakmatak Sophy, Baotian Hu, Guanhua Chen, Gongbo Tang, Alham Fikri Aji, Longyue Wang, Weihua Luo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, grounded tasks, deployed worldwide, inspiring a surge

Abstract:Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human--AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.

45. 【2604.19261】Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

Link: https://arxiv.org/abs/2604.19261

Authors: Alessandro Maisto

Categories: Computation and Language (cs.CL)

Keywords: involves subjective factors, narrative quality remains, character development, complex challenge, emotional impact

Comments: 9th International Workshop on Computational Models of Narrative (CMN '26), 8-11 June 2026, Madrid. 15 pages

Abstract:The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.

46. 【2604.19254】ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

Link: https://arxiv.org/abs/2604.19254

Authors: Xianming Li, Zongxi Li, Tsz-fung Andrew Lee, Jing Li, Haoran Xie, Qing Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, Parameter-efficient fine-tuning, training cost, full-parameter fine-tuning, language models

Abstract:Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.

47. 【2604.19245】Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

Link: https://arxiv.org/abs/2604.19245

Authors: Clara Lachenmaier, Hannah Bultmann, Sina Zarrieß

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remains underexplored, human-LLM interaction, important resource, resource for resolving, resolving trouble

Abstract:Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.

48. 【2604.19189】Headlines You Won't Forget: Can Pronoun Insertion Increase Memorability?

Link: https://arxiv.org/abs/2604.19189

Authors: Selina Meyer (1), Magdalena Abel (2), Michael Roth (1) ((1) Natural Language Understanding Lab, University of Technology Nuremberg, (2) Cognitive Psychology Lab, University of Technology Nuremberg)

Categories: Computation and Language (cs.CL)

Keywords: drive action, relevant information, influence beliefs, beliefs and drive, retained and retrievable

Comments: To be published at the 15th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2026)

Abstract:For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted and their immediate contexts. Additional data and fine-grained analysis are needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.

49. 【2604.19185】SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

Link: https://arxiv.org/abs/2604.19185

Authors: Bo-Jyun Wang, Ying-Jia Lin, Hung-Yu Kao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Small language models, Small language, large language models, large language, comparable to large

Comments: Accepted by ACL 2026 Findings

Abstract:Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at this https URL.

50. 【2604.19162】Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

Link: https://arxiv.org/abs/2604.19162

Authors: Hongxing Pan, Yingying Guo, Wenqing Kuang, Jiashi Lu

Categories: Computation and Language (cs.CL); Applications (stat.AP)

Keywords: large language models, paper studies uncertainty, language models, paper studies, large language

Comments: 7 pages, 1 figure, 3 tables

Abstract:This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size--that is, the number of distinct meanings expressed in the sampled responses--provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
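
To make the coverage idea above concrete, here is a minimal sketch of the classic Good-Turing coverage estimate, which approximates the probability mass of unseen categories by the fraction of samples that are singletons. It is not SHADE and not the generalized variant the abstract refers to; the function name and the toy cluster labels (standing in for semantic clusters of sampled responses) are assumptions made for illustration.

```python
from collections import Counter


def good_turing_coverage(cluster_labels):
    """Classic Good-Turing estimate of sample coverage: 1 - N1 / n.

    N1 is the number of clusters observed exactly once and n is the
    number of samples. Low coverage suggests unseen semantic modes.
    """
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n


# Five sampled responses grouped into hypothetical meaning clusters:
# three say "a", one says "b", one says "c".
coverage = good_turing_coverage(["a", "a", "a", "b", "c"])
```

With two singletons among five samples, the estimate is 0.6, meaning roughly 40% of the probability mass is attributed to meanings not yet observed. A frequency-only count of distinct clusters would miss that mass, which is the undercounting problem the abstract describes for small samples.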

51. 【2604.19151】Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Link: https://arxiv.org/abs/2604.19151

Authors: Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Existing Indic ASR, leaderboard driven evaluation, Existing Indic, dataset specific overfitting, specific overfitting

Comments: 6 pages, 4 figures

Abstract:Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.

52. 【2604.19149】How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

Link: https://arxiv.org/abs/2604.19149

Authors: Haoyang Chen, Yi Liu, Jianzhi Shao, Tao Zhang, Chengfu Huo, Wei Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Thinking LLMs produce, Thinking LLMs, LLMs produce reasoning, Thinking, produce reasoning traces

Comments: Accepted in the Findings of ACL 2026

Abstract:Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces. It remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze the answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Following this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain and disorganized reading. Experiments show that our method yields consistent accuracy gains.

53. 【2604.19144】ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

Link: https://arxiv.org/abs/2604.19144

Authors: Kunquan Li, Yingxue Zhang, Fandong Meng, Jinsong Su

Categories: Computation and Language (cs.CL)

Keywords: applying Large Reasoning, witnessed growing interest, Recent years, applying Large, Large Reasoning Models

Abstract:Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a "think-first-then-translate" paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a "translate-first-think-later" paradigm. Our approach develops the model's "translate-reflect-refine" capability through reinforcement learning. In the first stage, we cultivate the model's capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model's first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.

54. 【2604.19139】The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

Link: https://arxiv.org/abs/2604.19139

Authors: Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Reinforcement Learning, pervade model outputs, increasingly conspicuous phenomenon

Comments: 20 pages, 17 figures, 8 tables. Technical report

Abstract:As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics -- repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the "alignment tax" of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.

55. 【2604.19137】Construction of Knowledge Graph based on Language Model

Link: https://arxiv.org/abs/2604.19137

Authors: Qiubai Zhu, Qingwang Wang, Haibin Yuan, Wei Chen, Tao Shen

Categories: Computation and Language (cs.CL)

Keywords: effectively integrate valuable, integrate valuable information, effectively integrate, integrate valuable, rapidly developed

Comments: 10 pages, 3 figures. To be published in the proceedings of the 2025 13th International Conference on Information Systems and Computing Technology (ISCTech 2025)

Abstract:Knowledge Graphs (KGs) can effectively integrate valuable information from massive data, and have thus been rapidly developed and widely used in many fields. Traditional KG construction methods rely on manual annotation, which often consumes a lot of time and manpower, and KG construction schemes based on deep learning tend to have weak generalization capabilities. With the rapid development of Pre-trained Language Models (PLMs), PLMs have shown great potential in the field of KG construction. This paper provides a comprehensive review of recent research advances in the construction of KGs using PLMs. We explain how PLMs can utilize their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework, named LLHKG, based on a lightweight Large Language Model (LLM), and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to GPT3.5.

56. 【2604.19125】Do Emotions Influence Moral Judgment in Large Language Models?

Link: https://arxiv.org/abs/2604.19125

Authors: Mohammad Saim, Tianyu Jiang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, judgment remains underexplored, distinct capabilities, remains underexplored

Comments: 18 pages, 14 figures, 6 tables

Abstract:Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.

57. 【2604.19124】Detoxification for LLM: From Dataset Itself

Link: https://arxiv.org/abs/2604.19124

Authors: Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng

Categories: Computation and Language (cs.CL)

Keywords: Existing detoxification methods, large language models, Existing detoxification, Soft Contrastive Decoding, inference time

Comments: Accepted to Main Conference of ACL 2026

Abstract:Existing detoxification methods for large language models mainly focus on the post-training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training-based or controllable decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify directly on raw corpora with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, in our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: this https URL)

58. 【2604.19098】SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

Link: https://arxiv.org/abs/2604.19098

Authors: Rania Elbadry, Sarfraz Ahmad, Ahmed Heakl, Dani Bouch, Momina Ahsan, Muhra AlMahri, Marwa Elsaid khalil, Yuxia Wang, Salem Lahlou, Sophia Ananiadou, Veselin Stoyanov, Jimin Huang, Xueqing Peng, Preslav Nakov, Zhuohan Xie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: English financial NLP, Arabic financial NLP, NLP remains comparatively, remains comparatively under-explored, financial NLP remains

Comments: 29 pages

Abstract:English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari'ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.

59. 【2604.19071】HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

Link: https://arxiv.org/abs/2604.19071

Authors: Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Lin Fan, Yilin Zhou, Zikang Wang, Xiaotao Gu, Jie Tang, Hongning Wang, Minlie Huang

Categories: Computation and Language (cs.CL)

Keywords: large language models, significant challenge due, language models, remains a significant, capabilities of large

Comments: 49 pages, 6 figures, 19 tables, ACL 2026 main

Abstract:Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLMs' performance on thousand-word-level, open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we detect that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showcasing that it cannot be simply improved by input-side information piling.

60. 【2604.19070】TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

Link: https://arxiv.org/abs/2604.19070

Authors: Yilun Liu, Ruihong Qiu, Zi Huang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: integrate textual semantics, remains a challenging, challenging frontier, task-specific supervision, integrate textual

Abstract:Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at this https URL.

61. 【2604.19069】Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference

Link: https://arxiv.org/abs/2604.19069

Authors: Aby Mammen Mathew

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Neural NLI models, Neural NLI, NLI models overfit, overfit dataset artifacts, models overfit dataset

Comments: 10 pages, 3 figures, 4 tables. Single-author paper

Abstract:Neural NLI models overfit dataset artifacts instead of truly reasoning. A hypothesis-only model reaches 57.7% on SNLI, showing strong spurious correlations, and 38.6% of the baseline errors are the result of these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples where biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.
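The Product-of-Experts idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's code: the function names are hypothetical, and the role of `lam` (a weight on the bias expert's log-probabilities) is an assumption, since the abstract only reports lambda = 1.5 without defining it. The sketch adds the two experts' log-probabilities and applies cross-entropy to the renormalized product, so an example the bias-only model already gets right with high confidence contributes little loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_loss(main_logits, bias_logits, label, lam=1.5):
    """Cross-entropy of the product-of-experts ensemble for one example.

    main_logits: logits of the model being trained.
    bias_logits: logits of a frozen bias-only model (e.g. hypothesis-only).
    lam: assumed weight on the bias expert's log-probabilities.
    """
    main_lp = log_softmax(main_logits)
    bias_lp = log_softmax(bias_logits)
    # Product of experts in log space: add log-probabilities, then renormalize.
    combined = [m + lam * b for m, b in zip(main_lp, bias_lp)]
    return -log_softmax(combined)[label]


if __name__ == "__main__":
    main = [0.2, 0.1, -0.3]
    print("uninformative bias:", poe_loss(main, [0.0, 0.0, 0.0], label=0))
    print("bias confident and correct:", poe_loss(main, [5.0, 0.0, 0.0], label=0))
    print("bias confident and wrong:", poe_loss(main, [0.0, 5.0, 0.0], label=0))
```

With an uninformative (uniform) bias expert the loss reduces to ordinary cross-entropy on the main model. At inference time only the main model's logits would be used.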

62. 【2604.19052】Cell-Based Representation of Relational Binding in Language Models

Link: https://arxiv.org/abs/2604.19052

Authors: Qin Dai, Benjamin Heinzerling, Kentaro Inui

Categories: Computation and Language (cs.CL)

Keywords: discourse requires tracking, requires tracking entities, Understanding a discourse, Large Language Models, discourse requires

Abstract:Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each "cell" corresponds to an entity-relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.

63. 【2604.19048】SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

Link: https://arxiv.org/abs/2604.19048

Authors: Boyan Shi, Wei Chen, Shuyuan Zhao, Junfeng Shen, Shengnan Guo, Shaojiang Wang, Huaiyu Wan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models, Large Language, shown significant potential, Low-Rank Adaptation

Comments: ACL 2026 Findings

Abstract:The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing in current MoE-LoRA methods fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization; (2) uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms the state-of-the-art methods and holds excellent task generalization capabilities. Code is available at this https URL

64. 【2604.19047】RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

Link: https://arxiv.org/abs/2604.19047

Authors: Hanjun Cho, Jay-Yoon Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: typically assume distinct, strong inter-document similarity, assume distinct documents, exhibit strong inter-document, benchmarks typically assume

Comments: Accepted to ACL 2026 (Main Conference)

Abstract:Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.

65. 【2604.19016】AlignCultura: Towards Culturally Aligned Large Language Models?

Link: https://arxiv.org/abs/2604.19016

Authors: Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, producing contextually aware, Cultural alignment, Language Models

Comments: Accepted at ACL Mains 2026

Abstract:Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity with respect to the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment; yet no benchmark currently enables systematic evaluation of cultural alignment in line with UNESCO's principles of cultural diversity with respect to the HHH paradigm. Therefore, to address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, the HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.
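The abstract names SimHash as the leakage-prevention step but gives no details, so the following is only a generic sketch of SimHash near-duplicate detection, not the authors' pipeline. Whitespace tokenization, MD5-based token hashing, the 64-bit fingerprint size, and the `max_distance=3` threshold are all assumptions made for illustration.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint from whitespace-separated tokens.

    bits must be at most 64 here because only 8 digest bytes are used.
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Flag two prompts as near-duplicates if their fingerprints are close."""
    return hamming(simhash(text_a), simhash(text_b)) <= max_distance


if __name__ == "__main__":
    train_prompt = "Describe a traditional harvest festival in your region"
    test_prompt = "describe a traditional harvest festival in your region"
    print(is_near_duplicate(train_prompt, test_prompt))
```

In a leakage check of this kind, a candidate prompt would be compared against the fingerprints of an existing split and discarded when it falls within the distance threshold.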

66. 【2604.19005】Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

Link: https://arxiv.org/abs/2604.19005

Authors: Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K.H. Tung

Categories: Computation and Language (cs.CL)

Keywords: verification systems focused, remain a blind, explicit falsehoods, fact verification systems, factually correct

Comments: Accepted to ACL 2026

Abstract:Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at this https URL.

67. 【2604.19001】When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

Link: https://arxiv.org/abs/2604.19001

Authors: Ishita Kakkar, Enze Zhang, Rheeya Uppaal, Junjie Hu

Categories: Computation and Language (cs.CL)

Keywords: Large reasoning models, evaluation remains focused, Large reasoning, reasoning traces, produce complex

Abstract:Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces -- a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviours on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is available publicly at: this https URL

68. 【2604.18995】$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Link: https://arxiv.org/abs/2604.18995

Authors: Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Diffusion Large Language, Large Language Models, Diffusion Large, Large Language, enabling parallel token

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.

69. 【2604.18976】STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Link: https://arxiv.org/abs/2604.18976

Authors: MinJae Jung, YongTaek Lim, Chaeyun Kim, Junghwan Kim, Kihyun Kim, Minwoo Kim

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, inappropriate responses, remain susceptible

Comments: Accepted at ACL 2026 Findings

Abstract:While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at this https URL.

70. 【2604.18955】Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

Link: https://arxiv.org/abs/2604.18955

Authors: Ramtin Davoudi, Kartik Thakkar, Nazanin Donyapour, Tyler Derr, Hamid Karimi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Keywords: Media Authorship Verification, Social Media Authorship, Authorship Verification, core social media, Media Post Generation

Abstract:In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.

71. 【2604.18951】Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems

Link: https://arxiv.org/abs/2604.18951

Authors: Namyoung So, Seokgyu Jang, Taeuk Kim (Department of Computer Science, Hanyang University, Seoul, Republic of Korea)

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)

Keywords: ideal MAS behavior, Adaptive multi-agent systems

Comments: 27 pages, 4 figures. Equal contribution for the first two authors

Abstract:Adaptive multi-agent systems (MAS) are increasingly adopted to tackle complex [...]. However, the narrow task coverage of their optimization raises the question of whether they can function as general-purpose [...]. To address this gap, we conduct an extensive empirical study of adaptive MAS, revealing two key findings: (1) topological overfitting -- they fail to generalize across different domains; and (2) illusory coordination -- they achieve reasonable surface-level accuracy while the underlying agent interactions diverge from ideal MAS behavior, raising concerns about their practical [...]. These findings highlight the pressing need to prioritize generalization in MAS development and motivate evaluation protocols that extend beyond simple final-answer correctness.

72. 【2604.18944】A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

Link: https://arxiv.org/abs/2604.18944

Authors: Jiang Xiaobo, Dinghong Lai, Song Qiu, Yadong Deng, Xinkai Zhan

Categories: Computation and Language (cs.CL)

Keywords: Named Entity Recognition, sparse User-Generated Content, high-resource corpora exhibit, corpora exhibit catastrophic, User-Generated Content

Abstract:Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation -- employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to "attention blunting," ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.

73. 【2604.18943】Personalized Benchmarking: Evaluating LLMs by Individual Preferences

Link: https://arxiv.org/abs/2604.18943

Authors: Cristina Garbacea, Heran Wang, Chenhao Tan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: evaluating LLM alignment, large language models, LLM, real-world tasks, important challenge

Comments: Accepted to Findings of ACL 2026

Abstract:With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using Elo ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only $\rho = 0.04$ (57% of users show near-zero or negative correlation) and Elo ratings showing moderate correlation ($\rho = 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.
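To make the per-user ranking idea concrete, here is a minimal sketch of the standard Elo update applied separately to each user's pairwise votes. It is not the paper's implementation: the `(user, winner, loser)` vote layout, the function names, the K-factor of 32 and the initial rating of 1000 are assumptions, and ties are ignored.

```python
def expected_score(r_a, r_b):
    """Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k=32.0, initial=1000.0):
    """Sequentially update Elo ratings from (winner, loser) pairs."""
    ratings = {}
    for winner, loser in battles:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        e_w = expected_score(r_w, r_l)
        # The winner gains exactly what the loser gives up.
        ratings[winner] = r_w + k * (1.0 - e_w)
        ratings[loser] = r_l - k * (1.0 - e_w)
    return ratings


def personalized_rankings(votes, k=32.0, initial=1000.0):
    """Rank models per user from (user, winner, loser) votes."""
    battles_by_user = {}
    for user, winner, loser in votes:
        battles_by_user.setdefault(user, []).append((winner, loser))
    rankings = {}
    for user, battles in battles_by_user.items():
        ratings = elo_ratings(battles, k=k, initial=initial)
        rankings[user] = sorted(ratings, key=ratings.get, reverse=True)
    return rankings


if __name__ == "__main__":
    votes = [
        ("u1", "model_a", "model_b"),
        ("u1", "model_a", "model_b"),
        ("u2", "model_b", "model_a"),
    ]
    print(personalized_rankings(votes))
```

Pooling all votes into a single `elo_ratings` call would give the aggregate ranking that the per-user rankings are compared against. Sequential Elo depends on vote order, which is one reason an order-independent Bradley-Terry fit is often reported alongside it.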

74. 【2604.18942】Disparities In Negation Understanding Across Languages In Vision-Language Models

Link: https://arxiv.org/abs/2604.18942

Authors: Charikleia Moraitaki, Sarah Pan, Skyler Pulling, Gwendolyn Flusche, Kumail Alhamoud, Marzyeh Ghassemi

Categories: Computation and Language (cs.CL)

Keywords: exhibit affirmation bias, select positive captions, exhibit affirmation, affirmation bias, positive captions

Abstract:Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions ("X is present") even when the correct description contains negation ("no X"). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.

75. 【2604.18920】Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

Link: https://arxiv.org/abs/2604.18920

Authors: Chenqian Le, Ruisi Li, Beatrice Fumagalli, Xupeng Chen, Amirhossein Khalilian-Gourtani, Tianyu He, Adeen Flinker, Yao Wang

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: predict surface electromyography, linearly predict surface, Speech Articulatory Coding, Articulatory Coding, surface electromyography

Abstract:We test whether Speech Articulatory Coding (SPARC) features can linearly predict surface electromyography (sEMG) envelopes across aloud, mimed, and subvocal speech in twenty-four subjects. Using elastic-net multivariate temporal response function (mTRF) with sentence-level cross-validation, SPARC yields higher prediction accuracy than phoneme one-hot representations on nearly all electrodes and in all speech modes. Aloud and mimed speech perform comparably, and subvocal speech remains above chance, indicating detectable articulatory activity. Variance partitioning shows a substantial unique contribution from SPARC and a minimal unique contribution from phoneme features. mTRF weight patterns reveal anatomically interpretable relationships between electrode sites and articulatory movements that remain consistent across modes. This study focuses on representation/encoding analysis (not end-to-end decoding) and supports SPARC as a robust and interpretable intermediate target for sEMG-based silent-speech modeling.

76. 【2604.18919】Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

链接：https://arxiv.org/abs/2604.18919

作者：Yura Yoshida,Masato Kanai,Masataka Nakayama,Haruki Ohsawa,Yukiko Uchida,Arata Yuminaga,Gakuse Hoshina,Nobuo Sayama

类目：Computation and Language (cs.CL)

关键词：Analyzing topics extracted, computational social science, Analyzing topics, organizational research, extracted from text

备注：

点击查看摘要

Abstract:Analyzing topics extracted from text data in relation to external outcomes is important across fields such as computational social science and organizational research. However, existing topic modeling methods struggle to simultaneously achieve interpretability, topic specificity (alignment with concrete actions or characteristics), and polarity stance consistency (absence of mixed positive and negative evaluations within a topic). Focusing on leadership analysis using corporate review data, this study proposes a method leveraging large language models to generate topics that satisfy these properties, along with an evaluation framework tailored to external outcome analysis. The framework explicitly incorporates topic specificity and polarity stance consistency as evaluation criteria and examines automated evaluation methods based on existing metrics. Using employee reviews from OpenWork, a major corporate review platform in Japan, the proposed method achieves improved interpretability, specificity, and polarity consistency compared to existing approaches. In analyses of external outcomes such as employee morale, it also produces topics with higher explanatory power. These results suggest that the proposed method and evaluation framework provide a generalized approach for topic analysis in applications involving external outcomes.

77. 【2604.18914】MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

链接：https://arxiv.org/abs/2604.18914

作者：Mehul Agarwal,Aditya Aggarwal,Arnav Goel,Medha Hira,Anubha Gupta

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：agreement remains underexplored, morphological agreement remains, question answering, remains underexplored, translation and question

备注： 25 pages, accepted to ACL 2026 (Main)

点击查看摘要

Abstract:While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.

78. 【2604.18913】LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval

链接：https://arxiv.org/abs/2604.18913

作者：He Cheng,Yifu Wu,Saksham Khatwani,Maya Kruse,Dmitriy Dligach,Timothy A. Miller,Majid Afshar,Yanjun Gao

类目：Computation and Language (cs.CL)

关键词：large language models, language models, increasingly integrated, verifiable reasoning, large language

备注： Accepted to the ACL 2026 Main Conference. 9 pages

点击查看摘要

Abstract:Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at this https URL, and an online demo is available at this https URL.

79. 【2604.18901】Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

链接：https://arxiv.org/abs/2604.18901

作者：Isaac Llorente-Saguer

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：model residual streams, projection methods fail, residual streams, geometrically recoverable, recoverable from large

备注： 25 pages, 7 figures, 11 tables. Code at [this https URL](https://github.com/isaac-6/harm-directions)

点击查看摘要

Abstract:Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English evaluation, we characterise this geometry through six direction-finding strategies. Three succeed: a soft-AUC-optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class-mean probe reaches 0.98 and 0.71 at 1ms fitting cost; a supervised angular-deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction ($73^\circ$ from projection-based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held-out HarmBench and JailbreakBench with worst-case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains $\geq$0.98 and cross-variant transfer stays within 0.018 of own-direction performance. This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety-adjacent evaluation.
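
The class-mean probe and the two metrics named in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the plain-list vector format, and the thresholding rule in `tpr_at_fpr` are all assumptions made for the example.

```python
def class_mean_direction(pos, neg):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    dim = len(pos[0])
    mu_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    diff = [a - b for a, b in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(vectors, direction):
    """Score each vector by its dot product with the probe direction."""
    return [sum(x * w for x, w in zip(v, direction)) for v in vectors]


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """TPR at a threshold chosen so that the false-positive rate stays <= max_fpr."""
    allowed = int(max_fpr * len(neg_scores))  # negatives we may flag by mistake
    threshold = sorted(neg_scores, reverse=True)[allowed]
    return sum(p > threshold for p in pos_scores) / len(pos_scores)
```

AUROC averages over every possible threshold, while TPR@1%FPR looks only at the strict operating point a deployed filter would use. That difference is why the abstract recommends reporting both.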

80. 【2604.18897】Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

链接：https://arxiv.org/abs/2604.18897

作者：Manuel Israel Cazares

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Equational Theories Stage, SAIR Equational Theories, Theories Stage, SAIR Equational, Equational Theories

备注： Companion repository: [this https URL](https://github.com/israelcazares/sair-prompt-engineering) | Zenodo DOI: [https://doi.org/10.5281/zenodo.19598433](https://doi.org/10.5281/zenodo.19598433) | v15: final Contributor Network data (n=52, competition close April 20, 2026)

点击查看摘要

Abstract:We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas -- a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60--79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at this https URL
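
The "finite model search" mentioned in this abstract can be illustrated with a brute-force sketch. It is not the competition's tooling: the term encoding (a variable is a string, a product is a `(left, right)` tuple) and every function name below are assumptions chosen for the example.

```python
from itertools import product


def evaluate(term, table, env):
    """Evaluate a term: a variable name, or a (left, right) pair meaning left * right."""
    if isinstance(term, str):
        return env[term]
    left, right = term
    return table[evaluate(left, table, env)][evaluate(right, table, env)]


def variables(term):
    """Collect the variable names that occur in a term."""
    if isinstance(term, str):
        return {term}
    return variables(term[0]) | variables(term[1])


def holds(law, table):
    """True if lhs == rhs under every assignment of magma elements to variables."""
    lhs, rhs = law
    names = sorted(variables(lhs) | variables(rhs))
    for values in product(range(len(table)), repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(lhs, table, env) != evaluate(rhs, table, env):
            return False
    return True


def find_counterexample(premise, conclusion, max_size=3):
    """Return a multiplication table satisfying premise but not conclusion, or None."""
    for n in range(1, max_size + 1):
        for cells in product(range(n), repeat=n * n):
            table = [cells[i * n:(i + 1) * n] for i in range(n)]
            if holds(premise, table) and not holds(conclusion, table):
                return table
    return None


# Each law is a (lhs, rhs) pair of terms.
commutative = (("x", "y"), ("y", "x"))
associative = ((("x", "y"), "z"), ("x", ("y", "z")))
witness = find_counterexample(commutative, associative)
```

A returned table certifies FALSE: it is a concrete magma where the premise holds and the conclusion fails. A `None` result certifies nothing, since it only means no counterexample exists up to `max_size`.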

81. 【2604.18892】Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

链接：https://arxiv.org/abs/2604.18892

作者：Mengzhao Jia,Zhihan Zhang,Meng Jiang

类目：Computation and Language (cs.CL)

关键词：Reinforcement Learning, rewarding verifiable final, verifiable final answers, Learning with Verifiable, rewarding verifiable

备注：

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.

82. 【2604.18880】Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

链接：https://arxiv.org/abs/2604.18880

作者：Yuefei Chen,Yihao Quan,Xiaodong Lin,Ruixiang Tang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：LLMs frequently generate, frequently generate fictitious, expressing high confidence, LLMs frequently, frequently generate

备注：

点击查看摘要

Abstract:LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.

83. 【2604.18878】LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

链接：https://arxiv.org/abs/2604.18878

作者：Pedro Barbosa de Carvalho Neto

类目：Computation and Language (cs.CL)

关键词：Catarina State Court, Santa Catarina State, evaluating language models, Brazilian legal text, introduce LegalBench-BR

备注： 8 pages, 1 figure. Preprint. First public benchmark for Brazilian legal text classification. Dataset and model available on Hugging Face

点击查看摘要

Abstract:We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.

84. 【2604.18847】Human-Guided Harm Recovery for Computer Use Agents

链接：https://arxiv.org/abs/2604.18847

作者：Christy Li,Sky CH-Wang,Andi Peng,Andreea Bobu

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：real computer systems, prevent harmful actions, effectively remediate harm, execute actions, computer systems

备注：

点击查看摘要

Abstract:As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.

85. 【2604.18835】Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

链接：https://arxiv.org/abs/2604.18835

作者：Sinan G. Aksoy,Alexandra A. Sabrio,Erik VonKaenel,Lee Burke

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：probes LLM sensitivity, pairwise document comparison, multifactorial experimental framework, propose a scalable, multifactorial experimental

备注： 15 pages, 8 figures

点击查看摘要

Abstract:We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity. This is consistent with an interpretive frame account in which topically-related context may allow models to contextualize and downweight the alterations. Third, each LLM produces a qualitatively distinct scoring distribution, a stable "fingerprint" that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM-agnostic toolkit for auditing and comparing scoring behavior across current and future models.

86. 【2604.18786】Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

链接：https://arxiv.org/abs/2604.18786

作者：Seyedali Mohammadi,Manas Gaur,Francis Ferraro

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Scientific feasibility assessment, Scientific feasibility, feasibility assessment, claim is consistent, consistent with established

备注： Accepted at ACL 2026

点击查看摘要

Abstract:Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.

87. 【2604.18779】Mango: Multi-Agent Web Navigation via Global-View Optimization

链接：https://arxiv.org/abs/2604.18779

作者：Weixi Tong,Yifeng Di,Tianyi Zhang

类目：Computation and Language (cs.CL)

关键词：typically initiate exploration, Existing web agents, deep hierarchical structures, agents typically initiate, Existing web

备注：

点击查看摘要

Abstract:Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at this https URL.
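
The bandit formulation in this abstract can be sketched with Beta-Bernoulli Thompson Sampling. This is only an illustration under stated assumptions: each candidate URL is modeled as an arm with a fixed, hypothetical success probability, and Mango's actual reward signal and priors are not described here.

```python
import random


def thompson_allocate(success_prob, budget, seed=0):
    """Spend `budget` pulls across arms with Beta-Bernoulli Thompson Sampling.

    success_prob: hypothetical per-arm chance that one navigation attempt succeeds.
    Returns how many pulls each arm received.
    """
    rng = random.Random(seed)
    wins = [0] * len(success_prob)
    losses = [0] * len(success_prob)
    for _ in range(budget):
        # Draw one plausible success rate per arm from its Beta(1 + wins, 1 + losses)
        # posterior, which corresponds to a uniform prior on each arm.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = max(range(len(draws)), key=draws.__getitem__)
        # Simulate the attempt and update that arm's counts.
        if rng.random() < success_prob[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


pulls = thompson_allocate([0.05, 0.9, 0.1], budget=500)
```

Because arms are chosen by sampling from the posterior rather than by its mean, weak starting points still receive occasional attempts early on, and the budget concentrates on the promising URL as evidence accumulates.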

88. 【2604.18775】An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

链接：https://arxiv.org/abs/2604.18775

作者：Hanrui Luo,Shreyank N Gowda

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Detecting jailbreak behaviour, models remains challenging, Detecting jailbreak, strongly aligned models, aligned models produce

备注：

点击查看摘要

Abstract:Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category-level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic-specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi-sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.
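
A back-of-the-envelope model shows why moderate sampling helps and why returns then diminish. It assumes each generation is harmful independently with the same probability, which is a simplification introduced here and not a claim from the paper.

```python
def detection_probability(p_harmful, k):
    """Chance that at least one of k independent samples is harmful."""
    return 1.0 - (1.0 - p_harmful) ** k


def marginal_gains(p_harmful, budgets):
    """Extra detection probability gained between consecutive budgets in the list."""
    probs = [detection_probability(p_harmful, k) for k in budgets]
    return [b - a for a, b in zip(probs, probs[1:])]


# Gain from each additional sample when 10% of generations are harmful.
gains = marginal_gains(0.1, [1, 2, 3, 4])
```

Under this model the gain from the (k+1)-th sample is `p * (1 - p) ** k`, which shrinks geometrically, so most of the benefit arrives within the first few samples.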

89. 【2604.18759】Model-Agnostic Meta Learning for Class Imbalance Adaptation

链接：https://arxiv.org/abs/2604.18759

作者：Hanshu Rao,Guangzeng Han,Xiaolei Huang

类目：Computation and Language (cs.CL)

关键词：significantly hindering robust, challenge in NLP, hindering robust performance, Class imbalance, significantly hindering

备注： Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Class imbalance is a widespread challenge in NLP tasks, significantly hindering robust performance across diverse domains and applications. We introduce Hardness-Aware Meta-Resample (HAMR), a unified framework that adaptively addresses both class imbalance and data difficulty. HAMR employs bi-level optimizations to dynamically estimate instance-level weights that prioritize genuinely challenging samples and minority classes, while a neighborhood-aware resampling mechanism amplifies training focus on hard examples and their semantically similar neighbors. We validate HAMR on six imbalanced datasets covering multiple tasks and spanning biomedical, disaster response, and sentiment domains. Experimental results show that HAMR achieves substantial improvements for minority classes and consistently outperforms strong baselines. Extensive ablation studies demonstrate that our proposed modules synergistically contribute to performance gains and highlight HAMR as a flexible and generalizable approach for class imbalance adaptation. Code is available at this https URL.

90. 【2604.18758】Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

链接：https://arxiv.org/abs/2604.18758

作者：Abhishek Purushothama,Emma Thronson,Alexia Guo,Amir Zeldes

类目：Computation and Language (cs.CL)

关键词：Low-resource machine translation, translation requires methods, machine translation requires, support low-resource machine, Low-resource machine

备注： ACL 2026 Findings camera-ready

点击查看摘要

Abstract:Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions on difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, achieving new state-of-the-art translation results for Coptic.

91. 【2604.18756】Towards Understanding the Robustness of Sparse Autoencoders

链接：https://arxiv.org/abs/2604.18756

作者：Ahson Saiyed,Sabrina Sadiekh,Chirag Agarwal

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词：Large Language Models, Large Language, internal gradient structure, exploit internal gradient, Language Models

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.

92. 【2604.18738】Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

链接：https://arxiv.org/abs/2604.18738

作者：Lin Yao

类目：Computation and Language (cs.CL)

关键词：Masked diffusion language, diffusion language models, Masked diffusion, confidence threshold, diffusion language

备注：

点击查看摘要

Abstract:Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
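
The contrast between the two editing rules can be shown on a toy sequence. This sketch is not the paper's implementation: the low-confidence detector in `t2m_remask` is a stand-in (the abstract does not describe its three heuristics), and the data layout and names are hypothetical.

```python
MASK = "<mask>"


def t2t_edit(tokens, alternatives, threshold):
    """Token-to-Token: overwrite a committed token when a different token crosses the threshold.

    alternatives[i] is a (token, confidence) pair for the best competing token at position i.
    """
    edited = list(tokens)
    for i, (alt_token, alt_conf) in enumerate(alternatives):
        if alt_token != tokens[i] and alt_conf >= threshold:
            edited[i] = alt_token
    return edited


def t2m_remask(tokens, committed_conf, suspect_below):
    """Token-to-Mask: reset low-confidence positions to the mask state for re-prediction."""
    return [MASK if conf < suspect_below else tok
            for tok, conf in zip(tokens, committed_conf)]


tokens = ["2", "+", "2", "=", "5"]
alternatives = [("2", 0.9), ("+", 0.9), ("2", 0.9), ("=", 0.9), ("4", 0.4)]
committed_conf = [0.95, 0.9, 0.95, 0.9, 0.3]

after_t2t = t2t_edit(tokens, alternatives, threshold=0.7)
after_t2m = t2m_remask(tokens, committed_conf, suspect_below=0.5)
```

Here the T2T rule leaves the wrong final token in place because no single alternative is confident enough, which is the first failure mode the abstract lists. The T2M rule instead clears that position so the next denoising step can re-predict it.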

93. 【2604.18729】Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

链接：https://arxiv.org/abs/2604.18729

作者：Shubin Kim,Yejin Son,Junyeong Park,Keummin Ka,Seungbeen Lee,Jaeyoung Lee,Hyeju Jang,Alice Oh,Youngjae Yu

类目：Computation and Language (cs.CL)

关键词：find funny, funny often reflects, social perception, Humor, Humor holds

备注： Accepted to ACL 2026 Main Conference. The first two authors contributed equally. The last three authors are co-corresponding authors

点击查看摘要

Abstract:Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model's responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.

94. 【2604.18722】Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP

Link: https://arxiv.org/abs/2604.18722

Authors: Thanmay Jayakumar, Deepon Halder, Raj Dabre

Categories: Computation and Language (cs.CL)

Keywords: systems inhibit transfer, inhibit transfer learning, writing systems inhibit, inhibit transfer, transfer learning

Comments: 9 pages, ACL 2026 (Findings)

Abstract:Cross-lingual transfer in NLP is often hindered by the "script barrier", where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations to utilize transliterations in language models, and provide an overview of different approaches to incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discussing the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings that show how transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.

95. 【2604.18715】Characterizing AlphaEarth Embedding Geometry for Agentic Environmental Reasoning

Link: https://arxiv.org/abs/2604.18715

Authors: Mashrekur Rahman, Samuel J. Barrett, Christina Last

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Earth observation foundation, encode land surface, land surface information, Continental United States, Earth observation

Comments:

Abstract:Earth observation foundation models encode land surface information into dense embedding vectors, yet the geometric structure of these representations and its implications for downstream reasoning remain underexplored. We characterize the manifold geometry of Google AlphaEarth's 64-dimensional embeddings across 12.1 million Continental United States samples (2017-2023) and develop an agentic system that leverages this geometric understanding for environmental reasoning. The manifold is non-Euclidean: effective dimensionality is 13.3 (participation ratio) from 64 raw dimensions, with local intrinsic dimensionality of approximately 10. Tangent spaces rotate substantially, with 84% of locations exceeding 60° and local-global alignment (mean $|\cos\theta| = 0.17$) approaching the random baseline of 0.125. Supervised linear probes indicate that concept directions rotate across the manifold, and compositional vector arithmetic using both PCA-derived and probe-derived directions yields poor precision. Retrieval instead produces physically coherent results, with local geometry predicting retrieval coherence ($R^2 = 0.32$). Building on this characterization, we introduce an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database. A five-condition ablation (120 queries, three complexity tiers) shows that embedding retrieval dominates response quality ($\mu = 3.79 \pm 0.90$ vs. $3.03 \pm 0.77$ parametric-only; scale 1-5), with peak performance on multi-step comparisons ($\mu = 4.28 \pm 0.43$). A cross-model benchmark shows that geometric tools reduce Sonnet 4.5's score by 0.12 points but improve Opus 4.6's by 0.07, with Opus achieving higher geometric grounding (3.38 vs. 2.64), suggesting that the value of geometric characterization scales with the reasoning capability of the consuming model.
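For readers unfamiliar with the participation ratio mentioned above, here is a minimal sketch using the common definition, (sum of covariance eigenvalues)^2 divided by the sum of squared eigenvalues. Whether the paper computes it exactly this way is an assumption, and the data below is synthetic, not AlphaEarth embeddings. The function name and the toy dimensions are illustrative only.

```python
import random
from typing import List


def participation_ratio(rows: List[List[float]]) -> float:
    """Participation ratio of the sample covariance matrix C.

    For symmetric C, sum(eigenvalues) = trace(C) and sum(eigenvalues^2)
    equals the sum of squared entries of C, so no eigendecomposition
    is needed.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[a][a] for a in range(d))
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / frob_sq


# Synthetic stand-in data: 8 raw dimensions, only 3 carry substantial variance.
rng = random.Random(0)
scales = [1.0, 1.0, 1.0, 0.05, 0.05, 0.05, 0.05, 0.05]
data = [[rng.gauss(0.0, s) for s in scales] for _ in range(2000)]
pr = participation_ratio(data)
print(f"participation ratio: {pr:.2f} of {len(scales)} raw dimensions")
```

The value lands close to 3 rather than 8, which is the same kind of gap the abstract reports between 13.3 effective dimensions and 64 raw ones.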

96. 【2604.18712】Probing for Reading Times

Link: https://arxiv.org/abs/2604.18712

Authors: Eleftheria Tsipidi, Samuel Kiegeland, Francesco Ignazio Re, Tianyang Xu, Mario Giulianelli, Karolina Stanczak, Ryan Cotterell

Categories: Computation and Language (cs.CL)

Keywords: encode rich linguistic, capture cognitive signals, Probing has shown, rich linguistic information, representations encode rich

Comments: ACL 2026 (main conference)

Abstract:Probing has shown that language model representations encode rich linguistic information, but it remains unclear whether they also capture cognitive signals about human processing. In this work, we probe language model representations for human reading times. Using regularized linear regression on two eye-tracking corpora spanning five languages (English, Greek, Hebrew, Russian, and Turkish), we compare the representations from every model layer against scalar predictors -- surprisal, information value, and logit-lens surprisal. We find that the representations from early layers outperform surprisal in predicting early-pass measures such as first fixation and gaze duration. The concentration of predictive power in the early layers suggests that human-like processing signatures are captured by low-level structural or lexical representations, pointing to a functional alignment between model depth and the temporal stages of human reading. In contrast, for late-pass measures such as total reading time, scalar surprisal remains superior, despite its being a much more compressed representation. We also observe performance gains when using both surprisal and early-layer representations. Overall, we find that the best-performing predictor varies strongly depending on the language and eye-tracking measure.

97. 【2604.18697】Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Link: https://arxiv.org/abs/2604.18697

Authors: Ruixuan Liu, David Evans, Li Xiong

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: measured membership inference, broader memorization risks, Indistinguishability properties, low empirically measured, empirically measured membership

Comments: Accepted by SP 2026

Abstract:Indistinguishability properties such as differential privacy bounds or low empirically measured membership inference are widely treated as proxies to show a model is sufficiently protected against broader memorization risks. However, we show that indistinguishability properties are neither sufficient nor necessary for preventing data extraction in LLM APIs. We formalize a privacy-game separation between extraction and indistinguishability-based privacy, showing that indistinguishability and inextractability are incomparable: upper-bounding distinguishability does not upper-bound extractability. To address this gap, we introduce $(l, b)$-inextractability as a definition that requires at least $2^b$ expected queries for any black-box adversary to induce the LLM API to emit a protected $l$-gram substring. We instantiate this via a worst-case extraction game and derive a rank-based extraction risk upper bound for targeted exact extraction, as well as extensions to cover untargeted and approximate extraction. The resulting estimator captures the extraction risk over multiple attack trials and prefix adaptations. We show that it can provide a tight and efficient estimation for standard greedy extraction and an upper bound on the probabilistic extraction risk given any decoding configuration. We empirically evaluate extractability across different models, clarifying its connection to distinguishability, demonstrating its advantage over existing extraction risk estimators, and providing actionable mitigation guidelines across model training, API access, and decoding configurations in LLM API deployment. Our code is publicly available at: this https URL.
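To get a feel for the $2^b$ query requirement, consider a deliberately simplified model that is not the paper's rank-based estimator: assume every query independently succeeds in extracting the protected substring with the same probability `p`. The number of queries until the first success is then geometric with mean `1 / p`, so the corresponding `b` is `-log2(p)`. The function names below are hypothetical.

```python
import math


def security_bits(per_query_success: float) -> float:
    """Return b such that the expected query count 1 / p equals 2**b.

    Assumes independent, identically distributed attempts, which is a
    simplification made for this sketch only.
    """
    if not 0.0 < per_query_success <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(per_query_success)


def expected_queries(per_query_success: float) -> float:
    """Mean of a geometric distribution with success probability p."""
    return 1.0 / per_query_success


p = 2 ** -20
print(security_bits(p), expected_queries(p))
```

Under this toy model, halving the per-query success probability adds exactly one bit to `b`.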

98. 【2604.18658】Owner-Harm: A Missing Threat Model for AI Agent Safety

Link: https://arxiv.org/abs/2604.18658

Authors: Dongcheng Zhang, Yiqing Jiang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: systematic blind spot, consequential threat category, commercially consequential threat, generic criminal harm, weapon synthesis

Comments: 15 pages. Companion manuscript on per-decision proof-obligation synthesis (LSVJ-S) in preparation

Abstract:Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and a Meta agent's unauthorized forum post exposing operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior damaging the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows the gap is not inherent to owner-harm (62.7% vs. 59.3%, delta 3.4 pp) but arises from environment-bound symbolic rules that fail to generalize across tool vocabularies. On a post-hoc 300-scenario owner-harm benchmark, the gate alone achieves 75.3% TPR / 3.3% FPR; adding a deterministic post-audit verifier raises overall TPR to 85.3% (+10.0 pp) and Hijacking detection from 43.3% to 93.3%, demonstrating strong layer complementarity. We introduce the Symbolic-Semantic Defense Generalization (SSDG) framework relating information coverage to detection rate. Two SSDG experiments partially validate it: context deprivation amplifies the detection gap 3.4x (R = 3.60 vs. R = 1.06); context injection reveals that structured goal-action alignment, not text concatenation, is required for effective owner-harm detection.
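The abstract does not say how the 95% confidence interval for 4/27 was computed. As a quick sanity check, the Wilson score interval (a standard choice for small samples, assumed here rather than confirmed by the source) gives bounds that agree with the reported 5.9%-32.5% to one decimal place.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(4, 27)
print(f"{4 / 27:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

The width of the interval is a reminder that a detection rate measured on 27 tasks is a coarse estimate.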

99. 【2604.18655】Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

Link: https://arxiv.org/abs/2604.18655

Authors: Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: smartphones poses significant, poses significant engineering, significant engineering challenges, engineering challenges due, smartphones poses

Comments: Accepted at ACL 2026

Abstract:Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with SM8650 and SM8750 Qualcomm chipsets respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations - such as formal, polite, or jovial responses - within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of Generative AI in mobile platforms.

100. 【2604.18592】Two-dimensional early exit optimisation of LLM inference

Link: https://arxiv.org/abs/2604.18592

Authors: Jan Hůla, David Adamczyk, Tomáš Filip, Martin Pavlíček, Petr Sosík

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, introduce a two-dimensional, strategy that coordinates, sentence-wise exiting, large language

Comments:

Abstract:We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B-8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4-2.3x over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.

101. 【2604.18586】Who Shapes Brazil's Vaccine Debate? Semi-Supervised Modeling of Stance and Polarization in YouTube's Media Ecosystem

Link: https://arxiv.org/abs/2604.18586

Authors: Geovana S. de Oliveira, Ana P. C. Silva, Fabricio Murai, Carlos H. G. Ferreira

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Keywords: declining institutional trust, undermine immunization efforts, Vaccination remains, pandemic exposed, remains a cornerstone

Comments: Paper accepted at WebSci'26

Abstract:Vaccination remains a cornerstone of global public health, yet the COVID-19 pandemic exposed how online misinformation, political polarization, and declining institutional trust can undermine immunization efforts. Most of the prior computational studies that analyzed vaccine discourse on social platforms focus on English-language data, specific vaccines, or short time windows, impairing our understanding of long-term dynamics in high-impact, non-English contexts like Brazil, home to one of the world's most comprehensive immunization systems. We here present the largest longitudinal study of Brazil's vaccine discourse on YouTube, leveraging a semi-supervised stance detection framework that combines self-labeling and self-training to classify nearly 1.4 million comments. By integrating stance with temporal patterns, engagement metrics, and channel taxonomy (legacy media, science communicators, digital-native outlets), we map how pro- and anti-vaccine narratives evolve and circulate within a hybrid media ecosystem. Our results show that semi-supervised learning substantially improves stance classification robustness, enabling fine-grained tracking of public attitudes across Brazil's full immunization schedule. Polarization spikes during epidemiological crises, especially COVID-19, but becomes fragmented across vaccines and interaction patterns in the post-pandemic period. Notably, science communication and digital-native channels emerge as the primary loci of both supportive and oppositional engagement, revealing structural vulnerabilities in contemporary health communication. Thus, our work advances computational methods for large-scale stance modeling while offering actionable evidence for public health agencies, platform governance, and online information ecosystems.

102. 【2604.16529】Scaling Test-Time Compute for Agentic Coding

Link: https://arxiv.org/abs/2604.16529

Authors: Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, language models, improve large language, large language, scaling

Comments: 70 pages, 26 figures, 12 tables

Abstract:Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
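The parallel-scaling idea, narrowing a pool of rollout summaries through repeated small-group comparisons, can be sketched as follows. This is a rough illustration, not the paper's RTV implementation: the group size, the `pick_best` judge (presumably a model call in practice, replaced here by a trivial score comparison), and the summary fields are all assumptions. The recursion is written as a loop over rounds.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def recursive_tournament(
    candidates: Sequence[T],
    pick_best: Callable[[Sequence[T]], T],
    group_size: int = 3,
) -> T:
    """Narrow `candidates` to one winner via repeated small-group comparisons."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool: List[T] = list(candidates)
    while len(pool) > 1:
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # One judge call per group; a singleton group advances automatically.
        pool = [g[0] if len(g) == 1 else pick_best(g) for g in groups]
    return pool[0]


# Hypothetical rollout summaries; "tests_passed" is an invented field.
summaries = [
    {"id": k, "tests_passed": score}
    for k, score in enumerate([3, 7, 5, 9, 2, 8, 6])
]
winner = recursive_tournament(
    summaries, lambda group: max(group, key=lambda s: s["tests_passed"])
)
print(winner["id"])
```

Each judge call only ever sees `group_size` summaries, so the comparison stays small no matter how many rollouts are sampled.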

103. 【2604.19079】Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Link: https://arxiv.org/abs/2604.19079

Authors: Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: systems reduces development, automatic speech recognition, settings remains challenging, Unified ASR framework, Unification of automatic

Comments:

Abstract:Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.

Information Retrieval

1. 【2604.19664】ECLASS-Augmented Semantic Product Search for Electronic Components

Link: https://arxiv.org/abs/2604.19664

Authors: Nico Baumgart, Markus Lange-Hegermann, Jan Henze

Categories: Information Retrieval (cs.IR)

Keywords: LLM-based agent workflows, highly structured catalogs, emerging LLM-based agent, Efficient semantic access, identify suitable components

Comments:

Abstract:Efficient semantic access to industrial product data is a key enabler for factory automation and emerging LLM-based agent workflows, where both human engineers and autonomous agents must identify suitable components from highly structured catalogs. However, the vocabulary mismatch between natural-language queries and attribute-centric product descriptions limits the effectiveness of traditional retrieval approaches, e.g., BM25. In this work, we present a systematic evaluation of LLM-assisted dense retrieval for semantic product search on industrial electronic components, and investigate the integration of hierarchical semantics from the ECLASS standard into embedding-based retrieval. Our results show that dense retrieval combined with re-ranking substantially outperforms classical lexical methods and foundation model web-search baselines. In particular, the proposed approach achieves a Hit_Rate@5 of 94.3 %, compared to 31.4 % for BM25 on expert queries, while also exceeding foundation model baselines in both effectiveness and efficiency. Furthermore, augmenting product representations with ECLASS semantics yields consistent performance gains across configurations, demonstrating that standardized hierarchical metadata provides a crucial semantic bridge between user intent and sparse product descriptions.
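The abstract does not define Hit_Rate@5. Under the usual reading, the fraction of queries for which at least one relevant item appears among the top k results, it can be computed as in the sketch below. The query identifiers, item identifiers, and function name are made up for illustration.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant item in the top-k results."""
    if not ranked:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for query, results in ranked.items()
        if relevant.get(query, set()) & set(results[:k])
    )
    return hits / len(ranked)


# Toy data: q2's only relevant item sits at rank 6, so it misses at k=5.
ranked = {
    "q1": ["a", "b", "c", "d", "e", "f"],
    "q2": ["g", "h", "i", "j", "k", "l"],
    "q3": ["m", "n", "o", "p", "q", "r"],
}
relevant = {"q1": {"b"}, "q2": {"l"}, "q3": {"m"}}
print(hit_rate_at_k(ranked, relevant, k=5))
```

Because the metric only asks whether any relevant item made the cut, it says nothing about where in the top k the item appears.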

2. 【2604.19663】From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Link: https://arxiv.org/abs/2604.19663

Authors: Quang-Huy Nguyen, Thanh-Hai Nguyen, Khac-Manh Thai, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Masoud Mansoury, Duc-Trong Le, Hoang-Quynh Le

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: alter recommendation outcomes, identifying minimal modifications, understand recommender systems, Counterfactual explanations, recommender systems

Comments:

Abstract:Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. Here, a unified benchmarking framework is proposed to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: this https URL.

3. 【2604.19578】Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Link: https://arxiv.org/abs/2604.19578

Authors: Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: Large Language Models, Language Models, Large Language, faced unprecedented disruptions, advancement of Large

Comments: Scientometrics

4. 【2604.19566】Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference

Link: https://arxiv.org/abs/2604.19566

Authors: François Remy

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: strong ranking performance, find systematic model, systematic model failures, clinical retrieval requires, Reliable biomedical

Comments:

5. 【2604.19550】LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

Link: https://arxiv.org/abs/2604.19550

Authors: Jiakai Tang, Runfeng Zhang, Weiqiu Wang, Yifei Liu, Chuan Wang, Xu Chen, Yeqiu Yang, Jian Wu, Yuning Jiang, Bo Zheng

Categories: Information Retrieval (cs.IR)

Keywords: Transformer-based click-through rate, Scaling Transformer-based click-through, brings growing computational, Transformer-based click-through, parameters brings growing

Comments:

Abstract:Scaling Transformer-based click-through rate (CTR) models by stacking more parameters brings growing computational and storage overhead, creating a widening gap between scaling ambitions and the stringent industrial deployment constraints. We propose LoopCTR, which introduces a loop scaling paradigm that increases training-time computation through recursive reuse of shared model layers, decoupling computation from parameter growth. LoopCTR adopts a sandwich architecture enhanced with Hyper-Connected Residuals and Mixture-of-Experts, and employs process supervision at every loop depth to encode multi-loop benefits into the shared parameters. This enables a train-multi-loop, infer-zero-loop strategy where a single forward pass without any loop already outperforms all baselines. Experiments on three public benchmarks and one industrial dataset demonstrate state-of-the-art performance. Oracle analysis further reveals 0.02-0.04 AUC of untapped headroom, with models trained with fewer loops exhibiting higher oracle ceilings, pointing to a promising frontier for adaptive inference.

6. 【2604.19505】Enhancing Unsupervised Keyword Extraction in Academic Papers through Integrating Highlights with Abstract

Link: https://arxiv.org/abs/2604.19505

Authors: Yi Xiang, Chengzhi Zhang

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: natural language processing, Automatic keyword extraction, Automatic keyword, area of interest, interest in natural

Comments: Scientometrics

7. 【2604.19414】CAST: Modeling Semantic-Level Transitions for Complementary-Aware Sequential Recommendation

Link: https://arxiv.org/abs/2604.19414

Authors: Qian Zhang, Lech Szymanski, Haibo Zhang, Jeremiah D. Deng

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Sequential Recommendation, provide essential signals, complementary relations, true complementary relations, aims to predict

Comments: 10 pages, 5 figures

Abstract:Sequential Recommendation (SR) aims to predict the next interaction of a user based on their behavior sequence, where complementary relations often provide essential signals for predicting the next item. However, mainstream models relying on sparse co-purchase statistics often mistake spurious correlations (e.g., due to popularity bias) for true complementary relations. Identifying true complementary relations requires capturing the fine-grained item semantics (e.g., specifications) that simple co-occurrence statistics cannot model. While recent semantics-based methods utilize discrete semantic codes to represent items, they typically aggregate semantic codes into coarse item representations. This aggregation process blurs specific semantic details required to identify complementarity. To address these critical limitations and effectively leverage semantics for capturing reliable complementary relations, we propose a Complementary-Aware Semantic Transition (CAST) framework that introduces a new modeling paradigm built upon semantic-level transitions. Specifically, a semantic-level transition module is designed to model dynamic transitions directly in the discrete semantic code space, effectively capturing fine-grained semantic dependencies often lost in aggregated item representations. Then, a complementary prior injection module is designed to incorporate LLM-verified complementary priors into the attention mechanism, thereby prioritizing complementary patterns over co-occurrence statistics. Experiments on multiple e-commerce datasets demonstrate that CAST consistently outperforms the state-of-the-art approaches, achieving up to 17.6% Recall and 16.0% NDCG gains with 65x training acceleration. This validates its effectiveness and efficiency in uncovering latent item complementarity beyond statistics. The code will be released upon acceptance.

8. 【2604.19298】IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

Link: https://arxiv.org/abs/2604.19298

Authors: Rajveer Singh Pall

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: assessing large language, Indian financial regulatory, Indian financial, financial regulatory text, financial NLP benchmarks

Comments: 24 pages, 4 figures, 11 tables. Dataset and evaluation code at https://github.com/rajveerpall/IndiaFinBench

9. 【2604.19269】CS3: Efficient Online Capability Synergy for Two-Tower Recommendation

Link: https://arxiv.org/abs/2604.19269

Authors: Lixiang Wang, Shaoyun Shi, Peng Wang, Wenjin Wu, Peng Jiang

Categories: Information Retrieval (cs.IR)

Keywords: large-scale candidate retrieval, multi-stage pipelines commonly, multi-stage pipelines, candidate retrieval, balance effectiveness

Comments:

Abstract:To balance effectiveness and efficiency in recommender systems, multi-stage pipelines commonly use lightweight two-tower models for large-scale candidate retrieval. However, the isolated two-tower architecture restricts representation capacity, embedding-space alignment, and cross-feature interactions. Existing solutions such as late interaction and knowledge distillation can mitigate these issues, but often increase latency or are difficult to deploy in online learning settings. We propose Capability Synergy (CS3), an efficient online framework that strengthens two-tower retrievers while preserving real-time constraints. CS3 introduces three mechanisms: (1) Cycle-Adaptive Structure for self-revision via adaptive feature denoising within each tower; (2) Cross-Tower Synchronization to improve alignment through lightweight mutual awareness between towers; and (3) Cascade-Model Sharing to enhance cross-stage consistency by reusing knowledge from downstream models. CS3 is plug-and-play with diverse two-tower backbones and compatible with online learning. Experiments on three public datasets show consistent gains over strong baselines, and deployment in a large-scale advertising system yields up to 8.36% revenue improvement across three scenarios while maintaining ms-level latency.

10. 【2604.19128】GraphRAG-IRL: Personalized Recommendation with Graph-Grounded Inverse Reinforcement Learning and LLM Re-ranking

Link: https://arxiv.org/abs/2604.19128

Authors: Siqi Liang, Xiawei Wang, Yudi Zhang, Jiaying Zhou

Categories: Information Retrieval (cs.IR)

Keywords: Personalized recommendation requires, capture sequential user, sequential user preferences, Personalized recommendation, recommendation requires models

Comments:

Abstract:Personalized recommendation requires models that capture sequential user preferences while remaining robust to sparse feedback and semantic ambiguity. Recent work has explored large language models (LLMs) as recommenders and re-rankers, but pure prompt-based ranking often suffers from poor calibration, sensitivity to candidate ordering, and popularity bias. These limitations make LLMs useful semantic reasoners, but unreliable as standalone ranking engines. We present GraphRAG-IRL, a hybrid recommendation framework that combines graph-grounded feature construction, inverse reinforcement learning (IRL), and persona-guided LLM re-ranking. Our method constructs a heterogeneous knowledge graph over items, categories, and concepts, retrieves both individual and community preference context, and uses these signals to train a Maximum Entropy IRL model for calibrated pre-ranking. An LLM is then applied only to a short candidate list, where persona-guided prompts provide complementary semantic judgments that are fused with IRL rankings. Experiments show that GraphRAG-IRL is a strong standalone recommender: IRL-MLP with GraphRAG improves NDCG@10 by 15.7% on MovieLens and 16.6% on KuaiRand over supervised baselines. The results also show that IRL and GraphRAG are superadditive, with the combined gain exceeding the sum of their individual improvements. Persona-guided LLM fusion further improves ranking quality, yielding up to 16.8% NDCG@10 improvement over the IRL-only baseline on MovieLens ml-1m, while score fusion on KuaiRand provides consistent gains of 4-6% across LLM providers.
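Since the results above are reported in NDCG@10, a short sketch of that metric may help. It uses the linear-gain form of DCG (relevance divided by log2 of rank plus one); an exponential-gain variant also exists, and the abstract does not say which the paper uses. The relevance lists below are invented.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each recommended item, in the order the model ranked them.
perfect = [1, 0, 0, 0]
swapped = [0, 1, 0, 0]
print(ndcg_at_k(perfect), ndcg_at_k(swapped))
```

Moving the single relevant item from rank 1 to rank 2 drops the score from 1.0 to 1/log2(3), roughly 0.63, which shows how strongly the metric rewards getting the top of the list right.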

11. 【2604.19113】Think Before Writing: Feature-Level Multi-Objective Optimization for Generative Citation Visibility

Link: https://arxiv.org/abs/2604.19113

Authors: Zikang Liu, Peilan Xu

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: answer engines expose, ranked retrieval, fundamentally altering, Generative answer engines, Generative answer

Comments: 14 pages, 5 figures

Abstract:Generative answer engines expose content through selective citation rather than ranked retrieval, fundamentally altering how visibility is determined. This shift calls for new optimization methods beyond traditional search engine optimization. Existing generative engine optimization (GEO) approaches primarily rely on token-level text rewriting, offering limited interpretability and weak control over the trade-off between citation visibility and content quality. We propose FeatGEO, a feature-level, multi-objective optimization framework that abstracts webpages into interpretable structural, content, and linguistic properties. Instead of directly editing text, FeatGEO optimizes over this feature space and uses a language model to realize feature configurations into natural language, decoupling high-level optimization from surface-level generation. Experiments on GEO-Bench across three generative engines demonstrate that FeatGEO consistently improves citation visibility while maintaining or improving content quality, substantially outperforming token-level baselines. Further analyses show that citation behavior is more strongly influenced by document-level content properties than by isolated lexical edits, and that the learned feature configurations generalize across language models of different scales.

12. 【2604.19047】RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

Link: https://arxiv.org/abs/2604.19047

Authors: Hanjun Cho, Jay-Yoon Lee

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: typically assume distinct, strong inter-document similarity, assume distinct documents, exhibit strong inter-document, benchmarks typically assume

Comments: Accepted to ACL 2026 (Main Conference)

13. 【2604.19042】STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation

Link: https://arxiv.org/abs/2604.19042

Authors: Shuyuan Zhao, Wei Chen, Weijie Zhang, Xinrui Hou, Junfeng Shen, Boyan Shi, Shengnan Guo, Youfang Lin, Huaiyu Wan

Subjects: Information Retrieval (cs.IR)

Keywords: predict future events, future events based, TKG evolving structural, Large Language Models, historical facts

Comments: Accepted by ACL 2026

Abstract:Temporal Knowledge Graph (TKG) extrapolation aims to predict future events based on historical facts. Recent studies have attempted to enhance TKG extrapolation by integrating TKG's evolving structural representations and textual event chains into Large Language Models (LLMs). Yet, two main challenges limit these approaches: (1) the loss of essential spatial-temporal information due to shallow alignment between the TKG's evolving structural representation and the LLM's semantic space, and (2) the progressive dilution of the TKG's evolving structural features during LLM fine-tuning. To address these challenges, we propose the Spatial-Temporal Knowledge Adapter (STK-Adapter), which flexibly integrates the evolving graph encoder and the LLM to facilitate TKG reasoning. In STK-Adapter, a Spatial-Temporal MoE is designed to capture spatial structures and temporal patterns inherent in TKGs. An Event-Aware MoE is employed to model intricate temporal semantic dependencies within event chains. In addition, a Cross-Modality Alignment MoE is proposed to facilitate deep cross-modality alignment by TKG-guided attention experts. Extensive experiments on benchmark datasets demonstrate that STK-Adapter significantly outperforms state-of-the-art methods and exhibits strong generalization capabilities in cross-dataset tasks. The code is available at this https URL.

14. 【2604.18943】Personalized Benchmarking: Evaluating LLMs by Individual Preferences

Link: https://arxiv.org/abs/2604.18943

Authors: Cristina Garbacea, Heran Wang, Chenhao Tan

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: evaluating LLM alignment, large language models, LLM, real-world tasks, important challenge

Comments: Accepted to Findings of ACL 2026

15. 【2604.18845】Dual-View Training for Instruction-Following Information Retrieval

Link: https://arxiv.org/abs/2604.18845

Authors: Qingcheng Zeng, Puxuan Yu, Aman Mehta, Fuheng Zhao, Rajhans Samdani

Subjects: Information Retrieval (cs.IR)

Keywords: Instruction-following information retrieval, obey explicit user, explicit user constraints, Instruction-following information, find documents relevant

Comments:

Abstract:Instruction-following information retrieval (IF-IR) studies retrieval systems that must not only find documents relevant to a query, but also obey explicit user constraints such as required attributes, exclusions, or output preferences. However, most retrievers are trained primarily for semantic relevance and often fail to distinguish documents that match the topic from those that satisfy the instruction. We propose a dual-view data synthesis strategy based on polarity reversal: given a query, a document that is relevant under the instruction, and a hard negative that matches the query but violates the instruction, we prompt an LLM to generate a complementary instruction under which the two documents swap relevance labels. By presenting the same document pair under complementary instructions that invert their relevance labels, the training signal forces the retriever to reconsider the same candidate set through the instruction, rather than relying on fixed topical cues. On a 305M-parameter encoder, our method improves performance on the FollowIR benchmark by 45%, surpassing general-purpose embedding models of comparable or larger scale. Through head-to-head comparisons at matched data budgets, we further show that data diversity and instruction supervision play complementary roles: the former preserves general retrieval quality, while the latter improves instruction sensitivity. These results highlight the value of targeted data synthesis for building retrieval systems that are both broadly capable and instruction-aware.
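
The polarity-reversal idea described above can be sketched as a small data-construction step. This is an illustration under assumptions: the `Triple` container and `dual_view_pair` function are hypothetical names, and the complementary instruction is passed in as an argument because the LLM prompt that generates it is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    # One contrastive training example: the retriever should rank
    # `positive` above `negative` for this query + instruction.
    query: str
    instruction: str
    positive: str
    negative: str


def dual_view_pair(query, instruction, complementary_instruction,
                   relevant_doc, hard_negative):
    # View 1: the original instruction, under which relevant_doc is correct.
    original = Triple(query, instruction, relevant_doc, hard_negative)
    # View 2: the complementary instruction, under which the labels swap.
    reversed_view = Triple(query, complementary_instruction,
                           hard_negative, relevant_doc)
    return [original, reversed_view]
```

Because both views share the same query and the same two documents, only the instruction can explain the label change, which is the training signal the abstract describes.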

Computer Vision

1. 【2604.19748】Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Link: https://arxiv.org/abs/2604.19748

Authors: Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, editing have opened, opened new opportunities, Recent, virtual try-on

Comments: 24 pages, model evaluation report

Abstract:Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.

2. 【2604.19747】AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Link: https://arxiv.org/abs/2604.19747

Authors: Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: essential for modeling, remain challenging, challenging for non-generative, Sparse-view, casual captures

Comments: Webpage: [this https URL](https://yutian10.github.io/AnyRecon/)

Abstract:Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.

3. 【2604.19741】CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Link: https://arxiv.org/abs/2604.19741

Authors: Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: navigable environment, address the problem, problem of generating, real location, Existing video generative

Comments: Project page: [this http URL](http://cityrag.github.io)

Abstract:We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

4. 【2604.19740】Generalization at the Edge of Stability

Link: https://arxiv.org/abs/2604.19740

Authors: Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Training modern neural, large learning rates, modern neural networks, optimization dynamics exhibit, dynamics exhibit oscillatory

Comments: Project page: [this https URL](https://circle-group.github.io/research/GATES)

Abstract:Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.

5. 【2604.19736】Generative Drifting for Conditional Medical Image Generation

Link: https://arxiv.org/abs/2604.19736

Authors: Zirong Li, Siyuan Mei, Weiwen Wu, Andreas Maier, Lina Gölz, Yan Xia

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinically relevant imaging, relevant imaging tasks, plays an important, important role, clinically relevant

Comments:

Abstract:Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.

6. 【2604.19728】VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Link: https://arxiv.org/abs/2604.19728

Authors: Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: present VLA Foundry, VLA Foundry, VLA Foundry codebase, VLA, VLA Foundry supports

Comments: 32 pages, 16 figures, technical report

Abstract:We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-VLM-VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at this https URL and all multi-task model weights are released on this https URL. Additional qualitative videos are available on the project website this https URL.

7. 【2604.19720】ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Link: https://arxiv.org/abs/2604.19720

Authors: Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, generation remains challenging, remains challenging, challenging due, difficulty of jointly

Comments:

Abstract:Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at this https URL.

8. 【2604.19715】A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems

Link: https://arxiv.org/abs/2604.19715

Authors: Houchao Gan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Keywords: Distributed Energy Resources, Energy Resources, Distributed Energy, Distribution networks, increasingly rely

Comments:

Abstract:Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE 37-node feeder with high photovoltaic penetration and a primal-dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.

9. 【2604.19710】SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Link: https://arxiv.org/abs/2604.19710

Authors: Zewei Zhou, Ruining Yang, Xuewei (Tony) Qi, Yiluan Guo, Sherry X. Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, Jiaqi Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leveraging world knowledge, promising autonomous driving, autonomous driving paradigm, existing VLA models, offer a promising

Comments: Project page: [this https URL](https://spanvla.github.io/)

Abstract:Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with the high latency of action generation in an autoregressive framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that leverages the vision and reasoning guidance of the VLM to plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method that enables the VLA model not only to learn from positive driving samples but also to learn how to avoid typical negative behaviors and how to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset, focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, the qualitative results across diverse scenarios highlight the planning performance and robustness of our model.

10. 【2604.19702】Face Anything: 4D Face Reconstruction from Any Image Sequence

Link: https://arxiv.org/abs/2604.19702

Authors: Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: creating significant ambiguity, variations occur simultaneously, viewpoint variations occur, dynamic human faces, non-rigid deformations

Comments: Project website: [this https URL](https://kocasariumut.github.io/FaceAnything/) , Video: [this https URL](https://www.youtube.com/watch?v=wSGHpAscp0Y)

Abstract:Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3× lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.

11. 【2604.19697】Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

Link: https://arxiv.org/abs/2604.19697

Authors: Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang, Tianrun Yuan, Juntong Chen, Yongkang Zhu, Fanhu Zeng, Xuanyu Zhu, Yige Xu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: domains remains challenging, specialized domains remains, Multimodal large language, promising reasoning abilities, large language models

Comments:

Abstract:Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at this https URL.
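
To give a sense of what aligning predicted steps with reference solutions by dynamic programming can look like, here is a minimal sketch. It is not the benchmark's evaluator: the token-level Jaccard similarity, the order-preserving alignment, the normalization by reference length, and all function names are assumptions chosen for illustration.

```python
def jaccard(a, b):
    # Hypothetical step similarity: token overlap between two step strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def align_score(pred_steps, ref_steps, sim=jaccard):
    # Order-preserving alignment (LCS-style DP): each step matches at most
    # once, and matches may not cross. Returns total similarity divided by
    # the number of reference steps.
    n, m = len(pred_steps), len(ref_steps)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],  # skip a predicted step
                dp[i][j - 1],  # skip a reference step
                dp[i - 1][j - 1] + sim(pred_steps[i - 1], ref_steps[j - 1]),
            )
    return dp[n][m] / m if m else 0.0


def best_reference_score(pred_steps, references):
    # With several valid reference solutions, keep the best-matching one.
    # Assumes `references` is a non-empty list of step lists.
    return max(align_score(pred_steps, ref) for ref in references)
```

Taking the maximum over references means a prediction is not penalized for following a different but valid solution path, which is the motivation the abstract gives for using multiple reference solutions.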

12. 【2604.19680】IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow

Link: https://arxiv.org/abs/2604.19680

Authors: Zihao Fan, Xin Lu, Jie Xiao, Dong Li, Jie Huang, Xueyang Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lack fine details, single-step discriminative mappings, generative paradigms suffer, expectation learning, noise-residual coupling

Comments:

Abstract:In image restoration, single-step discriminative mappings often lack fine details via expectation learning, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is presented to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at this https URL.
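
The linear transport flow between degraded and clean images mentioned above follows the general rectified-flow recipe, which can be sketched in a few lines. This is a generic illustration on plain Python lists, not IR-Flow itself: the multilevel flows, cumulative velocity fields, and consistency constraint are not modeled, and `velocity_fn` stands in for a trained network.

```python
def interpolate(degraded, clean, t):
    # Point on the straight path from degraded (t=0) to clean (t=1).
    return [(1 - t) * d + t * c for d, c in zip(degraded, clean)]


def velocity_target(degraded, clean):
    # Regression target along a straight path: the constant displacement.
    return [c - d for d, c in zip(degraded, clean)]


def euler_sample(degraded, velocity_fn, steps=4):
    # Few-step restoration: integrate dx/dt = v(x, t) from t=0 to t=1
    # with forward Euler, starting at the degraded input.
    x = list(degraded)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

When the learned velocity field is close to the constant displacement, a handful of Euler steps (or even one) lands near the clean target, which is why straight paths pair naturally with few-step sampling.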

13. 【2604.19679】MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

Link: https://arxiv.org/abs/2604.19679

Authors: Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, enabled high-quality joint, audio-video Diffusion Transformer, joint audio-video generation, high-quality joint audio-video

Comments:

Abstract:Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.

14. 【2604.19675】MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention

Link: https://arxiv.org/abs/2604.19675

Authors: Zhi Chen, Runze Hu, Le Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: stochastic diffusion processes, continuous-time transport maps, medical image segmentation, learning continuous-time transport, recently emerged

Comments:

Abstract:Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.

15. 【2604.19673】InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Link: https://arxiv.org/abs/2604.19673

Authors: Nikita Kister,Pradyumna YM,István Sárándi,Jiayi Wang,Anna Khoreva,Gerard Pons-Moll

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Training embodied agents, people meaningfully interacting, agents to understand, diverse environments, embodied agents

Comments:

Click to view abstract

Abstract:Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.

16. 【2604.19667】Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Link: https://arxiv.org/abs/2604.19667

Authors: Yi Zhong,Buqiang Xu,Yijun Wang,Zifei Shan,Shuofei Qiao,Guozhou Zheng,Ningyu Zhang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: offering strong reliability, real-world industrial deployments, industrial deployments, offering strong, reliability and controllability

Comments: Work in progress

Click to view abstract

17. 【2604.19648】CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2604.19648

Authors: Yanhui Chen,Baoyao Yang,Siqi Liu,Jingchao Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: mask generation paradigm, prompt-driven mask generation, advances open-vocabulary semantic, generation paradigm, introducing a prompt-driven

Comments:

Click to view abstract

Abstract:SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
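
The two-stage idea in the abstract (aggregate evidence within a class, then let classes compete per pixel on one scale) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: averaging as the aggregation rule, a per-pixel softmax as the shared scale, and the function names. The abstract does not specify how CoCo-SAM3 implements either step.

```python
import math


def aggregate_synonyms(score_maps):
    """Average per-pixel scores over the synonym prompts of one class.

    score_maps: list of flat score lists, one per synonym prompt.
    Assumption: a plain mean stands in for the alignment and aggregation step.
    """
    count = len(score_maps)
    num_pixels = len(score_maps[0])
    return [sum(m[i] for m in score_maps) / count for i in range(num_pixels)]


def compete(class_scores):
    """Per-pixel softmax across classes, then pick the winning class index.

    class_scores: list of flat score lists, one per class.
    """
    labels = []
    for i in range(len(class_scores[0])):
        logits = [scores[i] for scores in class_scores]
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels


# Toy usage with three pixels: two synonym prompts for class 0, one for class 1.
cat = aggregate_synonyms([[0.9, 0.1, 0.5], [0.7, 0.3, 0.3]])
dog = aggregate_synonyms([[0.2, 0.8, 0.3]])
labels = compete([cat, dog])
```

Because every pixel receives exactly one label from a normalized distribution, overlapping masks from independent prompts cannot both claim the same pixel.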

18. 【2604.19636】CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Link: https://arxiv.org/abs/2604.19636

Authors: Xiangyang Luo,Xiaozhe Xin,Tao Feng,Xu Guo,Meiguang Jin,Junfeng Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthesizing human, digital advertising, virtual marketing, broad practical, HOI

Comments: The project page: [this https URL](https://xinxiaozhe12345.github.io/CoInteract_Project/)

Click to view abstract

Abstract:Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.

19. 【2604.19632】CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

Link: https://arxiv.org/abs/2604.19632

Authors: Weidong Chen,Dexiang Hong,Zhendong Mao,Yutao Cheng,Xinyan Liu,Lei Zhang,Yongdong Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limiting downstream editing, produce rasterized outputs, explicit layer structures, graphic design parsing, models produce rasterized

Comments:

Click to view abstract

Abstract:Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.

20. 【2604.19631】MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation

Link: https://arxiv.org/abs/2604.19631

Authors: Xuejiao Wang,Bohao Zhang,Changbo Wang,Gaoqi He

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scene Graph Generation, Dynamic Scene Graph, Graph Generation, Scene Graph, Dynamic Scene

Comments:

Click to view abstract

Abstract:Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.

21. 【2604.19624】GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

Link: https://arxiv.org/abs/2604.19624

Authors: Pradyumna YM,Yuxuan Xue,Yue Chen,Nikita Kister,István Sárándi,Gerard Pons-Moll

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing physically plausible, Reconstructing physically, lack explicit interaction, optimization based methods

Comments: Project Page: [this https URL](https://pradyumnaym.github.io/graft)

Click to view abstract

Abstract:Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human-scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred 64.8% of the time in a three-way user study. Project page: this https URL.

Submission history: [v1] Tue, 21 Apr 2026 16:13:15 UTC (16,331 KB), submitted by Pradyumna Ym

22. 【2604.19623】SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets

Link: https://arxiv.org/abs/2604.19623

Authors: Inhyeok Choi,Hyuncheol Park

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: Edge-cloud hybrid inference, powerful remote model, uplink channel imposes, channel imposes hard, hybrid inference offloads

Comments: 11 pages, 9 figures

Click to view abstract

Abstract:Edge-cloud hybrid inference offloads difficult inputs to a powerful remote model, but the uplink channel imposes hard per-request constraints on the number of bits that can be transmitted. We show that selecting transmitted content based solely on attention-based importance, the standard approach in collaborative inference, is inherently limited under hard budgets. Two findings support this claim. First, replacing high-importance units with low-importance but complementary ones improves server accuracy. This shows that what matters is not individual importance but how well the transmitted set covers diverse aspects of the input. Second, spatially uniform selection without any content information achieves competitive accuracy at moderate budgets. This confirms that spatial coverage alone carries independent value. Based on this analysis, we propose SAGE (Semantic Attention-Guided Evidence), a principled, training-free method that combines importance filtering with embedding-diversity sampling. SAGE achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units on ImageNet-1K, substantially outperforming importance-only composition.
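
A small sketch of what "importance filtering combined with embedding-diversity sampling" could look like under a hard budget. This is not the authors' algorithm: the `keep_ratio` parameter, the greedy farthest-point rule, the Euclidean distance, and the function name `select_units` are all assumptions chosen to illustrate the idea that a covering set can be preferable to the individually most important units.

```python
import math


def select_units(importance, embeddings, budget, keep_ratio=0.5):
    """Pick at most `budget` unit indices to transmit.

    Step 1 (importance filtering): keep the top `keep_ratio` fraction of units.
    Step 2 (diversity sampling): greedy farthest-point selection among them.
    """
    n = len(importance)
    if budget <= 0 or n == 0:
        return []
    num_candidates = min(n, max(budget, math.ceil(keep_ratio * n)))
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    candidates = ranked[:num_candidates]

    # Seed with the most important unit, then repeatedly add the candidate
    # that is farthest from everything selected so far.
    selected = [candidates[0]]
    min_dist = {
        i: math.dist(embeddings[i], embeddings[candidates[0]])
        for i in candidates[1:]
    }
    while len(selected) < budget and min_dist:
        nxt = max(min_dist, key=min_dist.get)
        selected.append(nxt)
        del min_dist[nxt]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], math.dist(embeddings[i], embeddings[nxt]))
    return selected


# Units 0 and 1 are both important but nearly identical in embedding space.
importance = [0.9, 0.85, 0.8, 0.1]
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0]]
chosen = select_units(importance, embeddings, budget=2, keep_ratio=0.75)
```

With a budget of two, an importance-only rule would send units 0 and 1, which carry almost the same evidence. The sketch instead sends units 0 and 2, trading a little importance for coverage.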

23. 【2604.19609】Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

Link: https://arxiv.org/abs/2604.19609

Authors: Kadir Yilmaz,Adrian Kruse,Tristan Höfer,Daan de Geus,Bastian Leibe

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Comments: Project page: [this https URL](https://vision.rwth-aachen.de/Volt)

Click to view abstract

None

24. 【2604.19596】PC2Model: ISPRS benchmark on 3D point cloud to model registration

Link: https://arxiv.org/abs/2604.19596

Authors: Mehdi Maboudi,Said Harb,Jackson Ferrao,Kourosh Khoshelham,Yelda Turkan,Karam Mawas

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: registration involves aligning, cloud registration involves, Point cloud registration, Point cloud, point cloud acquisition

Comments: ISPRS Congress 2026, Toronto

Click to view abstract

Abstract:Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR). With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: this https URL.

25. 【2604.19591】Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

Link: https://arxiv.org/abs/2604.19591

Authors: Jienan Lyu,Miao Yang,Jinchen Cai,Yiwen Hu,Guanyi Lu,Junhao Qiu,Runmin Dong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fine-grained high-resolution remote, restricts cross-domain generalizability, high-resolution remote sensing, remote sensing mapping, sensing mapping typically

Comments:

Click to view abstract

Abstract:Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.

26. 【2604.19587】SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Link: https://arxiv.org/abs/2604.19587

Authors: Ying Zeng,Miaosen Luo,Guangyuan Li,Yang Yang,Ruiyang Fan,Linxiao Shi,Qirui Yang,Jian Zhang,Chengcheng Liu,Siming Zheng,Jinwei Chen,Bo Li,Peng-Tao Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional photographic image, typically requires users, possess sufficient aesthetic, editing typically requires, Traditional photographic

Comments: tech report

Click to view abstract

Abstract:Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies via the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: this https URL.

27. 【2604.19571】TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing

Link: https://arxiv.org/abs/2604.19571

Authors: Yanhui Chen,Jiahong Li,Jingchao Wang,Junyi Lin,Zixin Zeng,Yang Shi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modifying complex scenes, Gaussian Splatting, convenient approach, approach for modifying, modifying complex

Comments:

Click to view abstract

Abstract:Language-driven 3D Gaussian Splatting (3DGS) editing provides a more convenient approach for modifying complex scenes in VR/AR. Standard pipelines typically adopt a two-stage strategy: first editing multiple 2D views, and then optimizing the 3D representation to match these edited observations. Existing methods mainly improve view consistency through multi-view feature fusion, attention filtering, or iterative recalibration. However, they fail to explicitly address a more fundamental issue: the semantic correspondence between edited 2D evidence and 3D Gaussians. To tackle this problem, we propose TransSplat, which formulates language-driven 3DGS editing as a multi-view unbalanced semantic transport problem. Specifically, our method establishes correspondences between visible Gaussians and view-specific editing prototypes, thereby explicitly characterizing the semantic relationship between edited 2D evidence and 3D Gaussians. It further recovers a cross-view shared canonical 3D edit field to guide unified 3D appearance updates. In addition, we use transport residuals to suppress erroneous edits in non-target regions, mitigating edit leakage and improving local control precision. Qualitative and quantitative results show that, compared with existing 3D editing methods centered on enhancing view consistency, TransSplat achieves superior performance in local editing accuracy and structural consistency.

28. 【2604.19570】RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

Link: https://arxiv.org/abs/2604.19570

Authors: Ahmed Marouane Djouama,Abir Belaala,Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Cosimo Distante,Abdenour Hadid

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate medical image, precise boundary delineation, long-range contextual reasoning, Accurate medical, prohibitive inference latency

Comments:

Click to view abstract

Abstract:Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
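
To make "only a few discretization steps" concrete, here is a generic few-step Euler sampler for a learned velocity field. It is a sketch of how rectified-flow style models are commonly sampled, not RF-HiT's actual inference code. `velocity_fn` is a hypothetical stand-in for the trained network, and the toy field below is constant so that the example is self-contained.

```python
def euler_sample(x0, velocity_fn, num_steps=3):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    """Constant field pointing from the start point toward a fixed target.

    Stand-in for a trained model. With a perfectly straight trajectory the
    Euler result does not depend on the number of steps.
    """
    return [3.0, -6.0]


sample = euler_sample([0.0, 0.0], toy_velocity, num_steps=3)
```

The straighter the learned trajectories are, the less accuracy is lost when `num_steps` is small, which is why this family of models can trade many diffusion steps for a handful of Euler steps.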

29. 【2604.19564】EgoSelf: From Memory to Personalized Egocentric Assistant

Link: https://arxiv.org/abs/2604.19564

Authors: Yanshuo Wang,Yuan Xu,Xuesong Li,Jie Hong,Yizhou Wang,Chang Wen Chen,Wentao Zhu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: first-person view data, rely on first-person, first-person view, view data

Comments:

Click to view abstract

Abstract:Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from each individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at this https URL.

30. 【2604.19556】Paparazzo: Active Mapping of Moving 3D Objects

Link: https://arxiv.org/abs/2604.19556

Authors: Davide Allegro,Shiyao Li,Stefano Ghidoni,Vincent Lepetit

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: assume static environments, pipelines generally assume, generally assume static, mapping pipelines generally, reconstruct moving objects

Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Click to view abstract

Abstract:Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: this https URL

31. 【2604.19510】Evaluating Histogram Matching for Robust Deep learning-Based Grapevine Disease Detection

Link: https://arxiv.org/abs/2604.19510

Authors: Ruben Pascual,Inés Hernández,Salvador Gutiérrez,Javier Tardaguila,Pedro Melo-Pinto,Daniel Paternain,Mikel Galar

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: plant disease detection, primary factor limiting, factor limiting deep, limiting deep learning, field-based plant disease

Comments:

Click to view abstract

Abstract:Variability in illumination is a primary factor limiting deep learning robustness for field-based plant disease detection. This study evaluates Histogram Matching (HM), a technique that transforms the pixel intensity distribution of an image to match a reference profile, to mitigate this in grapevine classification, distinguishing among healthy leaves, downy mildew, and spider mite damage. We propose a dual-stage integration of HM: (i) as a preprocessing step for normalization, and (ii) as a data augmentation technique to introduce controlled training variability. Experiments using 1,469 RGB images (comprising homogeneous leaf-focused and heterogeneous canopy samples) to train ResNet-18 models demonstrate that this combination significantly enhances robustness on real-world canopy images. While leaf-focused samples showed marginal gains, the canopy subset improved markedly, indicating that balancing normalization with histogram-based diversification effectively bridges the domain gap caused by uncontrolled lighting.
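
Histogram matching itself is a classic operation and can be sketched in a few lines. The version below works on a flat, non-empty list of 8-bit grayscale values and maps each source level to the smallest reference level whose cumulative frequency is at least as large. How the paper applies it to RGB images (for example per channel) is not stated in the abstract, so that part is left out; the function names are illustrative.

```python
def cumulative_distribution(pixels, levels=256):
    """Normalized cumulative histogram of integer pixel values in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / total)
    return cdf


def match_histogram(source, reference, levels=256):
    """Remap `source` so its intensity distribution follows `reference`."""
    src_cdf = cumulative_distribution(source, levels)
    ref_cdf = cumulative_distribution(reference, levels)
    mapping = []
    r = 0
    # Both CDFs are non-decreasing, so one forward pass over r is enough.
    for s in range(levels):
        while r < levels - 1 and ref_cdf[r] < src_cdf[s]:
            r += 1
        mapping.append(r)
    return [mapping[p] for p in source]


# A dark two-level image takes on the intensity levels of a brighter reference.
matched = match_histogram([0, 0, 1, 1], [10, 10, 200, 200])
```

Used as preprocessing, every image is matched to one fixed reference. Used as augmentation, the reference would be varied during training so the model sees the same leaf under several plausible lighting profiles.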

32. 【2604.19489】Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram

Link: https://arxiv.org/abs/2604.19489

Authors: Michael Achmann-Denkler,Mario Haim,Christian Wolff

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: multimodal large language, computational case study, specialized machine learning, emerging multimodal large, Google Cloud Vision

Comments: An earlier version was presented at #SMSociety 2024 (London)

Click to view abstract

Abstract:This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
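
For readers unfamiliar with the metric quoted above, macro F1 is the unweighted mean of the per-class F1 scores, so rarely pictured politicians count as much as frequently pictured ones. A minimal reference implementation follows; the labels in the usage example are made up and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four images.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy case class "a" scores 2/3 and class "b" scores 4/5, giving a macro F1 of 11/15.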

33. 【2604.19480】Deep sprite-based image models: An analysis

Link: https://arxiv.org/abs/2604.19480

Authors: Zeynep Sonat Baltacı, Romain Loiseau, Mathieu Aubry

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: drive steady progress, diffusion algorithms compose, seemingly simple problem, identifying recurrent patterns, foundation models drive

Comments:

Abstract:While foundation models drive steady progress in image segmentation and diffusion algorithms compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, explicitly identifies object categories, and fully models images in an easily interpretable way.

34. 【2604.19473】TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

Link: https://arxiv.org/abs/2604.19473

Authors: Hongyu Zhang, Yufan Deng, Zilin Pan, Peng-Tao Jiang, Bo Li, Qibin Hou, Zhiyang Dou, Zhen Dong, Daquan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating high-quality videos, Generating high-quality, key unsolved problem, multiple sequential actions, complex temporal descriptions

Comments: ICLR 2026, code available at: [this https URL](https://github.com/Hong-yu-Zhang/TS-Attn)

Abstract:Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at this https URL.

35. 【2604.19445】LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results

Link: https://arxiv.org/abs/2604.19445

Authors: Xiang Chen, Hao Li, Jiangxin Dong, Jinshan Pan, Xin Li, Xin He, Naiwei Chen, Shengyuan Li, Fengning Liu, Haoyi Lv, Haowei Peng, Yilian Zhong, Yuxiang Chen, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Kaibin Chen, Xu Zhang, Xuhui Cao, Jiaqi Ma, Ziqi Wang, Shengkai Hu, Yuning Cui, Huan Zhang, Shi Chen, Bin Ren, Lefei Zhang, Guanglu Dong, Qiyao Zhao, Tianheng Zheng, Chunlei Li, Lichao Mou, Chao Ren, Wangzhi Xing, Xin Lu, Enxuan Gu, Jingxi Zhang, Diqi Chen, Qiaosi Yi, Bingcai Wei, Mingyu Liu, Pengyu Wang, Ce Liu, Miaoxin Guan, Boyu Chen, Hongyu Li, Jian Zhu, Xinrui Luo, Ziyang He, Jiayu Wang, Yichen Xiang, Huayi Qi, Haoyu Bian, Yiran Li, Sunlichen Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Restoration, paper presents, presents a review, LoViF Challenge, Restoration

Comments: CVPR Workshops 2026; [this https URL](https://lowlevelcv.com/)

Abstract:This paper presents a review for the LoViF Challenge on Real-World All-in-One Image Restoration. The challenge aimed to advance research on real-world all-in-one image restoration under diverse real-world degradation conditions, including blur, low-light, haze, rain, and snow. It provided a unified benchmark to evaluate the robustness and generalization ability of restoration models across multiple degradation categories within a common framework. The competition attracted 124 registered participants and received 9 valid final submissions with corresponding fact sheets, significantly contributing to the progress of real-world all-in-one image restoration. This report provides a detailed analysis of the submitted methods and corresponding results, emphasizing recent progress in unified real-world image restoration. The analysis highlights effective approaches and establishes a benchmark for future research in real-world low-level vision.

36. 【2604.19432】DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

Link: https://arxiv.org/abs/2604.19432

Authors: Xinwei He, Yansong Zheng, Qianru Han, Zhichuan Wang, Yuxuan Cai, Yang Zhou, Jingbo Xia, Yulong Wang, Jinhai Xiang, Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision foundation models, shown great promise, Vision foundation, object retrieval, DINO Eats CLIP

Comments: Accepted to CVPR 2026

Abstract:Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder, DINO. To address this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat this, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
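
The mean-pooling baseline mentioned in this abstract can be pictured as follows: per-view feature vectors are averaged into one object descriptor, and retrieval ranks gallery objects by cosine similarity. This is a toy sketch with made-up 2-D features and hypothetical object names; a real system would take the view features from a frozen image backbone, and the paper's CAM and VFS modules are not modelled here.

```python
import math


def mean_pool(view_features):
    """Average a list of per-view feature vectors into one descriptor."""
    n = len(view_features)
    return [sum(column) / n for column in zip(*view_features)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_views, gallery):
    """Return the gallery key whose pooled descriptor is closest to the query."""
    query = mean_pool(query_views)
    return max(gallery, key=lambda name: cosine(query, mean_pool(gallery[name])))


# Hypothetical 2-D view features for two gallery objects and one query.
gallery = {
    "chair": [[1.0, 0.0], [0.8, 0.2]],
    "lamp": [[0.0, 1.0], [0.1, 0.9]],
}
best = retrieve([[1.0, 0.1], [0.9, 0.0]], gallery)
print(best)
```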

37. 【2604.19420】TESO: Online Tracking of Essential Matrix by Stochastic Optimization

Link: https://arxiv.org/abs/2604.19420

Authors: Jaroslav Moravec, Radim Šára, Akihiro Sugimoto

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Maintaining long-term accuracy, Maintaining long-term, autonomous systems' perception, online stochastic optimization, TESO

Comments: Accepted at CVPR 2026 (Oral)

Abstract:Maintaining long-term accuracy of stereo camera calibration parameters is important for autonomous systems' perception. This work proposes Online Tracking of Essential Matrix by Stochastic Optimization (TESO). The core mechanisms of TESO are: 1) a robust loss function based on kernel correlation over tentative correspondences, 2) an adaptive online stochastic optimization on the essential manifold. TESO has low CPU and memory requirements, relies on a few hyperparameters, and eliminates the need for data-driven training, enabling its use in resource-constrained online perception systems. We evaluated the influence of TESO on geometric precision, rectification quality, and stereo depth consistency. On the large-scale MAN TruckScenes dataset, TESO tracks rotational calibration drift with 0.12 deg precision in the Y-axis (critical for stereo accuracy) while the X- and Z-axes are five times more precise. Tracking applied to sequences with simulated drift shows similar precision with respect to the reference as tracking applied to no-drift sequences, indicating the tracker is unbiased. On the KITTI dataset, TESO revealed systematic inconsistencies in extrinsic parameters across stereo pairs, confirming previously published findings. We verified that intrinsic decalibration affected these errors, as evidenced by the conflicting behavior of the rectification and depth metrics. After correcting the reference calibration, TESO improved its rotation precision around the Y-axis 20 times to 0.025 deg and its depth accuracy 50 times. Despite its lightweight design, direct optimization of the proposed TESO loss function alone achieves accuracy comparable to that of neural network-based single-frame methods.
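
For readers unfamiliar with the quantity being tracked, the sketch below builds an essential matrix E = [t]x R from a relative pose and evaluates the epipolar residual x2ᵀ E x1 for one correspondence. It only illustrates why a stale calibration becomes detectable after rotational drift; it is not the TESO loss or optimizer, and the pose convention X2 = R X1 + t, the 0.5 degree drift, and the sample point are assumptions chosen for the demo.

```python
import math


def skew(t):
    """Cross-product matrix [t]x of a 3-vector."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_x(angle):
    """Rotation about the X-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def essential(rotation, translation):
    """E = [t]x R under the convention X2 = R X1 + t."""
    return matmul(skew(translation), rotation)


def project(point):
    """Normalized image coordinates of a 3D point."""
    x, y, z = point
    return [x / z, y / z, 1.0]


def epipolar_residual(e, x1, x2):
    """x2^T E x1, which is zero for a perfect correspondence and the correct E."""
    return sum(a * b for a, b in zip(x2, matvec(e, x1)))


baseline = [1.0, 0.0, 0.0]       # hypothetical horizontal stereo baseline
point_cam1 = [0.5, 0.2, 4.0]     # hypothetical 3D point in the first camera frame
identity = rot_x(0.0)            # calibration assumed at setup time
drifted = rot_x(math.radians(0.5))  # actual rotation after drift

moved = [r + t for r, t in zip(matvec(drifted, point_cam1), baseline)]
x1, x2 = project(point_cam1), project(moved)

stale_error = abs(epipolar_residual(essential(identity, baseline), x1, x2))
true_error = abs(epipolar_residual(essential(drifted, baseline), x1, x2))
print(stale_error, true_error)
```

An online tracker would adjust the rotation (and baseline direction) so that residuals like `stale_error` shrink across many correspondences, whereas this demo simply compares two fixed candidates.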

38. 【2604.19412】VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing

Link: https://arxiv.org/abs/2604.19412

Authors: Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou, Hongfei Wang, Zhaofan Zou, Hao Sun, Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large vision-language models, Large vision-language, frequently suffer, input image, Large

Comments: ICASSP 2026

39. 【2604.19411】GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

Link: https://arxiv.org/abs/2604.19411

Authors: Joshua Niemeijer, Alaa Eddine Ben Zekri, Reza Bahmanyar, Philipp M. Schmälzle, Houda Chaabouni-Chouayakh, Franz Kurz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Understanding road scenes, Understanding road, geometrically consistent, scene-centric representation, planning and mapping

Comments:

Abstract:Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird's-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.

40. 【2604.19406】HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Link: https://arxiv.org/abs/2604.19406

Authors: Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu, Jiaxiu Jiang, Xinran Qin, Zhikai Chen, Fenglong Song, Zhixin Wang, Renjing Pei, Wangmeng Zuo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: typically adopt powerful, adopt powerful generative, powerful generative diffusion, tasks typically adopt, generative diffusion models

Comments: Accepted by CVPR2026

Abstract:Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer--an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

41. 【2604.19403】VecHeart: Holistic Four-Chamber Cardiac Anatomy Modeling via Hybrid VecSets

Link: https://arxiv.org/abs/2604.19403

Authors: Yihong Chen, Pascal Fua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate cardiac anatomy, handle intricate interrelations, anatomy modeling requires, Accurate cardiac, cardiac anatomy modeling

Comments:

Abstract:Accurate cardiac anatomy modeling requires the model to be able to handle intricate interrelations among structures. In this paper, we propose VecHeart, a unified framework for holistic reconstruction and generation of four-chamber cardiac structures. To overcome the limitations of current feed-forward implicit methods, specifically their restriction to single-object modeling and their neglect of inter-part correlations, we introduce Hybrid Part Transformer, which leverages part-specific learnable queries and interleaved attention to capture complex inter-chamber dependencies. Furthermore, we propose Anatomical Completion Masking and Modality Alignment strategies, enabling the model to infer complete four-chamber structures from partial, sparse, or noisy observations, even when certain anatomical parts are entirely missing. VecHeart also seamlessly extends to 3D+t dynamic mesh sequence generation, demonstrating exceptional versatility. Experiments show that our method achieves state-of-the-art performance, maintaining high-fidelity reconstruction across diverse challenging scenarios. Code will be released.

42. 【2604.19392】HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition

Link: https://arxiv.org/abs/2604.19392

Authors: Xiaoqi Zhuang, Jefersson A. Dos Santos, Jungong Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing applications, data augmentation, urban planning, plays a critical, critical role

Comments: 8 pages, 6 figures, CVPR 2026 findings. Code is available at [this https URL](https://github.com/XiaoqiZhuang/HarmoniDiff-RS)

Abstract:Satellite image composition plays a critical role in remote sensing applications such as data augmentation, disaster simulation, and urban planning. We propose HarmoniDiff-RS, a training-free diffusion-based framework for harmonizing composite satellite images under diverse domain conditions. Our method aligns the source and target domains through a Latent Mean Shift operation that transfers radiometric characteristics between them. To balance harmonization and content preservation, we introduce a Timestep-wise Latent Fusion strategy by leveraging early inverted latents for high harmonization and late latents for semantic consistency to generate a set of composite candidates. A lightweight harmony classifier is trained to further automatically select the most coherent result among them. We also construct RSIC-H, a benchmark dataset for satellite image harmonization derived from fMoW, providing 500 paired composition samples. Experiments demonstrate that our method effectively performs satellite image composition, showing strong potential for scalable remote-sensing synthesis and simulation tasks. Code is available at: this https URL.

43. 【2604.19386】Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

Link: https://arxiv.org/abs/2604.19386

Authors: Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen, Zixu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Composed Image Retrieval, Noisy Triplet Correspondence, Composed Image, Image Retrieval, Triplet Correspondence

Comments:

Abstract:Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the "small loss hypothesis", but the unique semantic ambiguity in NTC, such as "partial matching", invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic "representation pollution". To address this critical challenge, we propose a novel "Expert-Proxy-Diversion" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy "arbiter" to internalize the expert's discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.

44. 【2604.19379】PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving

Link: https://arxiv.org/abs/2604.19379

Authors: Yining Pan, Shijie Li, Yuchen Wu, Xulei Yang, Na Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unsupervised Domain Adaptation, real-world autonomous driving, study on Unsupervised, shifts commonly encountered, Domain Adaptation

Comments: Accepted at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. In extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.
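
The fragmentation problem attributed to conventional pseudo-labeling can be seen in a few lines: thresholding per-point confidence discards part of an object, so the retained mask covers only a fraction of it. The threshold, the ignore value, and the toy confidences below are assumptions for illustration; the paper's dual-expert refinement is not shown.

```python
IGNORE = -1  # hypothetical marker for points excluded from supervision


def keep_confident(labels, confidences, threshold=0.9):
    """Conventional pseudo-labeling: keep a label only where confidence is high."""
    return [lab if conf >= threshold else IGNORE for lab, conf in zip(labels, confidences)]


def coverage(filtered, label):
    """Fraction of points that still carry `label` after filtering."""
    return sum(lab == label for lab in filtered) / len(filtered)


# Six points that all belong to one car (label 1), with uneven confidence.
predicted = [1, 1, 1, 1, 1, 1]
confidence = [0.95, 0.60, 0.92, 0.50, 0.97, 0.70]
filtered = keep_confident(predicted, confidence)
print(filtered, coverage(filtered, 1))
```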

45. 【2604.19369】IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging

Link: https://arxiv.org/abs/2604.19369

Authors: Philipp Weigand, Niels Nawrot, Nikolas Ebert, Carsten Hopf, Oliver Wasenmüller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mass Spectrometry Imaging, Spectrometry Imaging, Mass Spectrometry, fundamental preprocessing step, step in Mass

Comments: This paper has been accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026

Abstract:Peak picking is a fundamental preprocessing step in Mass Spectrometry Imaging (MSI), where each sample is represented by hundreds to thousands of ion images. Existing approaches require careful dataset-specific hyperparameter tuning, and often fail to generalize across acquisition protocols. We introduce IonMorphNet, a spatial-structure-aware representation model for ion images that enables fully data-driven peak picking without any task-specific supervision. We curate 53 publicly available MSI datasets and define six structural classes capturing representative spatial patterns in ion images to train standard image backbones for structural pattern classification. Once trained, IonMorphNet can assess ion images and perform peak picking without additional hyperparameter tuning. Using a ConvNeXt V2-Tiny backbone, our approach improves peak picking performance by +7 % mSCF1 compared to state-of-the-art methods across multiple datasets. Beyond peak picking, we demonstrate that spatially informed channel reduction enables a 3D CNN for patch-based tumor classification in MSI. This approach matches or exceeds pixel-wise spectral classifiers by up to +7.3 % Balanced Accuracy on three tumor classification tasks, indicating meaningful ion image selection. The source code and model weights are available at this https URL.

46. 【2604.19368】Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving

Link: https://arxiv.org/abs/2604.19368

Authors: Ghadah Alosaimi, Hanadi Alhamdan, Wenke E, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Predicting driver intention, neurophysiological signals offers, enhancing proactive safety, driver assistance systems, EEG signal non-stationarity

Comments: 8 pages, 4 figures, 6 tables, conference

Abstract:Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: this https URL.

47. 【2604.19365】Detection of T-shirt Presentation Attacks in Face Recognition Systems

Link: https://arxiv.org/abs/2604.19365

Authors: Mathias Ibsen, Loris Tim Ide, Christian Rathgeb, Christoph Busch

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Face recognition systems, recognition systems, Face recognition, T-shirt attacks, biometric authentication

Comments:

Abstract:Face recognition systems are often used for biometric authentication. Nevertheless, it is known that without any protective measures, face recognition systems are vulnerable to presentation attacks. To tackle this security problem, methods for detecting presentation attacks have been developed and shown good detection performance on several benchmark datasets. However, generalising presentation attack detection methods to new and novel types of attacks is an ongoing challenge. In this work, we employ 1,608 T-shirt attacks of the T-shirt Face Presentation Attack (TFPA) database using 100 unique presentation attack instruments together with 152 bona fide presentations. In a comprehensive evaluation, we show that this type of attack can compromise the security of face recognition systems. Furthermore, we propose a detection method based on spatial consistency checks in order to detect said T-shirt attacks. Precisely, state-of-the-art face and person detectors are combined to analyse the spatial positions of detected faces and persons based on which T-shirt attacks can be reliably detected.
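
One way to picture a spatial consistency check between face and person detections is sketched below: a genuine face should sit near the top of its person box, while a face printed on a T-shirt lands in the torso region. The rule, the `head_fraction` threshold, and the box values are hypothetical and chosen for illustration; the abstract does not specify the authors' actual criteria.

```python
def relative_face_height(face, person):
    """Vertical position of the face centre inside a person box (0 = top, 1 = bottom).

    Boxes are (x_min, y_min, x_max, y_max) in pixels. Returns None when the
    face centre lies outside the person box.
    """
    cx = (face[0] + face[2]) / 2
    cy = (face[1] + face[3]) / 2
    x_min, y_min, x_max, y_max = person
    if not (x_min <= cx <= x_max and y_min <= cy <= y_max):
        return None
    return (cy - y_min) / (y_max - y_min)


def is_suspicious(face, persons, head_fraction=0.25):
    """Flag a face that lies below the head region of every person box containing it."""
    heights = [relative_face_height(face, p) for p in persons]
    heights = [h for h in heights if h is not None]
    if not heights:
        return False  # no person box to compare against, so this rule cannot decide
    return min(heights) > head_fraction


# Hypothetical detections from a face detector and a person detector.
person = (100, 50, 300, 650)
real_face = (160, 60, 240, 150)
shirt_face = (150, 250, 250, 370)
print(is_suspicious(real_face, [person]), is_suspicious(shirt_face, [person]))
```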

48. 【2604.19350】Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms

Link: https://arxiv.org/abs/2604.19350

Authors: Samyak Sanghvi, Piyush Miglani, Sarvesh Shashikumar, Kaustubh R Borgavi, Veenu Singla, Chetan Arora

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision tasks, diagnostics remains limited, Vision Transformers, computer-aided diagnostics remains, vision tasks

Comments:

Abstract:Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: this https URL

49. 【2604.19349】RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow

Link: https://arxiv.org/abs/2604.19349

Authors: Xunpei Sun, Zuoxun Hou, Yi Chang, Gang Chen, Wei-Shi Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular scene flow, flow estimation aims, scene flow estimation, restricting temporal modeling, Monocular scene

Comments: This work has been submitted to the IEEE for possible publication

Abstract:Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at this https URL.

50. 【2604.19345】Geometry-Guided Self-Supervision for Ultra-Fine-Grained Recognition with Limited Data

Link: https://arxiv.org/abs/2604.19345

Authors: Shijie Wang, Yadan Luo, Zijian Wang, Haojie Li, Zi Huang, Mahsa Baktashmotlagh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Attribute Exploration Network, intrinsic geometrical features, general self-supervised framework, self-supervised framework called, Geometric Attribute Exploration

Comments:

Abstract:This paper investigates the intrinsic geometrical features of highly similar objects and introduces a general self-supervised framework called the Geometric Attribute Exploration Network (GAEor), which is designed to address the ultra-fine-grained visual categorization (Ultra-FGVC) task in data-limited scenarios. Unlike prior work that often captures subtle yet critical distinctions, GAEor generates geometric attributes as novel alternative recognition cues. These attributes are determined by various details within the object, aligned with its geometric patterns, such as the intricate vein structures in soybean leaves. Crucially, each category exhibits distinct geometric descriptors that serve as powerful cues, even among objects with minimal visual variation -- a factor largely overlooked in recent research. GAEor discovers these geometric attributes by first amplifying geometry-relevant details via visual feedback from a backbone network, then embedding the relative polar coordinates of these details into the final representation. Extensive experiments demonstrate that GAEor sets new state-of-the-art records on five widely used Ultra-FGVC benchmarks.

51. 【2604.19339】Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data

Link: https://arxiv.org/abs/2604.19339

Authors: Shijie Wang, Zijian Wang, Yadan Luo, Haojie Li, Zi Huang, Mahsa Baktashmotlagh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly similar subcategories, classify highly similar, holistic cues, limited training samples, visual categorization

Comments:

Abstract:Ultra-fine-grained visual categorization (Ultra-FGVC) aims to classify highly similar subcategories within fine-grained objects using limited training samples. However, holistic yet discriminative cues, such as leaf contours in extremely similar cultivars, remain under-explored in current studies, thereby limiting recognition performance. Though crucial, modeling holistic cues with complex morphological structures typically requires massive training samples, posing significant challenges in data-limited scenarios. To address this challenge, we propose a novel Divide-and-Conquer Holistic Cognition Network (DHCNet) that implements a divide-and-conquer strategy by decomposing holistic cues into spatially-associated subtle discrepancies and progressively establishing the holistic cognition process, significantly simplifying holistic cognition while reducing dependency on training data. Technically, DHCNet begins by progressively analyzing subtle discrepancies, transitioning from smaller local patches to larger ones using a self-shuffling operation on local regions. Simultaneously, it leverages the unaffected local regions to potentially guide the perception of the original topological structure among the shuffled patches, thereby aiding in the establishment of spatial associations for these discrepancies. Additionally, DHCNet incorporates the online refinement of these holistic cues discovered from local regions into the training process to iteratively improve their quality. As a result, DHCNet uses these holistic cues as supervisory signals to fine-tune the parameters of the recognition model, thus improving its sensitivity to holistic cues across the entire objects. Extensive evaluations demonstrate that DHCNet achieves remarkable performance on five widely-used Ultra-FGVC datasets.

52. 【2604.19334】Silicon Aware Neural Networks

链接：https://arxiv.org/abs/2604.19334

作者：Sebastian Fieldhouse,Kea-Tiong Tang

类目：Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词：Logic Gate Networks, machine learning literature, GPU and FPGA, Differentiable Logic Gate, discrete logic gate

备注：

点击查看摘要

Abstract:Recent work in the machine learning literature has demonstrated that deep learning can train neural networks made of discrete logic gate functions to perform simple image classification tasks at very high speeds on CPU, GPU and FPGA platforms. By virtue of being formed by discrete logic gates, these Differentiable Logic Gate Networks (DLGNs) lend themselves naturally to implementation in custom silicon - in this work we present a method to map DLGNs in a one-to-one fashion to a digital CMOS standard cell library by converting the trained model to a gate-level netlist. We also propose a novel loss function whereby the DLGN can optimize the area, and indirectly power consumption, of the resulting circuit by minimizing the expected area per neuron based on the area of the standard cells in the target standard cell library. Finally, we also show for the first time an implementation of a DLGN as a silicon circuit in simulation, performing layout of a DLGN in the SkyWater 130nm process as a custom hard macro using a Cadence standard cell library and performing post-layout power analysis. We find that our custom macro can perform classification on MNIST with 97% accuracy 41.8 million times a second at a power consumption of 83.88 mW.
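
The area-aware loss described above can be illustrated with a small sketch. It assumes each neuron holds one logit per candidate gate and that the penalty is the softmax-weighted mean of the standard-cell areas. The gate set, the area values, and the function names are hypothetical, and the paper's actual loss may differ in form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_area(gate_logits, cell_areas):
    """Expected cell area of one neuron: sum_i p_i * area_i."""
    if len(gate_logits) != len(cell_areas):
        raise ValueError("need one area per candidate gate")
    probs = softmax(gate_logits)
    return sum(p * a for p, a in zip(probs, cell_areas))

def area_penalty(network_logits, cell_areas):
    """Mean expected area over all neurons (an assumed aggregation)."""
    per_neuron = [expected_area(l, cell_areas) for l in network_logits]
    return sum(per_neuron) / len(per_neuron)

# Hypothetical areas (arbitrary units) for three candidate gates.
AREAS = [2.0, 4.0, 0.0]  # e.g. AND, XOR, pass-through wire
print(area_penalty([[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]], AREAS))
```

Adding such a term to the classification loss would push each neuron's gate distribution toward cheaper cells, which is the trade-off the abstract describes.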

53. 【2604.19324】PLaMo 2.1-VL Technical Report

链接：https://arxiv.org/abs/2604.19324

作者：Tommi Kerola,Yuya Masuda,Takashi Masuko,Toshiki Nakanishi,Daisuke Nishino,Kuniyuki Takahashi,Hanqin Wang,Yoshihiro Yamada

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：lightweight Vision Language, Vision Language Model, Vision Language, Visual Question Answering, Japanese-language operation

备注： 35 pages, 9 figures

点击查看摘要

Abstract:We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.

54. 【2604.19323】Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset

链接：https://arxiv.org/abs/2604.19323

作者：Gonzalo Nápoles,Isel Grau,Yamisleydi Salgueiro

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Concept Bottleneck Models, Bottleneck Models, grounded concept layer, clinically grounded concept, route predictions exclusively

备注：

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.
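
The notion of an inconsistent concept profile can be made concrete with a short sketch. It groups samples by concept profile, flags profiles that map to more than one label, and computes the best accuracy any deterministic profile-to-label mapping could reach (majority label per profile). This is one natural way to obtain such a ceiling and is an assumption here, not necessarily the paper's exact procedure. The toy data and function name are hypothetical.

```python
from collections import Counter, defaultdict

def analyze_consistency(samples):
    """samples: iterable of (concept_profile, label) pairs."""
    by_profile = defaultdict(Counter)
    for profile, label in samples:
        by_profile[tuple(profile)][label] += 1
    total = sum(sum(c.values()) for c in by_profile.values())
    if total == 0:
        raise ValueError("no samples given")
    # A profile is inconsistent when it appears with conflicting labels.
    inconsistent = {p: c for p, c in by_profile.items() if len(c) > 1}
    boundary_images = sum(sum(c.values()) for c in inconsistent.values())
    # Best case: always predict the majority label of each profile.
    best_correct = sum(max(c.values()) for c in by_profile.values())
    return {
        "n_profiles": len(by_profile),
        "n_inconsistent_profiles": len(inconsistent),
        "n_boundary_images": boundary_images,
        "accuracy_ceiling": best_correct / total,
    }

# Toy data: profile (1, 0) appears with two different diagnoses.
toy = [((1, 0), "mel"), ((1, 0), "mel"), ((1, 0), "nev"),
       ((0, 1), "nev"), ((0, 0), "nev")]
print(analyze_consistency(toy))
```

On the toy data, one of three profiles is inconsistent, it covers three of the five images, and the ceiling is 0.8.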

55. 【2604.19321】RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

链接：https://arxiv.org/abs/2604.19321

作者：Yusuf Çelebi,Yağız Asker,Özay Ezerceli,Mahmoud ElHussieni,Selva Taş,Reyhan Bayraktar,Fatma Betül Terzioğlu

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：Fine-tuning Large Language, Large Language Models, Large Language, remains structurally uncertain, Fine-tuning Large

备注：

点击查看摘要

56. 【2604.19318】Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

链接：https://arxiv.org/abs/2604.19318

作者：Qi Zhang,Jixuan Chen,Kaiyi Zhang,Xinquan Yu,Antoni B. Chan,Hui Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multi-view crowd tracking, person tracking trajectories, crowd tracking estimates, Multi-view crowd, Transformer-based multi-view crowd

备注： CVPR 2026

点击查看摘要

Abstract:Multi-view crowd tracking estimates each person's trajectory on the ground plane of the scene. Recent work mainly relies on CNN-based multi-view crowd tracking architectures, most of which are evaluated and compared on relatively small datasets such as Wildtrack and MultiviewX. Because these two datasets are collected in small scenes and contain only tens of frames in the evaluation stage, it is difficult to apply current methods to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover a much larger scene over a longer time period. Compared with existing methods on the two new large datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: this https URL.

57. 【2604.19314】Framelet-Based Blind Image Restoration with Minimax Concave Regularization

链接：https://arxiv.org/abs/2604.19314

作者：Heng Zhang,Reza Parvaz,Rui Yang

类目：Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

关键词：Recovering corrupted images, Recovering corrupted, Recovering, ell, image processing

备注：

点击查看摘要

Abstract:Recovering corrupted images is one of the most challenging problems in image processing. Among various restoration tasks, blind image deblurring has been extensively studied due to its practical importance and inherent difficulty. In this problem, both the point spread function (PSF) and the underlying latent sharp image must be estimated simultaneously. This problem cannot be solved directly due to its ill-posed nature. One powerful tool for solving such problems is total variation (TV) regularization. The $\ell_0$-norm regularization within the TV framework has been widely adopted to promote sparsity in image gradients or transform domains, leading to improved preservation of edges and fine structures. However, the use of the $\ell_0$-norm results in a highly nonconvex and computationally intractable optimization problem, which limits its practical applicability. To overcome these difficulties, we employ the minimax concave penalty (MCP), which promotes enhanced sparsity and provides a closer approximation to the $\ell_0$-norm. In addition, a reweighted $\ell_1$-norm regularization is incorporated to further reduce estimation bias and improve the preservation of fine image details and textures. After introducing the proposed model, a numerical algorithm is developed to solve the resulting optimization problem. The effectiveness of the proposed approach is then demonstrated through experimental evaluations on several test images.
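
For readers unfamiliar with the minimax concave penalty, the sketch below uses its standard scalar definition together with the matching proximal map (firm thresholding). The parameter names `lam` and `gamma` are assumptions. The abstract does not state the paper's parameterization or solver, so this is only the textbook form.

```python
import math

def mcp(t, lam, gamma):
    """Standard MCP: lam*|t| - t^2/(2*gamma) up to |t| = gamma*lam, then constant."""
    if lam <= 0 or gamma <= 0:
        raise ValueError("lam and gamma must be positive")
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

def firm_threshold(z, lam, gamma):
    """Minimizer of 0.5*(x - z)^2 + mcp(x, lam, gamma); requires gamma > 1."""
    if lam <= 0 or gamma <= 1:
        raise ValueError("need lam > 0 and gamma > 1")
    a = abs(z)
    if a <= lam:
        return 0.0                      # small inputs are set to zero
    if a <= gamma * lam:
        return math.copysign((a - lam) / (1.0 - 1.0 / gamma), z)
    return float(z)                     # large inputs pass through unbiased

for z in (0.5, 1.5, 3.0):
    print(z, firm_threshold(z, lam=1.0, gamma=2.0))
```

Unlike soft thresholding, which is the proximal map of the $\ell_1$-norm, large coefficients are left unshrunk, which is the reduced estimation bias the abstract refers to.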

58. 【2604.19264】DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

链接：https://arxiv.org/abs/2604.19264

作者：Shengqin Wang,Wentao Yan,Huichi Zhou,Yihang Chen,Kun Shao,Zhizhong Zhang,Yuan Xie

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Agentic multimodal models, tackle complex tasks, garnered significant attention, leverage external tools, Agentic multimodal

备注：

点击查看摘要

Abstract:Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer premature interaction collapse, for two primary reasons: 1) the terminal reward, often appended to the last token, prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent, a framework that leverages structural proximity to derive advantage signals from whole rollout trajectories across an entire batch, so that trajectories of different lengths are encouraged even when they contain the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we have constructed a multi-step deep-reasoning dataset of 3602 high-quality QA pairs, each with at least 3 reasoning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.

59. 【2604.19259】Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection

链接：https://arxiv.org/abs/2604.19259

作者：Yuanchan Xu,Wenjun Zang,Ying Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：industrial quality inspection, approaches typically suffer, Multi-class defect detection, inter-class feature perturbation, degraded robustness caused

备注：

点击查看摘要

Abstract:Multi-class defect detection constitutes a critical yet challenging task in industrial quality inspection, where existing approaches typically suffer from two fundamental limitations: (i) the necessity of training separate models for each defect category, resulting in substantial computational and memory overhead, and (ii) degraded robustness caused by inter-class feature perturbation when heterogeneous defect categories are jointly modeled. In this paper, we present FPFNet, a Feature Perturbation Pool-based Fusion Network that synergistically integrates a stochastic feature perturbation pool with a multi-layer feature fusion strategy to address these challenges within a unified detection framework. The feature perturbation pool enriches the training distribution by randomly injecting diverse noise patterns -- including Gaussian noise, F-Noise, and F-Drop -- into the extracted feature representations, thereby strengthening the model's robustness against domain shifts and unseen defect morphologies. Concurrently, the multi-layer feature fusion module aggregates hierarchical feature representations from both the encoder and decoder through residual connections and normalization, enabling the network to capture complex cross-scale relationships while preserving fine-grained spatial details essential for precise defect localization. Built upon the UniAD architecture (You et al., 2022), our method achieves state-of-the-art performance on two widely adopted benchmarks: 97.17% image-level AUROC and 96.93% pixel-level AUROC on MVTec-AD, and 91.08% image-level AUROC and 99.08% pixel-level AUROC on VisA, surpassing existing methods by notable margins while introducing no additional learnable parameters or computational complexity.

60. 【2604.19257】Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

链接：https://arxiv.org/abs/2604.19257

作者：Hongyuan Liu,Bochao Zou,Qiankun Liu,Haochen Yu,Qi Mei,Jianfei Jiang,Chen Liu,Cheng Bi,Zhao Wang,Xueyang Zhang,Yifei Zhan,Jiansheng Chen,Huimin Ma

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：autonomous driving research, virtual environment construction, crucial for autonomous, research and virtual, driving

备注： Accepted by CVPR 2026

点击查看摘要

Abstract:Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.

61. 【2604.19238】Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows

链接：https://arxiv.org/abs/2604.19238

作者：Zihan Wang,Xudong Huang,Junbo Qiao,Wei Li,Jie Hu,Xinghao Chen,Shaohui Lin

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Real-world image super-resolution, powerful generative priors, image super-resolution, revolutionized by leveraging, leveraging the powerful

备注：

点击查看摘要

Abstract:Real-world image super-resolution (Real-SR) has been revolutionized by leveraging the powerful generative priors of large-scale diffusion and flow-based models. However, fine-tuning these models on limited LR-HR pairs often precipitates "prior collapse", in which the model sacrifices its inherent generative richness to overfit specific training degradations. This issue is further exacerbated in one-step generation, where the absence of multi-step refinement leads to significant trajectory drift and artifact generation. In this paper, we propose Allo{SR}$^2$, a novel framework that rectifies one-step SR trajectories via allomorphic generative flows to maintain high-fidelity generative realism. Specifically, we utilize Signal-to-Noise Ratio (SNR) Guided Trajectory Initialization to establish a physically grounded starting state by aligning the degradation level of LR latent features with the optimal anchoring timestep of the pre-trained flow. To ensure a stable, curvature-free path for one-step inference, we propose Flow-Anchored Trajectory Consistency (FATC), which enforces velocity-level supervision across intermediate states. Furthermore, we develop Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Extensive experiments on both synthetic and real-world benchmarks demonstrate that Allo{SR}$^2$ achieves state-of-the-art performance in one-step Real-SR, offering a superior balance between restoration fidelity and generative realism while maintaining extreme efficiency.

62. 【2604.19234】Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

链接：https://arxiv.org/abs/2604.19234

作者：Rui Li,Ke Hao,Yuanzhi Liang,Haibin Huang,Chi Zhang,YunGu,XueLong Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Group Relative Policy, Reinforcement learning, Relative Policy Optimization, Policy Optimization, post-training visual generative

备注：

点击查看摘要

Abstract:Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.
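
To make the credit-assignment issue concrete, the sketch below first shows the baseline the abstract criticizes: several rewards collapsed into one scalar, normalized within a group, and applied uniformly to every denoising step. It then shows one simple way a per-step weighting could redistribute that signal. The weighting rule and all names are illustrative assumptions, not OTCA's actual estimators.

```python
import statistics

def combine_rewards(reward_vector, objective_weights):
    """Collapse several objective rewards into one scalar (static weights)."""
    return sum(w * r for w, r in zip(objective_weights, reward_vector))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_step_credit(advantage, step_weights):
    """Spread one advantage over steps; uniform weights recover the baseline."""
    total = sum(step_weights)
    n = len(step_weights)
    return [advantage * n * w / total for w in step_weights]

# Hypothetical group of three samples, two objectives each.
group = [[0.2, 0.9], [0.6, 0.5], [0.8, 0.7]]
scalars = [combine_rewards(r, [0.5, 0.5]) for r in group]
advs = group_relative_advantages(scalars)
print(per_step_credit(advs[2], [1.0, 1.0, 1.0, 1.0]))  # uniform baseline
print(per_step_credit(advs[2], [3.0, 2.0, 2.0, 1.0]))  # early steps weighted more
```

The normalization in `per_step_credit` keeps the mean credit per step equal to the original advantage, so only the distribution over steps changes.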

63. 【2604.19233】Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

链接：https://arxiv.org/abs/2604.19233

作者：Francesco Moretti,Yi Jin,Guiqin Mario

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Deep learning-based object, variable shooting angles, dense object distributions, computer vision applications, pose formidable challenges

备注：

点击查看摘要

Abstract:Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose Adaptive Slicing-Assisted Hyper Inference (ASAHI), a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1) an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2) a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3) a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
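
The center-distance-aware suppression mentioned for the CDN module can be illustrated with the standard DIoU score (IoU minus the squared center distance divided by the squared diagonal of the enclosing box) inside a plain greedy NMS loop. This is a minimal sketch of DIoU-NMS only: it does not reproduce the Cluster-NMS matrix formulation or ASAHI's actual module, and the threshold value is an assumption.

```python
def diou(a, b):
    """DIoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    dcx = (a[0] + a[2]) / 2.0 - (b[0] + b[2]) / 2.0
    dcy = (a[1] + a[3]) / 2.0 - (b[1] + b[3]) / 2.0
    d2 = dcx * dcx + dcy * dcy
    # Squared diagonal of the smallest box enclosing both.
    ew = max(a[2], b[2]) - min(a[0], b[0])
    eh = max(a[3], b[3]) - min(a[1], b[1])
    c2 = ew * ew + eh * eh
    return iou - (d2 / c2 if c2 > 0 else 0.0)

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep a box only if its DIoU with every kept box is below threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(diou(boxes[i], boxes[k]) < threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(diou_nms(boxes, [0.9, 0.8, 0.7]))  # → [0, 2]
```

Because the center-distance term lowers the score of overlapping boxes whose centers are far apart, two adjacent objects in a crowded scene are less likely to suppress each other than under plain IoU.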

64. 【2604.19218】Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

链接：https://arxiv.org/abs/2604.19218

作者：Quan Zhang,Jingze Wu,Jialong Wang,Xiaohua Xie,Jianhuang Lai,Hongbo Chen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Learning identity-discriminative representations, person re-identification, multi-scene generality, critical objective, objective in person

备注： 10 pages

点击查看摘要

Abstract:Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit identities from massive annotated data rather than understand identity-causal cues, which yields representations that are fragile under multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought (CoT) into the ReID pipeline. Specifically, ReID-R consists of two stages: (i) discriminative reasoning warm-up, where a model is trained in a CoT label-free manner to acquire identity-aware feature understanding; and (ii) efficient reinforcement learning, which uses non-trivial sampling to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward focusing on ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with leading methods while using only 14.3K non-trivial samples (20.9% of the existing data scale). Furthermore, benefiting from its inherent reasoning, ReID-R can provide high-quality interpretations of its results.

65. 【2604.19217】Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data

链接：https://arxiv.org/abs/2604.19217

作者：Gopal Krishna Shyam,Ila Chandrakar

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：world food security, Crop yield prediction, Deep Learning Framework, policy-making decisions, crucial to world

备注： 6 pages, 2 Figures

点击查看摘要

Abstract:Crop yield prediction is one of the most important challenges for world food security and policy-making. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not reflect the dynamic and intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. Unlike traditional models, which use only one of these factors, our model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties [12, 21]. The main architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism that adaptively weights important phenological periods, with weights that vary over time and are conditioned on the spatial features extracted from the image sequences. Experimentally, the proposed approach achieves an R^2 score of 0.89, substantially better than the baseline models.

66. 【2604.19216】An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones

链接：https://arxiv.org/abs/2604.19216

作者：Yuezhe Zhang,Luqian Bai,Mengting Yu,Lei Wei,Shuai Wan,Yifan Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Gaussian Splatting, mobile phones remains, Data acquisition, phones remains, remains a challenge

备注：

点击查看摘要

Abstract:Data acquisition through mobile phones remains a challenge for 3D Gaussian Splatting (3DGS). In this work we target the object-centered scenario and enable reliable mobile acquisition by providing on-device capture guidance and recording onboard sensor signals for offline reconstruction. After the calibration step, the device orientations are aligned to a baseline frame to obtain relative poses, and the optical axis of the camera is mapped to an object-centered spherical grid for uniform viewpoint indexing. To curb polar sampling bias, we compute area-weighted spherical coverage in real-time and guide the user's motion accordingly. We compare the proposed method with RealityScan and the free-capture strategy. Our method achieves superior reconstruction quality using fewer input images compared to free capture and RealityScan. Further analysis shows that the proposed method is able to obtain more comprehensive and uniform viewpoint coverage during object-centered acquisition.
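
The area-weighted coverage idea can be sketched as follows. Viewing directions are binned into a latitude-longitude grid on the unit sphere, and each visited cell is weighted by its true spherical area, so the many small cells near the poles do not inflate the coverage figure. The grid resolution, function names, and the use of a raw direction vector as input are assumptions. The paper's calibration and pose-alignment steps are not modeled.

```python
import math

def cell_index(direction, n_lat, n_lon):
    """Map a 3D viewing direction to a (latitude band, longitude bin) cell."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    lat = math.asin(max(-1.0, min(1.0, z / norm)))   # in [-pi/2, pi/2]
    lon = math.atan2(y, x) % (2.0 * math.pi)         # in [0, 2*pi)
    i = min(int((lat + math.pi / 2.0) / math.pi * n_lat), n_lat - 1)
    j = min(int(lon / (2.0 * math.pi) * n_lon), n_lon - 1)
    return i, j

def cell_area(i, n_lat, n_lon):
    """Area of a cell in latitude band i on the unit sphere."""
    lat_lo = -math.pi / 2.0 + i * math.pi / n_lat
    lat_hi = lat_lo + math.pi / n_lat
    return (2.0 * math.pi / n_lon) * (math.sin(lat_hi) - math.sin(lat_lo))

def coverage(directions, n_lat=6, n_lon=12):
    """Fraction of the sphere's area covered by the visited cells."""
    visited = {cell_index(d, n_lat, n_lon) for d in directions}
    covered = sum(cell_area(i, n_lat, n_lon) for i, _ in visited)
    return covered / (4.0 * math.pi)

# One polar view and one equatorial view cover very different areas.
print(coverage([(0.0, 0.0, 1.0)]), coverage([(1.0, 0.1, 0.1)]))
```

A guidance loop could then steer the user toward the unvisited cell with the largest area, though how the paper turns coverage into on-device guidance is not specified in the abstract.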

67. 【2604.19206】When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide

链接：https://arxiv.org/abs/2604.19206

作者：Hang-Cheng Dong,Yuhao Jiang,Yibo Jiao,Lu Zou,Kai Zheng,Bingguo Liu,Dong Ye,Guodong Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：autonomous driving, safety-critical domains, medical diagnosis, severely hampered, industrial defect inspection

备注：

点击查看摘要

Abstract:The deployment of AI systems in safety-critical domains, such as industrial defect inspection, autonomous driving, and medical diagnosis, is severely hampered by their lack of reliability. A single undetected erroneous prediction can lead to catastrophic outcomes. Unfortunately, there is often no alternative but to place trust in the outputs of a trained AI system, which operates without an internal safeguard to flag unreliable predictions, even in cases of high accuracy. We propose a post-hoc explanation-based indicator to detect false negatives in binary defect detection networks. To our knowledge, this is the first method to proactively identify potentially erroneous network outputs. Our core idea leverages the difference between class-specific discriminative heatmaps and class-agnostic ones. We compute the difference in their intersection over union (IoU) as a reliability score. An adversarial enhancement method is further introduced to amplify this disparity. Evaluations on two industrial defect detection benchmarks show our method effectively identifies false negatives. With adversarial enhancement, it achieves 100% recall, albeit with a trade-off for true negatives. Our work thus advocates for a new and trustworthy deployment paradigm: data-model-explanation-output, moving beyond conventional end-to-end systems to provide critical support for reliable AI in real-world applications.

68. 【2604.19202】SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting

链接：https://arxiv.org/abs/2604.19202

作者：Bo Li,Jiahao Kang,Yubo Ma,Feng-Lin Liu,Bin Liu,Fang-Lue Zhang,Lin Gao

类目：Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词：digital head modeling, Gaussian head models, achieving photorealistic quality, Gaussian representations, Gaussian head

点击查看摘要

Abstract:3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes - especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and then a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.

69. 【2604.19196】Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

链接：https://arxiv.org/abs/2604.19196

作者：Mika Feng,Pierre Gallin-Martel,Koichi Ito,Takafumi Aoki

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：remains challenging due, remains challenging, unseen environments, robust domain generalization, challenging due

备注： 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

点击查看摘要

Abstract:Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: this https URL .

70. 【2604.19193】How Far Are Video Models from True Multimodal Reasoning?

Link: https://arxiv.org/abs/2604.19193

Authors: Xiaotian Zhang, Jianhui Wei, Yuan Wang, Jie Tan, Yichen Li, Yan Zhang, Ziyi Chen, Daoan Zhang, Dezhi YU, Wei Xu, Songtao Jiang, Zuozhu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: question remains unanswered, true multimodal reasoning, achieving true multimodal, multimodal reasoning, remarkable progress

Abstract:Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks (achieving success rates of 25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedback and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.

71. 【2604.19191】Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement

Link: https://arxiv.org/abs/2604.19191

Authors: Pritam Kar, Gouri Lakshmi S, Saptarshi Bej

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: rare pathological conditions, identifying rare pathological, annotated abnormal samples, pathological conditions, essential for identifying

Abstract:Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
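The scoring step named in this abstract (a Gaussian fitted to normal-only features, with Mahalanobis distance as the anomaly score) can be illustrated with a minimal sketch. It assumes features that are already reduced to two dimensions and omits the backbone, the PCA step, and MSDE itself. The function names and toy data are hypothetical and are not taken from the paper.

```python
import math

def fit_gaussian_2d(samples):
    """Estimate the mean and 2x2 covariance of normal-only training features."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Unbiased (n - 1) covariance entries.
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / (n - 1)
    return (mean_x, mean_y), (var_x, cov_xy, var_y)

def mahalanobis_score(point, mean, cov):
    """Anomaly score: Mahalanobis distance of `point` from the fitted Gaussian."""
    var_x, cov_xy, var_y = cov
    det = var_x * var_y - cov_xy ** 2
    if det <= 0:
        raise ValueError("covariance must be non-degenerate")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d, using the closed-form 2x2 inverse.
    quad = (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.sqrt(max(quad, 0.0))

# Toy "normal" features (hypothetical values, for illustration only).
normal_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
mean, cov = fit_gaussian_2d(normal_features)
typical = mahalanobis_score((0.5, 0.5), mean, cov)
outlier = mahalanobis_score((3.0, 3.0), mean, cov)
```

A sample far from the normal cluster receives a larger score than one at the cluster mean, which is what a one-class detector thresholds on.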

72. 【2604.19159】MSDS: Deep Structural Similarity with Multiscale Representation

Link: https://arxiv.org/abs/2604.19159

Authors: Danling Kang, Xue-Hua Chen, Bin Liu, Keke Zhang, Weiling Chen, Tiesong Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Image Quality Assessment, Quality Assessment, Image Quality, demonstrated strong alignment, human visual perception

Abstract:Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
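The idea of scoring each pyramid level independently and fusing the scores with a small set of global weights can be sketched as follows. This is an illustration under stated assumptions, not the MSDS implementation: the per-level score is a placeholder (`1 / (1 + MSE)`) standing in for DeepSSIM, the pyramid uses 2x2 average pooling, and the weights are assumed to be softmax-normalized logits. All function names are hypothetical.

```python
import math

def downsample(img):
    """Halve resolution by 2x2 average pooling (rows of equal, even length)."""
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, len(img[0]) - 1, 2)]
        for r in range(0, len(img) - 1, 2)
    ]

def level_similarity(a, b):
    """Placeholder per-level score in (0, 1]; stands in for DeepSSIM."""
    n = len(a) * len(a[0])
    mse = sum((pa - pb) ** 2
              for ra, rb in zip(a, b)
              for pa, pb in zip(ra, rb)) / n
    return 1.0 / (1.0 + mse)

def softmax(logits):
    """Turn unconstrained weights into positive weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(w - peak) for w in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multiscale_score(ref, dist, logits):
    """Score each pyramid level independently, then fuse with global weights."""
    scores = []
    a, b = ref, dist
    for level in range(len(logits)):
        if level > 0:
            a, b = downsample(a), downsample(b)
        scores.append(level_similarity(a, b))
    return sum(w * s for w, s in zip(softmax(logits), scores))

# Toy 4x4 reference and a uniformly brightened copy (hypothetical data).
ref = [[float(4 * r + c) for c in range(4)] for r in range(4)]
dist = [[p + 1.0 for p in row] for row in ref]
same = multiscale_score(ref, ref, [0.0, 0.0, 0.0])
shifted = multiscale_score(ref, dist, [0.0, 0.0, 0.0])
```

In a trained model the logits would be learned jointly across the dataset; here they are fixed at zero, which gives uniform weights over the three levels.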

73. 【2604.19145】ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Link: https://arxiv.org/abs/2604.19145

Authors: Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: massive computational overhead, multi-frame video input, Vision-Language Models, autonomous driving systems, central to autonomous

Comments: 18 pages, 4 figures

Abstract:Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.

74. 【2604.19141】Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

Link: https://arxiv.org/abs/2604.19141

Authors: Johannes Schusterbauer, Ming Gui, Yusong Li, Pingchuan Ma, Felix Krause, Björn Ommer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: updating all patches, function evaluations, number of function, allocate compute uniformly, compute uniformly

Comments: CVPR 2026, Code: [this https URL](https://github.com/CompVis/patch-forcing)

Abstract:Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.

75. 【2604.19135】Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

Link: https://arxiv.org/abs/2604.19135

Authors: Hang Cheng, Fanhe Dong, Long Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper presents, diffusion models, shape retrieval, zero-shot visual retrieval, shape retrieval methods

Abstract:This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.

76. 【2604.19133】BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination

Link: https://arxiv.org/abs/2604.19133

Authors: Michele Grimaldi, David Nakath, Oscar Pizarro, Jonatan Scharff Willners, Ignacio Carlucho, Yvan R. Petillot

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: varying environmental conditions, environmental conditions remains, robotic perception, varying environmental, remains a critical

Abstract:Robust 3D reconstruction across varying environmental conditions remains a critical challenge for robotic perception, particularly when transitioning between air and water. To address this, we introduce BALTIC, a controlled benchmark designed to systematically evaluate modern 3D reconstruction methods under variations in medium and lighting. The benchmark comprises 13 datasets spanning two media (air and water) and three lighting conditions (ambient, artificial, and mixed), with additional variations in motion type, scanning pattern, and initialization trajectory, resulting in a diverse set of sequences. Our experimental setup features a custom water tank equipped with a monocular camera and an HTC Vive tracker, enabling accurate ground-truth pose estimation. We further investigate cross-domain reconstruction by augmenting underwater image sequences with a small number of in-air views captured under similar lighting conditions. We evaluate Structure-from-Motion reconstruction using COLMAP in terms of both trajectory accuracy and scene geometry, and use these reconstructions as input to Neural Radiance Fields and 3D Gaussian Splatting methods. The resulting models are assessed against ground-truth trajectories and in-air references, while rendered outputs are compared using perceptual and photometric metrics. Additionally, we perform a color restoration analysis to evaluate radiometric consistency across domains. Our results show that under controlled, texture-consistent conditions, Gaussian Splatting with simple preprocessing (e.g., white balance correction) can achieve performance comparable to specialized underwater methods, although its robustness decreases in more complex and heterogeneous real-world environments.

77. 【2604.19129】PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

Link: https://arxiv.org/abs/2604.19129

Authors: Chaonan Ji, Jinwei Qi, Sheng Xu, Peng Zhang, Bang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing facial reenactment, Existing facial, reenactment methods struggle, fine-grained controllability, facial reenactment methods

Comments: Accepted by CVPR 2026

Abstract:Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.

78. 【2604.19108】Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal

Link: https://arxiv.org/abs/2604.19108

Authors: Eun-Ju Park, Youjin Shin, Simon S. Woo

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: machine unlearning plays, privacy protection, artificial intelligence, balance the growth, plays a crucial

Abstract:As a means to balance the growth of the AI industry with the need for privacy protection, machine unlearning plays a crucial role in realizing the "right to be forgotten" in artificial intelligence. This technique enables AI systems to remove the influence of specific data while preserving the rest of the learned knowledge. Although it has been actively studied, most existing unlearning methods assume that unlearning is performed only once. In this work, we evaluate existing unlearning algorithms in a more realistic scenario where unlearning is conducted repeatedly, and in this setting, we identify two critical phenomena: (1) Knowledge Erosion, where the accuracy on retain data progressively degrades over unlearning phases, and (2) Forgetting Reversal, where previously forgotten samples become recognizable again in later phases. To address these challenges, we propose SAFER (StAbility-preserving Forgetting with Effective Regularization), a continual unlearning framework that maintains representation stability for retain data while enforcing negative logit margins for forget data. Extensive experiments show that SAFER mitigates not only knowledge erosion but also forgetting reversal, achieving stable performance across multiple unlearning phases.
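To make "negative logit margins for forget data" concrete, here is one possible reading as a hinge penalty. The abstract does not give the loss, so the formulation, the `target` parameter, and the function names below are assumptions for illustration and should not be read as SAFER's actual objective.

```python
def logit_margin(logits, label):
    """Margin of the labelled class over the strongest competing class."""
    best_other = max(z for k, z in enumerate(logits) if k != label)
    return logits[label] - best_other

def forget_margin_loss(logits, label, target=1.0):
    """Hinge penalty that is zero once the margin is at most -target.

    Assumed formulation: a forget sample is penalized until its original
    class logit sits at least `target` below the best other class.
    """
    return max(0.0, logit_margin(logits, label) + target)

# A forget sample the model still recognizes (positive margin, penalized).
still_known = forget_margin_loss([2.0, 0.5, -1.0], label=0)
# A forget sample whose margin is already sufficiently negative (no penalty).
forgotten = forget_margin_loss([0.0, 3.0], label=0)
```

Under this reading, Forgetting Reversal would show up as the margin of a previously forgotten sample drifting back above zero in a later unlearning phase.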

79. 【2604.19105】EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Link: https://arxiv.org/abs/2604.19105

Authors: Ruibing Hou, Mingyue Zhou, Yuwei Gui, Mingshuang Luo, Bingpeng Ma, Hong Chang, Shiguang Shan, Xilin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Faithfully modeling human, Faithfully modeling, embodied intelligence, behavior in dynamic, dynamic environments

Comments: 12 pages, 3 figures

Abstract:Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical reasoning-generation entanglement challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance, and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.

80. 【2604.19093】Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

Link: https://arxiv.org/abs/2604.19093

Authors: Jinglin Xu, Yi Li, Chuxiong Sun, Xiao Xu, Jiangmeng Li, Fanjiang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: unlabeled target data, Multi-modal test-time adaptation, multi-modal TTA, multi-modal TTA methodologies, category-conditional distributions

Abstract:Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenarios, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at this https URL.

81. 【2604.19064】The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation

Link: https://arxiv.org/abs/2604.19064

Authors: Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song, Jingwen Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VLN action supervision, standard VLN action, balancing behavioral diversity, critically depends, depends on balancing

Abstract:In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.
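For readers unfamiliar with the SPL figures quoted above, SPL (Success weighted by Path Length) is commonly computed as the mean over episodes of `success * shortest / max(taken, shortest)`. The sketch below follows that standard definition. The episode values are made up for illustration, and the success flag is taken as given because the success criterion differs between benchmarks.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is (success, shortest_path_length, agent_path_length),
    with shortest_path_length assumed to be positive.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode earns 1.0 only if the agent's path is
            # no longer than the shortest path; detours reduce the credit.
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical episodes: optimal success, success with a 2x detour,
# failure, and another optimal success.
episodes = [
    (True, 10.0, 10.0),
    (True, 10.0, 20.0),
    (False, 10.0, 5.0),
    (True, 8.0, 8.0),
]
score = spl(episodes)
```

Papers usually report this value multiplied by 100, so the 0.625 obtained for these toy episodes would be written as 62.5.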

82. 【2604.19054】Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

Link: https://arxiv.org/abs/2604.19054

Authors: Zihao Ye, Yung Hsiang Lu, Xiao Hu, Shuai Zhang, Taotao Jing, Xin Li, Zhen Yao, Bo Lang, Zhihao Zheng, Seungmin Oh, Hankyul Kang, Seunghun Kang, Jongbin Ryu, Kexin Chen, Yuan Qi, George K Thiruvathukal, Mooi Choo Chuah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: IEEE Low-Power Computer, Monocular Depth Estimation, efficient vision models, IEEE Low-Power, Low-Power Computer Vision

Comments: 11 pages, 8 figures, 4 tables

Abstract:The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.

83. 【2604.19039】Generative Texture Filtering

Link: https://arxiv.org/abs/2604.19039

Authors: Rongjia Zheng, Shangwei Huang, Lei Zhu, Wei-Shi Zheng, Qing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: exhibits surprisingly good, surprisingly good performance, performance and generalizability, exhibits surprisingly, surprisingly good

Comments: Accepted to SIGGRAPH 2026 conference track

Abstract:We present a generative method for texture filtering, which exhibits surprisingly good performance and generalizability. Our core idea is to empower texture filtering by taking full advantage of the strong learned image prior of pre-trained generative models. To this end, we propose to fine-tune a pre-trained generative model via a two-stage strategy. Specifically, we first conduct supervised fine-tuning on a very small set of paired images, and then perform reinforcement fine-tuning on a large-scale unlabeled dataset under the guidance of a reward function that quantifies the quality of texture removal and structure preservation. Extensive experiments show that our method clearly outperforms previous methods, and is effective to deal with previously challenging cases. Our code is available at this https URL.

84. 【2604.19034】Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

Link: https://arxiv.org/abs/2604.19034

Authors: Xu Chen, Shichao Xie, Zhining Gu, Lu Jia, Minghua Luo, Fei Liu, Zedong Chu, Yanfen Shen, Xiaolong Wu, Mu Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Constructing structured spatial, enabling long-horizon reasoning, complex embodied navigation, Constructing structured, structured spatial memory

Abstract:Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent's movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.

85. 【2604.19009】Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

Link: https://arxiv.org/abs/2604.19009

Authors: Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Distribution Matching Distillation, Distribution Matching, shown great promise, exemplified by Distribution, integrating Reinforcement Learning

Abstract:Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.

86. 【2604.18993】AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos

Link: https://arxiv.org/abs/2604.18993

Authors: Jiagao Hu, Daiguo Zhou, Danzhen Fu, Fuhao Li, Zepeng Wang, Fei Wang, Wenhua Liao, Jiayi Xie, Haiyang Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: adverse weather remains, autonomous driving, adverse weather, Adverse Weather video, remains a critical

Comments: Accepted by ICMR 2026

Abstract:Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic-structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: this https URL

87. 【2604.18988】A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation

Link: https://arxiv.org/abs/2604.18988

Authors: Liping Wang,Cheng Ye,Weidong Chen,Peipei Song,Bo Hu,Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate emotionally engaging, users' multimodal contexts, aims to generate, generate emotionally, emotionally engaging

Comments: Submitted to ACM Multimedia 2026

Abstract:Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users' multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.

88. 【2604.18980】AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs

Link: https://arxiv.org/abs/2604.18980

Authors: Joongho Jo,Hyerin Lim,Hanjun Choi,Jongsun Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian-tile pairs, Gaussian Splatting, Gaussian-tile pairs based, number of Gaussian-tile, Reducing the number

Comments: DAC 2026

Abstract:Reducing the number of Gaussian-tile pairs is one of the most promising approaches to improve 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, the importance difference existing among Gaussian-tile pairs has never been considered in the previous works. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that the peripheral tiles located far from Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity for reducing the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves a geometric mean speedup of 13.8x over original 3D-GS on a GPU, with only about 0.5 dB degradation in PSNR on city-scale scenes.
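
The tile-count effect described in this abstract can be illustrated with a small sketch. It is not the AdaGScale algorithm: the 16-pixel tile size, the bounding-box intersection test, and the `scaled_radius` rule (including its `min_scale` parameter and the meaning of `peripheral_score`) are all assumptions made for illustration. It only shows that shrinking the radius used for the intersection test reduces the number of Gaussian-tile pairs.

```python
import math

TILE = 16  # tile edge in pixels (assumed value for this sketch)


def tiles_touched(cx, cy, radius, width, height, tile=TILE):
    """Count tiles overlapped by the axis-aligned box around a projected Gaussian."""
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    x0 = max(0, int((cx - radius) // tile))
    x1 = min(tiles_x - 1, int((cx + radius) // tile))
    y0 = max(0, int((cy - radius) // tile))
    y1 = min(tiles_y - 1, int((cy + radius) // tile))
    if x1 < x0 or y1 < y0:
        return 0  # box lies entirely outside the image
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def scaled_radius(radius, peripheral_score, min_scale=0.5):
    """Hypothetical scaling rule: a low peripheral score shrinks the radius more.

    peripheral_score is clamped to [0, 1]; 1 keeps the original radius.
    """
    score = min(1.0, max(0.0, peripheral_score))
    return radius * (min_scale + (1.0 - min_scale) * score)


# Intersection test with the original radius versus the shrunken radius.
full = tiles_touched(100.0, 100.0, 40.0, 640, 480)
reduced = tiles_touched(100.0, 100.0, scaled_radius(40.0, 0.0), 640, 480)
print(full, reduced)
```

In line with the abstract, the shrunken radius would be used only for the intersection test, while color accumulation would still use the original size.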

89. 【2604.18967】Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

Link: https://arxiv.org/abs/2604.18967

Authors: Aaron Nicolson,Elizabeth J. Cooper,Hwan-Jin Yoon,Claire McCafferty,Ramya Krishnan,Michelle Craigie,Nivene Saad,Jason Dowling,Ian A. Scott,Bevan Koopman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Chest X-ray, shown rapid progress, clinical utility remains, utility remains uncertain, remains uncertain due

Comments:

Abstract:Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.

90. 【2604.18961】AI-Enabled Image-Based Hybrid Vision/Force Control of Tendon-Driven Aerial Continuum Manipulators

Link: https://arxiv.org/abs/2604.18961

Authors: Shayan Sepahvand,Farrokh Janabi-Sharifi,Farhad Aghili

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: continuum manipulators based, AI-enabled cascaded hybrid, tendon-driven aerial continuum, aerial continuum manipulators, cascaded hybrid vision

Comments:

Abstract:This paper presents an AI-enabled cascaded hybrid vision/force control framework for tendon-driven aerial continuum manipulators based on constant-strain modeling in $SE(3)$ as a coupled system. The proposed controller is designed to enable autonomous, physical interaction with a static environment while stabilizing the image feature error. The developed strategy combines the cascaded fast fixed-time sliding mode control and a radial basis function neural network to cope with the uncertainties in the image acquired by the eye-in-hand monocular camera and the measurements from the force sensing apparatus. This ensures rapid, online learning of the vision- and force-related uncertainties without requiring offline training. Furthermore, the features are extracted via a state-of-the-art graph neural network architecture employed by a visual servoing framework using line features, rather than relying on heuristic geometric line extractors, to concurrently contribute to tracking the desired normal interaction force during contact and regulating the image feature error. A comparative study benchmarks the proposed controller against established rigid-arm aerial manipulation methods, evaluating robustness across diverse scenarios and feature extraction strategies. The simulation and experimental results showcase the effectiveness of the proposed methodology under various initial conditions and demonstrate robust performance in executing manipulation tasks.

91. 【2604.18957】Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images

Link: https://arxiv.org/abs/2604.18957

Authors: Abdul Mueez,Shruti Vyas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Extracting standardized metallurgical, microscopy images remains, images remains challenging, remains challenging due, standardized metallurgical metrics

Comments: Accepted at the 11th IEEE Workshop on Computer Vision for Multimodal Microscopy Image Analysis (CVMI), CVPR Workshops 2026

Abstract:Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at this https URL.
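
For readers unfamiliar with the planimetric step mentioned in this abstract, the following is a minimal sketch of how a grain count could be turned into an ASTM grain size number G and how a MAPE could be computed. It is not the authors' module. The function names and example counts are illustrative, and the sketch assumes the standard ASTM E112 relations: the Jeffries count N_A = f (N_inside + N_intercepted / 2) with f = M^2 / A, and G = 3.321928 log10(N_A) - 2.954 for N_A in grains per mm^2 at 1x.

```python
import math


def jeffries_grains_per_mm2(n_inside, n_intercepted, magnification, test_area_mm2):
    """Jeffries planimetric count, expressed as grains per mm^2 at 1x.

    test_area_mm2 is the test area as measured on the magnified image.
    Grains cut by the test boundary count as one half each.
    """
    multiplier = magnification ** 2 / test_area_mm2
    return multiplier * (n_inside + n_intercepted / 2.0)


def astm_grain_size_number(grains_per_mm2):
    """ASTM grain size number G from grains per mm^2 at 1x."""
    return 3.321928 * math.log10(grains_per_mm2) - 2.954


def mape(predicted, reference):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - r) / abs(r) for p, r in zip(predicted, reference)]
    return 100.0 * sum(errors) / len(errors)


# Illustrative numbers: 40 whole grains and 20 boundary grains counted at 100x
# inside a 5000 mm^2 test area (50 grains in effect).
n_a = jeffries_grains_per_mm2(40, 20, 100, 5000.0)
print(n_a, astm_grain_size_number(n_a))
```

In a segmentation pipeline, the two counts would come from the predicted instance masks: instances fully inside the test region versus instances touching its boundary.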

92. 【2604.18940】Localization-Guided Foreground Augmentation in Autonomous Driving

Link: https://arxiv.org/abs/2604.18940

Authors: Jiawei Yong,Deyuan Qu,Qi Chen,Kentaro Oguchi,Shintaro Fukushima

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Autonomous driving systems, adverse visibility conditions-such, online scene geometry, Autonomous driving, snow-where online scene

Comments:

Abstract:Autonomous driving systems often degrade under adverse visibility conditions, such as rain, nighttime, or snow, where online scene geometry (e.g., lane dividers, road boundaries, and pedestrian crossings) becomes sparse or fragmented. While high-definition (HD) maps can provide missing structural context, they are costly to construct and maintain at scale. We propose Localization-Guided Foreground Augmentation (LG-FA), a lightweight and plug-and-play inference module that enhances foreground perception by enriching geometric context online. LG-FA: (i) incrementally constructs a sparse global vector layer from per-frame Bird's-Eye View (BEV) predictions; (ii) estimates ego pose via class-constrained geometric alignment, jointly improving localization and completing missing local topology; and (iii) reprojects the augmented foreground into a unified global frame to improve per-frame predictions. Experiments on challenging nuScenes sequences demonstrate that LG-FA improves the geometric completeness and temporal stability of BEV representations, reduces localization error, and produces globally consistent lane and topology reconstructions. The module can be seamlessly integrated into existing BEV-based perception systems without backbone modification. By providing a reliable geometric context prior, LG-FA enhances temporal consistency and supplies stable structural support for downstream modules such as tracking and decision-making.

93. 【2604.18881】A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders

Link: https://arxiv.org/abs/2604.18881

Authors: Zhongying Wang,Kevin Lane,Levi Cai,Morteza Karimzadeh,Esther Rolf

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Supervised learning, learning with Earth, in-situ measured data, Earth observation inputs, location encoder

Comments: Accepted to EarthVision 2026 (CVPR Workshop). 13 pages total (10 pages main paper + 3 pages supplementary material), 5 main figures

Abstract:Supervised learning with Earth observation inputs is often limited by the sparsity of high-quality labeled or in-situ measured data to use as training labels. With the abundance of geographic data products, in many cases there are variables correlated with - but different from - the variable of interest that can be leveraged. We integrate such proxy variables within a geographic prior via a trainable location encoder and introduce a proxy consistency loss (PCL) formulation to imbue proxy data into the location encoder. The first key insight behind our approach is to use the location encoder as an agile and flexible way to learn from abundantly available proxy data which can be sampled independently of training label availability. Our second key insight is that we will need to regularize the location encoder appropriately to achieve performance and robustness with limited labeled data. Our experiments on air quality prediction and poverty mapping show that integrating proxy data implicitly through the location encoder outperforms using both as input to an observation encoder and fusion strategies that use frozen, pretrained location embeddings as a geographic prior. Superior performance for in-sample prediction shows that the PCL can incorporate rich information from the proxies, and superior out-of-sample prediction shows that the learned latent embeddings help generalize to areas without training labels.

94. 【2604.18867】Hierarchically Robust Zero-shot Vision-language Models

Link: https://arxiv.org/abs/2604.18867

Authors: Junhao Dong,Yifei Zhang,Hao Zhu,Yew-Soon Ong,Piotr Koniusz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: perform zero-shot classification, perform zero-shot, zero-shot classification, Vision-Language Models, adversarial attacks

Comments: This paper is accepted by CVPR'26

Abstract:Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.

95. 【2604.18866】HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

Link: https://arxiv.org/abs/2604.18866

Authors: Pourya Shamsolmoali,Masoumeh Zareapoor,Michael Felsberg,Nick Pears,Yue Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aerial imagery remains, semantic label coverage, spatial resolution, label coverage, imagery remains

Comments: Submitted to IJCV September 2025

Abstract:Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.

96. 【2604.18857】Task Switching Without Forgetting via Proximal Decoupling

Link: https://arxiv.org/abs/2604.18857

Authors: Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,William A. P. Smith,Yue Lu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: forgetting old knowledge, primary challenge, learn new information, information without forgetting, continual learning

Comments: Submitted to IEEE TPAMI January 2026

Abstract:In continual learning, the primary challenge is to learn new information without forgetting old knowledge. A common solution addresses this trade-off through regularization, penalizing changes to parameters critical for previous tasks. In most cases, this regularization term is directly added to the training loss and optimized with standard gradient descent, which blends learning and retention signals into a single update and does not explicitly separate essential parameters from redundant ones. As task sequences grow, this coupling can over-constrain the model, limiting forward transfer and leading to inefficient use of capacity. We propose a different approach that separates task learning from stability enforcement via operator splitting. The learning step focuses on minimizing the current task loss, while a proximal stability step applies a sparse regularizer to prune unnecessary parameters and preserve task-relevant ones. This turns the stability-plasticity trade-off into a negotiated update between two complementary operators, rather than a conflicting gradient. We provide theoretical justification for the splitting method on the continual-learning objective, and demonstrate that our proposed solver achieves state-of-the-art results on standard benchmarks, improving both stability and adaptability without the need for replay buffers, Bayesian sampling, or meta-learning components.
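
The two-operator update described in this abstract resembles forward-backward (proximal gradient) splitting. The sketch below shows that generic scheme, not the authors' solver. It assumes an L1 penalty as the sparse regularizer, whose proximal operator is soft-thresholding, and the names `soft_threshold`, `split_update`, `lr`, and `lam` are illustrative.

```python
def soft_threshold(value, tau):
    """Proximal operator of tau * |value|: shrink toward zero, clip small values to 0."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def split_update(weights, grads, lr, lam):
    """One forward-backward step on plain lists of floats.

    Learning step: gradient descent on the current task loss only.
    Stability step: proximal map of lam * ||w||_1, applied separately.
    """
    stepped = [w - lr * g for w, g in zip(weights, grads)]
    return [soft_threshold(w, lr * lam) for w in stepped]


# Small weights are pruned to exactly zero, larger ones are shrunk slightly.
print(split_update([1.0, 0.05, -0.5], [0.5, 0.0, -1.0], lr=0.1, lam=1.0))
```

Because the penalty never enters the gradient, the task gradient and the sparsity constraint are applied by separate operators instead of being summed into a single update direction.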

97. 【2604.18856】ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2604.18856

Authors: Mohammed Q. Alkhatib

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited labeled data, remains challenging due, classification remains challenging, Hyperspectral image, high spectral dimensionality

Comments: Pre-print Accepted for Publication in International Journal of Remote Sensing

Abstract:Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba inspired gated sequence mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN, Transformer, and Mamba based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at this https URL

98. 【2604.18853】DDF2Pol: A Dual-Domain Feature Fusion Network for PolSAR Image Classification

Link: https://arxiv.org/abs/2604.18853

Authors: Mohammed Q. Alkhatib

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lightweight dual-domain convolutional, dual-domain convolutional neural, convolutional neural network, paper presents, lightweight dual-domain

Comments: Pre-print Accepted for Publication in Pattern Recognition Letters

Abstract:This paper presents DDF2Pol, a lightweight dual-domain convolutional neural network for PolSAR image classification. The proposed architecture integrates two parallel feature extraction streams, one real-valued and one complex-valued, designed to capture complementary spatial and polarimetric information from PolSAR data. To further refine the extracted features, a depth-wise convolution layer is employed for spatial enhancement, followed by a coordinate attention mechanism to focus on the most informative regions. Experimental evaluations conducted on two benchmark datasets, Flevoland and San Francisco, demonstrate that DDF2Pol achieves superior classification performance while maintaining low model complexity. Specifically, it attains an Overall Accuracy (OA) of 98.16% on the Flevoland dataset and 96.12% on the San Francisco dataset, outperforming several state-of-the-art real- and complex-valued models. With only 91,371 parameters, DDF2Pol offers a practical and efficient solution for accurate PolSAR image analysis, even when training data is limited. The source code is publicly available at this https URL

99. 【2604.18842】Multi-Domain Learning with Global Expert Mapping

Link: https://arxiv.org/abs/2604.18842

Authors: Pourya Shamsolmoali,Masoumeh Zareapoor,Huiyu Zhou,Oscar Mendez,Dacheng Tao,Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Human perception generalizes, Human perception, vision models struggle, perception generalizes, Human

Comments: Submitted to IEEE TPAMI in August 2025

Abstract:Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations and reducing performance, especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets, and solves task interference in few-shot adaptation scenarios.

100. 【2604.18831】Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

Link: https://arxiv.org/abs/2604.18831

Authors: Haiyang Wu,Juan J. Gonzales Torres,George Vosselman,Ville Lehtola

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: step toward higher-level, mapping applications, Visual Foundation Models, fundamental step, understanding and mapping

Comments:

Abstract:Frame-wise semantic segmentation of indoor lidar scans is a fundamental step toward higher-level 3D scene understanding and mapping applications. However, acquiring frame-wise ground truth for training deep learning models is costly and time-consuming. This challenge is largely addressed, for imagery, by Visual Foundation Models (VFMs) which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D-to-3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes, in a frame-wise distillation manner by coupling each lidar scan with a VFM-processed camera image. The evaluation is done using indoor SLAM datasets, where pseudo-labels are used for downstream evaluation. Also, a small manually annotated lidar dataset is provided for validation, as there are no other lidar frame-wise indoor datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo-label evaluation and around 36% mIoU with real-label, demonstrating the feasibility of cross-modal distillation for indoor lidar semantic segmentation without manual annotations.
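
The mIoU figures quoted in this abstract follow the usual mean intersection-over-union definition for per-point labels. A minimal sketch is given below. It is not the paper's evaluation code, and skipping classes that appear in neither the prediction nor the reference is one common convention, assumed here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over per-point class labels.

    pred and target are equal-length sequences of integer class ids.
    Classes absent from both are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Six lidar points, three classes: per-class IoUs are 1/3, 2/3 and 1/2.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, 3))
```

Swapping `target` between VFM-derived pseudo-labels and manual annotations corresponds to the two evaluation settings the abstract reports.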

101. 【2604.18829】DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

Link: https://arxiv.org/abs/2604.18829

Authors: Abrar Majeedi,Zhiyuan Ruan,Ziyi Zhao,Hongcheng Wang,Jianglin Lu,Yin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, large language models, Multimodal large, achieved impressive performance, RGB imagery

Comments: Accepted at CVPR Findings 2026

Abstract:Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at this https URL.

102. 【2604.18811】Rethinking Dataset Distillation: Hard Truths about Soft Labels

Link: https://arxiv.org/abs/2604.18811

Authors: Priyam Dey,Aditya Sahdev,Sunny Bhati,Konda Reddy Mopuri,R. Venkatesh Babu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent evidence finds, simple random image, downstream model training, soft labels, recent evidence

Comments: CVPR 2026 (Oral). First two authors contributed equally

Abstract:Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.

103. 【2604.18804】Geometric Decoupling: Diagnosing the Structural Instability of Latent

Link: https://arxiv.org/abs/2604.18804

Authors: Yuanbang Liang,Zhengwen Chen,Yu-Kun Lai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Latent Diffusion Models, latent space brittleness, Diffusion Models, achieve high-fidelity synthesis, Latent Diffusion

Comments:

Abstract:Latent Diffusion Models (LDMs) achieve high-fidelity synthesis but suffer from latent space brittleness, causing discontinuous semantic jumps during editing. We introduce a Riemannian framework to diagnose this instability by analyzing the generative Jacobian, decomposing geometry into Local Scaling (capacity) and Local Complexity (curvature). Our study uncovers a "Geometric Decoupling": while curvature in normal generation functionally encodes image detail, OOD generation exhibits a functional decoupling where extreme curvature is wasted on unstable semantic boundaries rather than perceptible details. This geometric misallocation identifies "Geometric Hotspots" as the structural root of instability, providing a robust intrinsic metric for diagnosing generative reliability.

104. 【2604.18803】LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

Link: https://arxiv.org/abs/2604.18803

Authors: Zhiyuan Jiang,Weihao Hong,Xinlei Guan,Tejaswi Dhandu,Miles Q. Li,Meng Xu,Kuan Huang,Umamaheswara Rao Tida,Bingyu Shen,Daehan Kwak,Boyang Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: carries operational consequences, phrasing remains undercharacterized, reliable visual grounding, visual grounding carries, grounding carries operational

Comments: 23 pages, 12 figures

Click to view abstract

Abstract:Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families -- text-illegibility, time-reading, and object-absence -- each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels -- patterns that aggregate metrics obscure.
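
The rule-based H-Rate described above is a proportion, so it can be illustrated in a few lines. This is a minimal sketch only: the abstract does not give the rules that decide whether a response counts as an unsupported commitment, so the boolean flags below are hypothetical inputs, and `h_rate` is an illustrative name rather than the authors' code. The counts (800 images, 5 prompts each, 717 validated images) are the ones stated in the abstract.

```python
def h_rate(committed_flags):
    """Share of responses flagged as unsupported positive commitments."""
    flags = list(committed_flags)
    if not flags:
        raise ValueError("need at least one response")
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical flags for one model at one tone level
# (True = the model asserted something the image cannot support).
example_flags = [True, False, False, True, False]
example_rate = h_rate(example_flags)

# Counts stated in the abstract: 800 images, 5 prompts per image,
# 717 images confirmed as strictly compliant.
total_prompts = 800 * 5
validated_share = 717 / 800

print(example_rate, total_prompts, validated_share)
```

Computing the rate separately per tone level is what would expose the non-monotonic sensitivity the abstract mentions; a single pooled rate would hide it.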

105. 【2604.18797】CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

Link: https://arxiv.org/abs/2604.18797

Authors: Linkai Peng,Cuiling Sun,Zheyuan Zhang,Wanying Dou,Halil Ertugrul Aktas,Andrea M Bejar,Elif Keles,Tamas Gonda,Michael B Wallace,Zongwei Zhou,Gorkem Durak,Rajesh N Keswani,Ulas Bagci

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: abdominal MRI analysis, systematic investigation, Automatic pancreas, Automatic pancreas segmentation, fundamental to abdominal

Comments: Accepted to MIDL 2026

Click to view abstract

Abstract:Automatic pancreas segmentation is fundamental to abdominal MRI analysis, yet deep learning models trained on one MRI sequence often fail catastrophically when applied to another, a challenge that has received little systematic investigation. We introduce CrossPan, a multi-institutional benchmark comprising 1,386 3D scans across three routinely acquired sequences (T1-weighted, T2-weighted, and Out-of-Phase) from eight centers. Our experiments reveal three key findings. First, cross-sequence domain shifts are far more severe than cross-center variability: models achieving Dice scores above 0.85 in-domain collapse to near-zero (0.02) when transferred across sequences. Second, state-of-the-art domain generalization methods provide negligible benefit under these physics-driven contrast inversions, whereas foundation models like MedSAM2 maintain moderate zero-shot performance through contrast-invariant shape priors. Third, semi-supervised learning offers gains only under stable intensity distributions and becomes unstable on sequences with high intra-organ variability. These results establish cross-sequence generalization, not model architecture or center diversity, as the primary barrier to clinically deployable pancreas MRI segmentation. Dataset and code are available at this https URL.
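
The Dice scores quoted above (above 0.85 in-domain, 0.02 across sequences) measure the overlap between a predicted mask and a reference mask. A minimal sketch, assuming flat binary masks given as lists of 0/1; the function name and the convention for two empty masks are illustrative choices and are not taken from the CrossPan code:

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks of 0s and 1s."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total


reference = [1, 1, 0, 0]
print(dice_score([1, 1, 0, 0], reference))  # identical masks
print(dice_score([1, 0, 0, 0], reference))  # partial overlap
print(dice_score([0, 0, 1, 1], reference))  # disjoint masks
```

A score near zero, as reported for cross-sequence transfer, means the predicted and reference masks barely overlap.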

106. 【2604.18790】EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion

Link: https://arxiv.org/abs/2604.18790

Authors: Johny J. Lopez,Md Meftahul Ferdaus,Mahdi Abdelguerfi,Anton Netchaev,Steven Sloan,Ken Pathak,Kendall N. Niles

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sparse LiDAR measurements, Convolutional Spatial Propagation, prerequisite for accurate, perception in robotic, robotic systems

Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.
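
The reported latency and frame rate can be cross-checked with one line of arithmetic, since throughput for sequential single-frame inference is the reciprocal of per-frame latency. A small sketch; the helper name is illustrative, and only the 20.51 ms figure comes from the abstract:

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second implied by a per-frame latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


reported_latency_ms = 20.51  # from the abstract
implied_fps = fps_from_latency_ms(reported_latency_ms)
print(round(implied_fps, 2))
```

The result agrees with the 48.76 FPS stated in the abstract, assuming frames are processed one at a time without batching.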

107. 【2604.18781】CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans

Link: https://arxiv.org/abs/2604.18781

Authors: Sergio Morell-Ortega,Ángela González-Cebrián,Boris Mansencal,Marien Gadea,Roberto Vivo-Hernando,Gregorio Rubio,Fernando Aparici,Maria de la Iglesia-Vaya,Gwenaelle Catheline,Pierrick Coupé,José V. Manjón

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated morphometric analysis, routine clinical practice, Large-scale automated morphometric, anisotropic acquisitions prevalent, automated morphometric

Comments:

Click to view abstract

Abstract:Large-scale automated morphometric analysis of brain MRI is limited by the thick-slice, anisotropic acquisitions prevalent in routine clinical practice. Existing generative super-resolution (SR) methods produce visually compelling isotropic volumes but often introduce anatomical hallucinations, systematic volumetric overestimation, and structural distortions that compromise downstream quantitative analysis and diagnostic safety. To address this, we propose CAHAL (Clinically Applicable resolution enHAncement for Low-resolution MRI scans), a hallucination-robust, physics-informed resolution enhancement framework that operates directly in the patient's native acquisition space. CAHAL employs a deterministic bivariate Mixture of Experts (MoE) architecture routing each input through specialised residual 3D U-Net experts conditioned on both volumetric resolution and acquisition anisotropy, two independent descriptors of clinical MRI acquisition. Experts are optimised with a composite loss combining edge-penalised spatial reconstruction, Fourier-domain spectral coherence matching, and a segmentation-guided semantic consistency constraint. Training pairs are generated on-the-fly via physics-based degradation sampled from a large-scale real-world database, ensuring robust generalisation. Validated on T1-weighted and FLAIR sequences against generative baselines, CAHAL achieves state-of-the-art results, improving the best related methods in terms of accuracy and efficiency.

108. 【2604.18757】REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

Link: https://arxiv.org/abs/2604.18757

Authors: Seowung Leem,Lin Gu,Chenyu You,Kuang Gong,Ruogu Fang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: disease susceptibility long, reflect well-established contributors, capturing early structural, clinical symptom onset, factors reflect well-established

Comments: Accepted for publication at MIDL 2026

Click to view abstract

Abstract:The retina provides a unique, noninvasive window into Alzheimer's disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.

109. 【2604.18747】URoPE: Universal Relative Position Embedding across Geometric Spaces

Link: https://arxiv.org/abs/2604.18747

Authors: Yichen Xie,Depu Meng,Chensheng Peng,Yihan Hu,Quentin Herau,Masayoshi Tomizuka,Wei Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Relative position embedding, position embedding, Rotary Position Embedding, Relative position, fixed geometric space

Comments:

Click to view abstract

Abstract:Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: this https URL.
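
The geometric step described above (sample 3D points along a key patch's camera ray at fixed depth anchors, then project them into the query image) can be sketched with a plain pinhole camera model. This is an illustration under stated assumptions, not the URoPE implementation: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the key-to-query pose given as a rotation matrix plus translation are all hypothetical, and the abstract does not say how the per-anchor coordinates are fed into 2D RoPE.

```python
def backproject(u, v, depth, intr):
    """Lift pixel (u, v) to a 3D point at the given depth in the camera frame."""
    fx, fy, cx, cy = intr
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform: rotation (3x3 nested lists) then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, intr):
    """Project a 3D point to pixel coordinates; None if behind the camera."""
    fx, fy, cx, cy = intr
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def projected_anchor_coords(u, v, depth_anchors, key_intr, query_intr,
                            rotation, translation):
    """Pixel coordinates in the query image for each depth anchor on the key ray."""
    coords = []
    for depth in depth_anchors:
        p_key = backproject(u, v, depth, key_intr)
        p_query = transform(p_key, rotation, translation)
        coords.append(project(p_query, query_intr))
    return coords


intr = (500.0, 500.0, 320.0, 240.0)          # hypothetical shared intrinsics
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = projected_anchor_coords(400.0, 300.0, [2.0, 4.0], intr, intr,
                                  identity, (1.0, 0.0, 0.0))
print(shifted)
```

With an identity pose every anchor projects back to the original pixel, while a translated query camera spreads the anchors along a line in the image. That parallax is the information a cross-view position embedding can exploit, and because only the relative pose enters the computation, the result does not depend on the choice of global coordinate frame.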

110. 【2604.18745】DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation

Link: https://arxiv.org/abs/2604.18745

Authors: Enrique Hernandez Noguera,Md Meftahul Ferdaus,Elias Ioup,Mahdi Abdelguerfi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extreme class imbalance, precise boundary delineation, visual inspection imagery, inspection imagery remains, imagery remains challenging

Comments:

Click to view abstract

Abstract:Automated segmentation of structural defects from visual inspection imagery remains challenging due to the diversity of damage types, extreme class imbalance, and the need for precise boundary delineation. This paper presents DeltaSeg, a U-shaped encoder-decoder architecture with a tiered attention strategy that integrates Squeeze-and-Excitation (SE) channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention (DDA) mechanism in the skip connections. The encoder uses depthwise separable convolutions with dilated stages to maintain spatial resolution while expanding the receptive field. Atrous Spatial Pyramid Pooling (ASPP) at the bottleneck captures multi-scale context. The DDA module refines skip connections through a dual-path scheme combining a learned delta operator for nuisance feature suppression with spatial attention gates conditioned on decoder signals. Deep supervision through multi-scale auxiliary heads further strengthens gradient flow and encourages semantically meaningful features at intermediate decoder stages. We evaluate DeltaSeg on two datasets: the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (CSDD, 9 classes). Across both benchmarks, DeltaSeg consistently outperforms 12 competing architectures including U-Net, SA-UNet, UNet3+, SegFormer, Swin-UNet, EGE-UNet, FPN, and Mobile-UNETR, demonstrating strong generalization across damage types, imaging conditions, and structural geometries.

111. 【2604.18744】Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

Link: https://arxiv.org/abs/2604.18744

Authors: Ruijun Zhang,Hang Su,Kostas Daniilidis,Ziyun Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently shown promising, shown promising capabilities, instantaneous motion estimation, motion estimation due, cameras have recently

Comments:

Click to view abstract

Abstract:Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: this https URL.

112. 【2604.18740】Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

Link: https://arxiv.org/abs/2604.18740

Authors: Jay Jung,Ahmad Arrabi,Jax Luo,Scott Raymond,Safwan Wshah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated C-arm positioning, requiring emergent interventions, ensures timely treatment, patients requiring emergent, Automated C-arm

Comments: Accepted at IJCARS: IPCAI 2026. Int J CARS (2026)

Click to view abstract

Abstract:Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at https://github.com/marszzibros/Cthis http URL

113. 【2604.18725】Colour Extraction Pipeline for Odonates using Computer Vision

Link: https://arxiv.org/abs/2604.18725

Authors: Megan Mirnalini Sundaram Rajaraman,Fons J. Verbeek,Vincent J. Kalkman,Rita Pucci

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: insect morphological traits, studies remain limited, physiological studies, studies remain, species' morphological traits

Comments: 18 pages (excluding references), 12 figures; to be submitted to NCCV 2026

Click to view abstract

Abstract:The correlation between insect morphological traits and climate has been documented in physiological studies, but such studies remain limited by the time-consuming nature of the data analysis. In particular, open-source datasets often lack annotations of species' morphological traits, making dedicated annotation campaigns necessary; these efforts are typically local in scale and costly. In this paper, we propose a pipeline to identify and segment body parts of Odonates (dragonflies and damselflies) using deep neural networks, with the ultimate goal of extracting body parts' colouration. The pipeline is trained on a limited annotated dataset and refined with pseudo-supervised data. We show that, by using open-source images from citizen science platforms, our approach can segment each visible subject (Odonate) into head, thorax, abdomen, and wings and then extract a colour palette for each body part. This will enable large-scale statistical analysis of ecological correlations (e.g., between colouration and climate change, habitat loss, or geolocation), which are crucial for quantifying and assessing ecosystem biodiversity status.

114. 【2604.18713】Align then Refine: Text-Guided 3D Prostate Lesion Segmentation

Link: https://arxiv.org/abs/2604.18713

Authors: Cuiling Sun,Linkai Peng,Adam Murphy,Elif Keles,Hiten D. Patel,Ashley Ross,Frank Miller,Baris Turkbey,Andrea Mia Bejar,Halil Ertugrul Aktas,Gorkem Durak,Ulas Bagci

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reliable algorithmic analysis, precision remains challenging, achieving high precision, high precision remains, biparametric MRI

Comments: Accepted to EMBC 2026

Click to view abstract

Abstract:Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at this https URL.

115. 【2604.18648】DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

Link: https://arxiv.org/abs/2604.18648

Authors: Hang Yuan,Xiaolin Hu,Yan Wan,Menglin Gao,Wenzhe Yu,Cong Huang,Fei Xu,Qing Li,Christina Dan Wang,Zhou Yu,Kai Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Text-driven controllable dance, Text-driven controllable, articulating complex choreographies, generation remains under-explored, remains under-explored

Comments: 22 pages, 13 figures

Click to view abstract

Abstract:Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose Choreographic Syntax, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct DanceFlow, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce DanceCrafter, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.

116. 【2604.18632】StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

Link: https://arxiv.org/abs/2604.18632

Authors: Quanling Zhao,Meng'en Qin,Yanfeng Sun,Yuan Miao,Xiaohui Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

Keywords: reflecting environmental responses, regulating plant physiological, plant physiological processes, environmental responses, play a crucial

Comments:

Click to view abstract

Abstract:Stomata play a crucial role in regulating plant physiological processes and reflecting environmental responses. However, accurate and high-throughput stomatal phenotyping remains challenging, as conventional approaches rely on destructive sampling and manual annotation, restricting large-scale and field deployment. To overcome these limitations, a noninvasive restoration-detection integrated framework, termed StomaD2, is developed to achieve accurate and fast stomatal phenotyping under complex imaging conditions. The framework incorporates a diffusion-based restoration module to recover degraded images and a specialized rotated object detection network tailored to the small, dense, and cluttered characteristics of stomata. The proposed network enhances feature representation through three key innovations: a column-wise structure for global feature interaction, a context-aware resampling and reweighting mechanism to improve multi-scale consistency, and a feature reassembly module to boost discrimination against complex backgrounds. In extensive comparisons, StomaD2 demonstrated state-of-the-art performance. On public Maize and Wheat datasets, it achieved accuracies of 0.994 and 0.992, respectively, significantly outperforming existing benchmarks. When benchmarked against ten other advanced models, including Oriented Former and YOLOv12, StomaD2 achieved a top-tier F1-score/mAP of 0.989. The framework is integrated into a user-friendly, field-operable system that supports the fast extraction of eight stomatal phenotypes, such as density and conductance. Validated on more than 130 plant species, StomaD2's results highlight its strong generalizability and potential for large-scale phenotyping, plant physiology analysis, and precision agriculture applications.

117. 【2604.18627】Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses

Link: https://arxiv.org/abs/2604.18627

Authors: Maximilian Haug(1),Christian Stippel(2),Lukas Pscherer(3),Benjamin Schwendinger(1),Ralph Hoch(3 and 4),Angel Gaydarov(1),Sebastian Schlund(1),Thilo Sauter(4) ((1) Fraunhofer Austria Research GmbH, Vienna, Austria, (2) Computer Vision Lab, TU Wien, Vienna, Austria, (3) Digital Factory Vorarlberg GmbH, Dornbirn, Austria, (4) Institute of Computer Technology, TU Wien, Vienna, Austria)

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: autonomous mobile robots, feature mixed traffic, Ensuring human safety, Ensuring human, mobile robots

Comments: 5 pages, 2 figures

Click to view abstract

Abstract:Ensuring human safety is of paramount importance in warehouse environments that feature mixed traffic of human workers and autonomous mobile robots (AMRs). Current approaches often treat humans as generic dynamic obstacles, leading to conservative AMR behaviors like slowing down or detouring, even when workers are fully aware and capable of safely sharing space. This paper presents a real-time vision-based method to estimate human awareness of an AMR using a single RGB camera. We integrate state-of-the-art 3D human pose lifting with head orientation estimation to ascertain a human's position relative to the AMR and their viewing cone, thereby determining if the human is aware of the AMR. The entire pipeline is validated using synthetically generated data within NVIDIA Isaac Sim, a robust physics-accurate robotics simulation environment. Experimental results confirm that our system reliably detects human positions and their attention in real time, enabling AMRs to safely adapt their motion based on human awareness. This enhancement is crucial for improving both safety and operational efficiency in industrial and factory automation settings.
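
The awareness decision described above (is the AMR inside the human's viewing cone?) reduces to an angle test between the head direction and the direction from the human to the robot. A minimal sketch under assumptions: the 60-degree half-angle, the function name, and the use of plain 3D tuples are illustrative and are not taken from the paper, which does not state its cone parameters in the abstract.

```python
import math


def is_aware(human_pos, head_dir, robot_pos, half_angle_deg=60.0):
    """True if the robot lies within the viewing cone around the head direction.

    half_angle_deg is a hypothetical cone half-angle, not a value from the paper.
    """
    to_robot = [r - h for r, h in zip(robot_pos, human_pos)]
    dist = math.sqrt(sum(c * c for c in to_robot))
    norm = math.sqrt(sum(c * c for c in head_dir))
    if dist == 0.0 or norm == 0.0:
        return False  # direction undefined; treat as not aware
    cos_angle = sum(a * b for a, b in zip(to_robot, head_dir)) / (dist * norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding drift
    return math.degrees(math.acos(cos_angle)) <= half_angle_deg


human = (0.0, 0.0, 0.0)
facing = (1.0, 0.0, 0.0)
print(is_aware(human, facing, (3.0, 0.0, 0.0)))   # robot straight ahead
print(is_aware(human, facing, (-3.0, 0.0, 0.0)))  # robot directly behind
```

In a pipeline like the one described, `human_pos` and `head_dir` would come from the 3D pose lifting and head orientation estimators, and the boolean result would gate whether the AMR slows down or detours.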

118. 【2604.18623】Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

Link: https://arxiv.org/abs/2604.18623

Authors: Xin Hu,Ke Qin,Wen Yin,Yuan-Fang Li,Ming Li,Tao He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scene Graph Generation, unifies object localization, visual relationship reasoning, Graph Generation, Scene Graph

Comments:

Click to view abstract

Abstract:Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
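
The continuous half of the formulation above (a velocity field that transports noised boxes toward ground-truth boxes) can be illustrated with the common straight-line flow-matching setup. This is a sketch under assumptions: the abstract does not say which probability path FlowSG uses, so the linear interpolation, the function names, and the oracle velocity standing in for the graph Transformer are all illustrative, and the discrete-token branch is omitted.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)


def euler_sample(x0, velocity_fn, steps=4):
    """Few-step Euler integration of dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise_box = [0.9, 0.1, 0.3, 0.7]    # hypothetical noised box (cx, cy, w, h)
target_box = [0.5, 0.5, 0.2, 0.4]   # hypothetical ground-truth box
oracle = lambda x, t: target_velocity(noise_box, target_box)
print(euler_sample(noise_box, oracle, steps=4))
```

With a perfect velocity field the straight path is recovered exactly in any number of steps, which is the intuition behind the few-step inference claimed in the abstract; a learned field only approximates this.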

119. 【2604.18557】SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy

Link: https://arxiv.org/abs/2604.18557

Authors: Wei Yao,Haohan Ma,Hongwen Zhang,Yunlian Sun,Liangjun Xing,Zhile Yang,Yuanjun Guo,Yebin Liu,Jinhui Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: severe data scarcity, cooperative humanoid manipulation, Controllable cooperative humanoid, embodied intelligence, due to severe

Comments:

Click to view abstract

Abstract:Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: this http URL

120. 【2604.18721】A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation

Link: https://arxiv.org/abs/2604.18721

Authors: Nichula Wasalathilaka,Dineth Perera,Oshadha Samarakoon,Buddhi Wijenayake,Roshan Godaliyadda,Vijitha Herath,Parakrama Ekanayake

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: advantages remain unclear, existing studies rarely, studies rarely isolate, Visual state-space models, rarely isolate encoder

Comments: 5 pages, 3 figures, Accepted for publication at IEEE IGARSS 2026

Click to view abstract

Abstract:Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.