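This blog post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

699 papers were updated today, including:

Natural language processing: 139 papers
Information retrieval: 22 papers
Computer vision: 128 papers

Natural Language Processing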

1. 【2604.07343】Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Link: https://arxiv.org/abs/2604.07343

Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Pluralistic alignment, development of Large, Language Models

Comments:

Abstract:Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
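
As a rough illustration of the pairwise setup described in the abstract above, a reward model can be scored by how often it ranks the chosen response above the rejected one. The sketch below assumes this is how the reported accuracy is computed, which the abstract does not spell out; `pairwise_accuracy`, `toy_score`, and the toy pairs are hypothetical names and data, not part of the benchmark.

```python
from typing import Callable, Sequence, Tuple

# (prompt, chosen response, rejected response)
Pair = Tuple[str, str, str]


def pairwise_accuracy(
    pairs: Sequence[Pair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer rates chosen strictly above rejected."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: prefers shorter responses."""
    return -float(len(response))


toy_pairs = [
    ("q1", "short", "a much longer answer"),         # scorer agrees with the label
    ("q2", "a rather long chosen answer", "brief"),  # scorer disagrees
]
print(pairwise_accuracy(toy_pairs, toy_score))  # prints 0.5
```

A real evaluation would replace `toy_score` with a call to the reward model under test.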

2. 【2604.07338】Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Link: https://arxiv.org/abs/2604.07338

Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: improved image captioning, Recent advances, advances in vision-language, improved image, image captioning

Comments:

Abstract:Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.

3. 【2604.07320】Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Link: https://arxiv.org/abs/2604.07320

Authors: Jackson Petty, Jaulie Goe, Tal Linzen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Low-resource languages pose, require large amounts, Low-resource languages, require large, large amounts

Comments:

Abstract:Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs' ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages' grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs' translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, we examine the types of errors committed by models and find they are most prone to recall the wrong words from the target language vocabulary, hallucinate new words, or leave source-language words untranslated.

4. 【2604.07296】OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Link: https://arxiv.org/abs/2604.07296

Authors: Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang, Zhiliang Zhu, Yijun Yang, Shenghe Zheng, Nan Jiang, Jiaxiu Jiang, Haoyang Huang, Tien-Tsin Wong, Nan Duan, Xiaojuan Qi

Categories: Computation and Language (cs.CL)

Keywords: cornerstone of human-level, Spatial, data, Spatial understanding, fundamental cornerstone

Comments:

Abstract:Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial -- an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average improvement of 19 percent, relatively. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.

5. 【2604.07285】Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation

Link: https://arxiv.org/abs/2604.07285

Authors: Songhee Han

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords:

Comments:

None

6. 【2604.07274】A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

Link: https://arxiv.org/abs/2604.07274

Authors: Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: limited factual grounding, Large language models, purely parametric models, Large language, demonstrated strong capabilities

Comments:

Abstract:Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration, dense retrieval with query reformulation and reranking, achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.

7. 【2604.07272】ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection

Link: https://arxiv.org/abs/2604.07272

Authors: Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik

Categories: Computation and Language (cs.CL)

Keywords: crafted to mislead, maximize engagement, poses a significant, mislead and maximize, Adaptive Fusion Block

Comments:

Abstract:The widespread use of clickbait headlines, crafted to mislead and maximize engagement, poses a significant challenge to online credibility. These headlines employ sensationalism, misleading claims, and vague language, underscoring the need for effective detection to ensure trustworthy digital content. The paper introduces ClickGuard, a trustworthy adaptive fusion framework for clickbait detection. It combines BERT embeddings and structural features using a Syntactic-Semantic Adaptive Fusion Block (SSAFB) for dynamic integration. The framework incorporates a hybrid CNN-BiLSTM to capture patterns and dependencies. The model achieved 96.93% testing accuracy, outperforming state-of-the-art approaches. The model's trustworthiness is evaluated using LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods assess the model's robustness and sensitivity to feature changes by measuring the average prediction variation. Ablation studies validated the SSAFB's effectiveness in optimizing feature fusion. The model demonstrated robust performance across diverse datasets, providing a scalable, reliable solution for enhancing online content credibility by addressing syntactic-semantic modelling challenges. Code of the work is available at: this https URL

8. 【2604.07269】Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent

Link: https://arxiv.org/abs/2604.07269

Authors: Bingxuan Li, Simo Du, Yue Guo

Categories: Computation and Language (cs.CL)

Keywords: acquiring medical knowledge, acquiring medical, Clinical expertise improves, SEA, Clinical expertise

Comments:

Abstract:Clinical expertise improves not only by acquiring medical knowledge, but by accumulating experience that yields reusable diagnostic patterns. Recent LLM-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. We design a reinforcement training framework tailored to our designed agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. On standard evaluation with the MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. In the long-horizon setting with the ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated from SEA show strong clinical correctness, usefulness and trust, suggesting that the induced rules in the dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.

9. 【2604.07239】Efficient Learned Data Compression via Dual-Stream Feature Decoupling

Link: https://arxiv.org/abs/2604.07239

Authors: Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai

Categories: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)

Keywords: Learned Data Compression, efficiency remains challenging, Learned Data, achieved superior compression, system efficiency remains

Comments: Accepted to ACL 2026

Abstract:While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single-stream architectures struggle to simultaneously capture micro-syntactic and macro-semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl's Law due to serial processing. To this end, we propose a Dual-Stream Multi-Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream-Parallel Pipeline, which overcomes systemic bottlenecks to achieve full-pipeline parallelism. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at this https URL.
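
The abstract above attributes the throughput cap to Amdahl's Law. As a reminder of what that bound says, here is a minimal sketch of the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the parallelizable fraction of the work and s is the speedup of that fraction. The function name and the example values are illustrative and are not taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, workers: float) -> float:
    """Overall speedup when `parallel_fraction` of the work is sped up by `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers <= 0:
        raise ValueError("workers must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With half of the work serial, no number of workers gets past a 2x speedup.
for n in (2, 8, 1024):
    print(n, round(amdahl_speedup(0.5, n), 3))
```

This is why replacing serial stages with parallel streams, as the abstract describes, raises the ceiling more than adding faster devices does.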

10. 【2604.07238】On the Price of Privacy for Language Identification and Generation

Link: https://arxiv.org/abs/2604.07238

Authors: Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)

Keywords: sensitive user data, large language models, cost of privacy, varepsilon, user data

Comments:

Abstract:As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate $(\varepsilon, \delta)$-DP with constant $\varepsilon > 0$ recovers the non-private error rates: $\exp(-r(n))$ for identification (for any $r(n) = o(n)$) and $\exp(-\Omega(n))$ for generation. Under pure $\varepsilon$-DP, the exponents degrade by a multiplicative factor of $\min\{1, \varepsilon\}$, which we show is tight up to constants. Notably, for generation under pure DP with mild assumptions, the upper bound $\exp(-\min\{1,\varepsilon\} \cdot \Omega(n))$ matches the lower bound up to some constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a $\min\{1,\varepsilon\}$ factor in the exponent under pure DP.

11. 【2604.07236】How Much LLM Does a Self-Revising Agent Actually Need?

Link: https://arxiv.org/abs/2604.07236

Authors: Seongwoo Jeong, Seonil Son

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: place world modeling, language model loop, Recent LLM-based agents, single language model, Recent LLM-based

Comments: WIP

Abstract:Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent's competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards $\times$ 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism -- with prediction tracking, confidence gating, and guarded revision actions -- even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3\% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31$\rightarrow$29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly.
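
The win-rate figures quoted in the abstract above can be checked with simple arithmetic. The sketch below reuses only the numbers stated there (18 boards, 3 seeds, 31 and 29 wins); the variable names are illustrative.

```python
boards, seeds = 18, 3
games = boards * seeds  # 54 games in total

wins_without_llm, wins_with_llm = 31, 29  # before and after adding LLM revision
rate_without = wins_without_llm / games
rate_with = wins_with_llm / games
delta_pp = (rate_with - rate_without) * 100  # change in percentage points

print(f"{rate_without:.1%} -> {rate_with:.1%} ({delta_pp:+.1f} pp)")
# prints: 57.4% -> 53.7% (-3.7 pp)
```

So the reported drop of two wins corresponds to roughly 3.7 percentage points of win rate.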

Submission history: [v1] submitted by Seonil Son on Wed, 8 Apr 2026 16:02:04 UTC (60 KB)

12. 【2604.07223】TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Link: https://arxiv.org/abs/2604.07223

Authors: Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: primary vulnerability surface, vulnerability surface shifts, intermediate execution traces, autonomous agents, chatbots into autonomous

Comments:

Abstract:As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($\rho=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.

13. 【2604.07193】LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

Link: https://arxiv.org/abs/2604.07193

Authors: Kosmas Pinitas, Ilias Maglogiannis

Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)

Keywords: unconstrained environments remains, unconstrained environments, environments remains, remains a fundamental, fundamental challenge

Comments: This paper has been accepted at the CVPR 2026 Workshop on Affective Behavior Analysis in-the-wild (ABAW)

Abstract:Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures.

14. 【2604.07189】Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Link: https://arxiv.org/abs/2604.07189

Authors: Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing

Categories: Computation and Language (cs.CL)

Keywords: process demanding specialized, demanding specialized technical, specialized technical skills, construct queries, considerable time

Comments:

Abstract:Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results - a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only "investigate English intensifiers," the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) - Claridge (2025) and De Smet (2013) - with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

15. 【2604.07147】Dynamic Context Evolution for Scalable Synthetic Data Generation

Link: https://arxiv.org/abs/2604.07147

Authors: Ryan Lingo, Rajeev Chhajer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, produce repetitive output, term cross-batch mode, Large language, language models produce

Comments:

Abstract:Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.
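
To make the "semantic memory" mechanism in the abstract above more concrete, here is a minimal sketch of rejecting near-duplicates against a persistent set of embeddings. The abstract mentions an embedding index and a dedup threshold delta, but the cosine metric, the value 0.9, the linear scan, and the names `SemanticMemory` and `add_if_novel` are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticMemory:
    """Keeps embeddings of accepted items and rejects near-duplicates."""

    def __init__(self, delta: float = 0.9) -> None:
        self.delta = delta  # assumed similarity threshold for "near-duplicate"
        self.vectors: List[List[float]] = []

    def add_if_novel(self, embedding: Sequence[float]) -> bool:
        """Store the embedding and return True unless it is too close to a stored one."""
        if any(cosine(embedding, seen) >= self.delta for seen in self.vectors):
            return False
        self.vectors.append(list(embedding))
        return True


memory = SemanticMemory(delta=0.9)
print(memory.add_if_novel([1.0, 0.0]))    # first item is always novel
print(memory.add_if_novel([0.99, 0.01]))  # nearly the same direction, rejected
print(memory.add_if_novel([0.0, 1.0]))    # orthogonal, accepted
```

In practice the embeddings would come from a sentence-embedding model, and the memory would persist across batches so that later prompts can be checked against everything generated so far.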

16. 【2604.07123】Language Bias under Conflicting Information in Multilingual LLMs

Link: https://arxiv.org/abs/2604.07123

Authors: Robert Östling, Murathan Kurfalı

Categories: Computation and Language (cs.CL)

Keywords: integrating conflicting information, process of integrating, Large Language Models, answering questions, integrating conflicting

Comments:

Abstract:Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.

17. 【2604.07119】Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Link: https://arxiv.org/abs/2604.07119

Authors: Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, Erika Lombart, Marie-Catherine de Marneffe, Gözde Gül Şahin

Categories: Computation and Language (cs.CL)

Keywords: NLP publication process, Peer review plays, publication process, plays a central, central role

Comments: 21 pages, 10 figures, 9 tables

Abstract:Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.

18. 【2604.07116】Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Link: https://arxiv.org/abs/2604.07116

Authors: Elyas Irankhah, Samah Fodeh

Categories: Computation and Language (cs.CL)

Keywords: shared task, Claude Sonnet, task studies patient-authored, clinician-interpreted question reformulation, Abstract

Comments: 9 pages, 2 figures. System description for ArchEHR-QA 2026 shared task

Abstract:We describe the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and contains four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2-ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. Our experiments show three main findings. First, model diversity and ensemble voting consistently improve performance compared to single-model baselines. Second, the full clinician answer paragraph is provided as additional prompt context for evidence alignment. Third, results on the development set show that alignment accuracy is mainly limited by reasoning. The best scores on the development set reach 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.
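
The abstract above reports that ensemble voting improves over single-model baselines but does not describe the voting rule. As one plausible reading, the sketch below shows plain plurality voting over per-model labels, with ties broken in favor of the label seen first. The function name, the tie-break, and the example labels are all assumptions.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label that appears first."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label always has the top count")


# Hypothetical per-model judgments for one candidate evidence sentence.
votes = ["essential", "not-relevant", "essential", "essential"]
print(majority_vote(votes))  # prints essential
```

Weighted voting or per-subtask thresholds would fit the same interface by changing how `counts` is built.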

19. 【2604.07102】The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Link: https://arxiv.org/abs/2604.07102

Authors: Yongchao Wu, Aron Henriksson

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: personalize large language, settings remain unclear, English Language Arts, Activation-based steering, large language models

Comments:

Abstract:Activation-based steering can personalize large language models at inference time, but its effects in educational settings remain unclear. We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures. Persona steering lowers answer quality overall, with much larger effects on open-ended English Language Arts (ELA) prompts than on factual science prompts; interpretive and argumentative tasks are up to 11x more sensitive. On the scoring side, we observe predictable valence-aligned calibration shifts: evil and impolite scorers grade more harshly, while good and optimistic scorers grade more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and the Mixture-of-Experts model shows roughly 6x larger calibration shifts than the dense models. To our knowledge, this is the first study to systematically examine the effects of activation-steered persona traits in educational generation and scoring, and the results highlight the need for task-aware and architecture-aware calibration when deploying steered models in educational settings.

20. 【2604.07100】STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

Link: https://arxiv.org/abs/2604.07100

Authors: Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu, Chao Gao

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: user emotional state, Empathetic dialogue requires, Empathetic dialogue, context-sensitive decisions, modeling empathetic dialogue

Comments: Accepted by ACL 2026

Abstract:Empathetic dialogue requires not only recognizing a user's emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

21. 【2604.07098】Selective Neuron Amplification for Training-Free Task Enhancement

Link: https://arxiv.org/abs/2604.07098

Authors: Ryyan Akhtar

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language models, Large language, Selective Neuron Amplification, Large, model

Comments: 28 pages, 12 figures. Preprint. Code and experiments conducted independently

Abstract:Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification, which increases the influence of task relevant neurons without changing the model's parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain, while having low effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.
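
As a rough illustration of amplifying selected neurons at inference time without touching the weights, the sketch below scales chosen coordinates of a hidden-state array. How the paper selects neurons and sets the gain is not stated in the abstract; `amplify`, `neuron_ids`, and `gain` are hypothetical names used only for this example.

```python
import numpy as np

def amplify(hidden, neuron_ids, gain=1.5):
    """Scale the selected hidden units by `gain`; the input array is left unchanged."""
    out = hidden.copy()
    out[..., neuron_ids] *= gain
    return out

# Toy hidden states: 2 tokens x 4 hidden units (values are made up).
hidden = np.ones((2, 4))
boosted = amplify(hidden, neuron_ids=[1, 3], gain=2.0)
print(boosted)
```

Because the function returns a modified copy, the model parameters and the original activations stay intact, which matches the abstract's description of a non-permanent, inference-time intervention.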

22. 【2604.07095】Multilingual Embedding Probes Fail to Generalize Across Learner Corpora

Link: https://arxiv.org/abs/2604.07095

Authors: Laurits Lyngbaek, Ross Deans Kristensen-McLachlan

Subjects: Computation and Language (cs.CL)

Keywords: predict CEFR, predict CEFR proficiency, CEFR proficiency levels, proficiency, language-general representation

Comments:

Abstract:Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance ($QWK\approx0.7$), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.
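
For readers unfamiliar with the QWK score quoted above, here is a minimal sketch of quadratic weighted kappa for ordinal labels such as the six CEFR levels (A1 to C2 mapped to 0 to 5). This is the standard textbook definition, not code from the paper; the function name and the toy labels are assumptions.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - sum(W * O) / sum(W * E) with quadratic disagreement weights."""
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected counts under independence of the two marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy example with made-up gold and predicted levels.
gold = [0, 1, 2, 3, 4, 5]
pred = [0, 1, 2, 3, 4, 4]
print(quadratic_weighted_kappa(gold, pred, n_classes=6))
```

Because the weights grow with the squared distance between levels, predicting B1 for a B2 text is penalized far less than predicting A1, which is why QWK is a common choice for ordinal proficiency scales.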

23. 【2604.07067】Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English

Link: https://arxiv.org/abs/2604.07067

Authors: Iza Škrjanec, Irene Elisabeth Winther, Vera Demberg, Stefan L. Frank

Subjects: Computation and Language (cs.CL)

Keywords: shared surface form, surface form, Bilingual, friends, models

Comments:

Abstract:Bilingual speakers show cross-lingual activation during reading, especially for words with shared surface form. Cognates (friends) typically lead to facilitation, whereas interlingual homographs (false friends) cause interference or no effect. We examine whether cross-lingual activation in bilingual language models mirrors these patterns. We train Dutch-English causal Transformers under four vocabulary-sharing conditions that manipulate whether (false) friends receive shared or language-specific embeddings. Using psycholinguistic stimuli from bilingual reading studies, we evaluate the models through surprisal and embedding similarity analyses. The models largely maintain language separation, and cross-lingual effects arise primarily when embeddings are shared. In these cases, both friends and false friends show facilitation relative to controls. Regression analyses reveal that these effects are mainly driven by frequency rather than consistency in form-meaning mapping. Only when just friends share embeddings are the qualitative patterns of bilinguals reproduced. Overall, bilingual language models capture some cross-linguistic activation effects. However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.

24. 【2604.07066】SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)

Link: https://arxiv.org/abs/2604.07066

Authors: Liang-Chih Yu, Jonas Becker, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Lung-Hao Lee, Ying-Lung Lin, Jin Wang, Jan Philip Wahle, Terry Ruas, Natalia Loukachevitch, Alexander Panchenko, Ilseyar Alimova, Lilian Wanzare, Nelson Odhiambo, Bela Gipp, Kai-Wei Chang, Saif M. Mohammad

Subjects: Computation and Language (cs.CL)

Keywords: improves traditional ABSA, categorical polarity labels, Aspect-Based Sentiment Analysis, dimensional aspect sentiment, traditional ABSA

Comments:

Abstract:We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence-arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.

25. 【2604.07057】IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text

Link: https://arxiv.org/abs/2604.07057

Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja

Subjects: Computation and Language (cs.CL)

Keywords: Existing Indonesian sentiment, Existing Indonesian, statement is positive, topical context, Indonesian sentiment

Comments: 8 pages, 5 tables, and 2 figures

Abstract:Existing Indonesian sentiment analysis models classify text in isolation, ignoring the topical context that often determines whether a statement is positive, negative, or neutral. We introduce IndoBERT-Sentiment, a context-conditioned sentiment classifier that takes both a topical context and a text as input, producing sentiment predictions grounded in the topic being discussed. Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics, the model achieves an F1 macro of 0.856 and accuracy of 88.1%. In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points. We show that context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables the model to correctly classify texts that are systematically misclassified by context-free approaches.

26. 【2604.07054】Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Link: https://arxiv.org/abs/2604.07054

Authors: Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang

Subjects: Computation and Language (cs.CL)

Keywords: large language models, dialogues require multi-turn, Sales dialogues require, goal-directed persuasion, asymmetric incentives

Comments:

Abstract:Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performance LLMs are competitive with human-level performance while the less capable ones are worse than human. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.

27. 【2604.07036】ReDAct: Uncertainty-Aware Deferral for LLM Agents

Link: https://arxiv.org/abs/2604.07036

Authors: Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov, Timothy Baldwin, Preslav Nakov, Roman Vashurin, Maxim Panov

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: including complex sequential, complex sequential decision-making, sequential decision-making problems, including complex, increasingly popular

Comments:

Abstract:Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.
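
The deferral rule described in this abstract can be sketched as follows. The abstract does not say which uncertainty measure is used, so Shannon entropy over the small model's action distribution is an assumption, as are the names `choose_action`, `large_model`, and the threshold value.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_action(small_probs, actions, large_model, threshold):
    """Use the small model unless its uncertainty exceeds the threshold.

    Returns (action, deferred), where `deferred` records whether the
    large model was called.
    """
    if entropy(small_probs) > threshold:
        return large_model(), True
    best = max(range(len(actions)), key=lambda i: small_probs[i])
    return actions[best], False

# Hypothetical usage: a confident small model keeps control of the step.
actions = ["open drawer", "go to desk", "take mug"]
print(choose_action([0.97, 0.02, 0.01], actions, lambda: "go to desk", threshold=0.5))
```

In practice the threshold would be calibrated on held-out trajectories so that only a target fraction of steps (the abstract reports about 15%) is routed to the large model.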

28. 【2604.07035】Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

Link: https://arxiv.org/abs/2604.07035

Authors: Md Motaleb Hossen Manik, Ge Wang

Subjects: Computation and Language (cs.CL)

Keywords: realistic inference constraints, activated per token, behavior under realistic, expected to offer, offer better quality-efficiency

Comments:

Abstract:Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks -- ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 -- under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.

29. 【2604.07028】Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Link: https://arxiv.org/abs/2604.07028

Authors: Philipp D. Siedler

Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: game-theoretic models abstract, Large Language Model, Strategic Courtroom Framework, models abstract, operate through discourse

Comments:

Abstract:Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7,000 simulated trials using DeepSeek-R1 and Gemini 2.5 Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

30. 【2604.07023】MARS: Enabling Autoregressive Models Multi-Token Generation

Link: https://arxiv.org/abs/2604.07023

Authors: Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun

Subjects: Computation and Language (cs.CL)

Keywords: language models generate, models generate text, earlier context, generate text, highly predictable

Comments: 15 pages, 4 figures

Abstract:Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
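
The confidence-thresholding knob mentioned at the end of this abstract might work along the following lines. This is only a sketch under stated assumptions: the acceptance rule (keep the leading run of predictions whose confidence clears the threshold, and always emit at least one token) is not confirmed by the abstract, and `accept_tokens` is a hypothetical helper.

```python
def accept_tokens(candidates, threshold):
    """Accept a prefix of the tokens predicted in one forward pass.

    candidates: list of (token, confidence) pairs for consecutive positions.
    The first token is always kept, so the worst case is one token per
    step, as in ordinary autoregressive decoding.
    """
    accepted = [candidates[0][0]]
    for token, confidence in candidates[1:]:
        if confidence < threshold:
            break
        accepted.append(token)
    return accepted

# Made-up predictions for four consecutive positions.
step = [("the", 0.99), ("cat", 0.95), ("sat", 0.40), ("on", 0.99)]
print(accept_tokens(step, threshold=0.9))
```

Lowering the threshold accepts more tokens per step and raises throughput at some risk to quality, while a threshold above 1.0 falls back to one token per step.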

31. 【2604.07015】Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

Link: https://arxiv.org/abs/2604.07015

Authors: Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Graham Ranger, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido, Elvys Linhares-Pontes, Luis-Gil Moreno-Jiménez

Subjects: Computation and Language (cs.CL)

Keywords: Natural Language Processing, limited computational resources, Language Processing, Large Language Models, Natural Language

Comments: 8 pages, 1 figure, 1 table

Abstract:In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or $\pi$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic $\pi$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $\pi$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.

32. 【2604.07012】DTCRS: Dynamic Tree Construction for Recursive Summarization

Link: https://arxiv.org/abs/2604.07012

Authors: Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, incorporating external knowledge, Retrieval-Augmented Generation, Large Language

Comments:

Abstract:Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.
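
Using sub-question embeddings as initial cluster centers can be sketched with a small k-means loop. The abstract does not name the clustering algorithm, so k-means is an assumption, and the function name and toy 2-D "embeddings" are hypothetical.

```python
import numpy as np

def seeded_kmeans(chunks, seeds, n_iter=10):
    """Cluster chunk embeddings, starting from query-derived centers.

    chunks: (n_chunks, dim) array of text-chunk embeddings.
    seeds:  (n_clusters, dim) array of sub-question embeddings.
    """
    centers = seeds.astype(float).copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every chunk to every center.
        dists = ((chunks[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            members = chunks[labels == k]
            if len(members):  # keep the seed if no chunk was assigned
                centers[k] = members.mean(axis=0)
    return labels, centers

# Toy 2-D embeddings: two obvious groups and one seed near each.
chunks = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
seeds = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, centers = seeded_kmeans(chunks, seeds)
print(labels)
```

Seeding the centers from the query ties the number of clusters to the number of sub-questions, which is one plausible way the method limits redundant summary nodes.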

33. 【2604.07006】Continuous Interpretive Steering for Scalar Diversity

Link: https://arxiv.org/abs/2604.07006

Authors: Ye-eun Cho

Subjects: Computation and Language (cs.CL)

Keywords: Pragmatic, graded, Scalar, Steering, Pragmatic inference

Comments:

Abstract:Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations. Beyond prompt-level effects, this study introduces Continuous Interpretive Steering (CIS), a method that probes graded pragmatic interpretation by treating activation-level steering strength as a continuous experimental variable. To support this analysis, this study introduces a new dataset, GraSD, which encodes graded scalar diversity. Experiments on four LLMs show that uniform activation steering increases pragmatic interpretations globally but collapses item-level variation, whereas graded activation steering yields differentiated interpretive shifts aligned with scalar diversity grades. It indicates that graded sensitivity is encoded in the representation space and can be systematically recovered through controlled intervention. Together, CIS and GraSD provide a principled framework for evaluating graded pragmatic sensitivity in LLMs.
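
Treating steering strength as a continuous variable can be illustrated with the usual additive form of activation steering, where a scaled direction vector is added to a hidden state. How CIS derives its steering direction and which layers it targets are not given in the abstract; `steer`, `direction`, and the alpha grid below are assumptions.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift a hidden state by `alpha` units along a normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Sweep the strength as a continuous experimental variable (toy values).
hidden = np.zeros(2)
direction = np.array([3.0, 4.0])
for alpha in (0.0, 2.5, 5.0):
    print(alpha, steer(hidden, direction, alpha))
```

Uniform steering would apply one alpha to every item, whereas graded steering would choose alpha per item, for instance as a function of its scalar diversity grade.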

34. 【2604.06997】ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Link: https://arxiv.org/abs/2604.06997

Authors: Yihao Wang, Zijian He, Jie Ren, Keze Wang

Subjects: Computation and Language (cs.CL)

Keywords: language models access, retrieval-augmented generation, Classical Chinese annals, shapes how language, language models

Comments: 24 pages, 11 figures. To appear in Findings of ACL 2026

Abstract:Retrieval shapes how language models access and ground knowledge in retrieval-augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non-Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition. ChunQiuTR organizes records by month-level reign keys and includes chrono-near confounders that mirror realistic retrieval failures. We further propose CTD (Calendrical Temporal Dual-encoder), a time-aware dual-encoder that combines Fourier-based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at this https URL.

35. 【2604.06996】Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

Link: https://arxiv.org/abs/2604.06996

Authors: José Pombal, Ricardo Rei, André F. T. Martins

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluating LLM outputs, evaluating LLM, LLM outputs, facto approach, approach for evaluating

Comments:

Abstract:LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.

36. 【2604.06906】The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era

Link: https://arxiv.org/abs/2604.06906

Authors: Rudra Jadhav, Janhavi Danve

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large Language Models, Language Models reshape, global labor market, Large Language, High Displacement Risk

Comments: 11 pages, 12 figures, 2 tables, 17 references. Code and data available at

Abstract:As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix -- an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a "capability-demand inversion" where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.

37. 【2604.06903】Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

Link: https://arxiv.org/abs/2604.06903

Authors: Aidan Mannion, Cécile Macaire, Armand Violle, Stéphane Ohayon, Xavier Tannier, Didier Schwab, Lorraine Goeuriot, François Portet

Subjects: Computation and Language (cs.CL)

Keywords: fields remains challenging, demonstrated remarkable capabilities, specialized fields remains, Large language models, French biomedical

Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.

38. 【2604.06902】iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

链接：https://arxiv.org/abs/2604.06902

作者：Wenshuo Wang,Boyu Cao,Nan Zhuang,Wei Li

类目：Computation and Language (cs.CL)

关键词：causal graph annotation, causally annotated text, causal graph, high annotation costs, graph annotation accuracy

备注： Accepted at ACL 2026

点击查看摘要

Abstract:A fundamental obstacle to causal discovery from text is the lack of causally annotated text data for use as ground truth, due to high annotation costs. This motivates an important task of generating text with causal graph annotations. Early template-based generation methods sacrifice text naturalness in exchange for high causal graph annotation accuracy. Recent Large Language Model (LLM)-dependent methods directly generate natural text from target graphs through LLMs, but do not guarantee causal graph annotation accuracy. Therefore, we propose iTAG, which performs real-world concept assignment to nodes before converting causal graphs into text in existing LLM-dependent methods. iTAG frames this process as an inverse problem with the causal graph as the target, iteratively examining and refining concept selection through Chain-of-Thought (CoT) reasoning so that the induced relations between concepts are as consistent as possible with the target causal relationships described by the causal graph. iTAG demonstrates both extremely high annotation accuracy and naturalness across extensive tests, and the results of testing text-based causal discovery algorithms with the generated data show high statistical correlation with real-world data. This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.

39. 【2604.06871】Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

链接：https://arxiv.org/abs/2604.06871

作者：Bajian Xiang,Tingwei Guo,Xuan Chen,Yang Han

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Speech Language, Speech Language Models, Language Models, incurring prohibitive inference, prohibitive inference costs

备注： Accepted to ACL 2026 (Findings)

点击查看摘要

Abstract:Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy. Practical deployment further confirms significant efficiency gains, yielding up to ~1.7x memory savings and ~1.1x faster time-to-first-token on long utterances. Our results challenge the necessity of fully distinct token representations, providing new perspectives on LSLM efficiency.
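
To make the idea of similarity-based token merging concrete, here is a minimal sketch. It is not the authors' Affinity Pooling: the abstract does not specify the grouping rule, so the function name `merge_adjacent`, the cosine-similarity criterion, the greedy run grouping, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def _mean(group: List[Vector]) -> Vector:
    """Element-wise average of a non-empty list of equal-length vectors."""
    return [sum(col) / len(group) for col in zip(*group)]


def merge_adjacent(tokens: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Greedily merge runs of adjacent tokens (hypothetical rule).

    A token joins the current run when its cosine similarity to the run's
    first token is at least `threshold`; each run is replaced by its mean.
    """
    merged: List[Vector] = []
    group: List[Vector] = []
    for tok in tokens:
        if group and cosine(group[0], tok) >= threshold:
            group.append(tok)
        else:
            if group:
                merged.append(_mean(group))
            group = [tok]
    if group:
        merged.append(_mean(group))
    return merged


# Toy sequence: two near-duplicate tokens followed by two different near-duplicates.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
merged = merge_adjacent(tokens)
print(len(tokens), "->", len(merged))
```

No training is involved: the sequence is shortened purely from the similarity of existing representations, which is the property the abstract describes as "training-free".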

40. 【2604.06863】Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings

链接：https://arxiv.org/abs/2604.06863

作者：Mingchen Li,Wajdi Aljedaani,Yingjie Liu,Navyasri Meka,Xuan Lu,Xinyue Ye,Junhua Ding,Yunhe Feng

类目：Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词：fostering personal identity, Large Language Models, online communication, crucial for fostering, fostering personal

备注： Accepted at WWW'26

点击查看摘要

Abstract:Skin-toned emojis are crucial for fostering personal identity and social inclusion in online communication. As AI models, particularly Large Language Models (LLMs), increasingly mediate interactions on web platforms, the risk that these systems perpetuate societal biases through their representation of such symbols is a significant concern. This paper presents the first large-scale comparative study of bias in skin-toned emoji representations across two distinct model classes. We systematically evaluate dedicated emoji embedding models (emoji2vec, emoji-sw2v) against four modern LLMs (Llama, Gemma, Qwen, and Mistral). Our analysis first reveals a critical performance gap: while LLMs demonstrate robust support for skin tone modifiers, widely-used specialized emoji models exhibit severe deficiencies. More importantly, a multi-faceted investigation into semantic consistency, representational similarity, sentiment polarity, and core biases uncovers systemic disparities. We find evidence of skewed sentiment and inconsistent meanings associated with emojis across different skin tones, highlighting latent biases within these foundational models. Our findings underscore the urgent need for developers and platforms to audit and mitigate these representational harms, ensuring that AI's role on the web promotes genuine equity rather than reinforcing societal biases.

41. 【2604.06854】To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models

链接：https://arxiv.org/abs/2604.06854

作者：Ane G. Domingo-Aldama,Iker De La Iglesia,Maitane Urruela,Aitziber Atutxa,Ander Barrena

类目：Computation and Language (cs.CL)

关键词：Recent studies, specialized clinical adaptation, domain-adapted large language, clinical, studies have shown

备注：

点击查看摘要

Abstract:BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical adaptation. METHODS: We systematically compare general and clinical LLMs on a diverse set of multiple-choice clinical question answering tasks in English and Spanish. We introduce a perturbation-based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations. Our evaluation includes one-step and two-step question transformations, multi-prompt testing, and instruction-guided assessment. We analyze a range of state-of-the-art clinical models and their general-purpose counterparts, focusing on Llama 3.1-based models. Additionally, we introduce Marmoka, a family of lightweight 8B-parameter clinical LLMs for English and Spanish, developed via continual domain-adaptive pretraining on medical corpora and instructions. RESULTS: The experiments show that clinical LLMs do not consistently outperform their general-purpose counterparts on English clinical tasks, even under the proposed perturbation-based benchmark. However, on the Spanish subsets, the proposed Marmoka models obtain better results than Llama. CONCLUSIONS: Our results show that, under current short-form MCQA benchmarks, clinical LLMs offer only marginal and unstable improvements over general-purpose models in English, suggesting that existing evaluation frameworks may be insufficient to capture genuine medical expertise. We further find that both general and clinical models exhibit substantial limitations in instruction following and strict output formatting. Finally, we demonstrate that robust medical LLMs can be successfully developed for low-resource languages such as Spanish, as evidenced by the Marmoka models.

42. 【2604.06846】MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

链接：https://arxiv.org/abs/2604.06846

作者：Xiaotian Luo,Xun Jiang,Jiangcheng Wu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Interactive medical dialogue, single ungraded axis, Interactive medical, apply adversarial behaviors, reduce patient non-cooperation

备注： 9 pages, 4 figures, 9 tables. Preprint

点击查看摘要

Abstract:Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis, and none analyze cross-dimension interactions. We introduce MedDialBench, a benchmark enabling controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. It decomposes patient behavior into five dimensions -- Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude -- each with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection. Evaluating five frontier LLMs across 7,225 dialogues (85 cases x 17 configurations x 5 models), we find a fundamental asymmetry: information pollution (fabricating symptoms) produces 1.7-3.4x larger accuracy drops than information deficit (withholding information), and fabricating is the only configuration achieving statistical significance across all five models (McNemar p < 0.05). Among six dimension combinations, fabricating is the sole driver of super-additive interaction: all three fabricating-involving pairs produce O/E ratios of 0.70-0.81 (35-44% of eligible cases fail under the combination despite succeeding under each dimension alone), while all non-fabricating pairs show purely additive effects (O/E ~ 1.0). Inquiry strategy moderates deficit but not pollution: exhaustive questioning recovers withheld information, but cannot compensate for fabricated inputs. Models exhibit distinct vulnerability profiles, with worst-case drops ranging from 38.8 to 54.1 percentage points.
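
For readers unfamiliar with the significance test named in the abstract, the sketch below shows a standard exact (binomial) McNemar test on paired outcomes. The abstract does not say which variant the authors used, so the exact form is an assumption, and the function name `mcnemar_exact` and the discordant counts in the example are made up for illustration. The first line only re-checks the dialogue count stated in the abstract.

```python
from math import comb

# Sanity check of the factorial design size stated in the abstract.
total_dialogues = 85 * 17 * 5  # cases x configurations x models
print(total_dialogues)  # 7225


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs correct under the baseline but wrong under the perturbation.
    c: pairs wrong under the baseline but correct under the perturbation.
    Under the null hypothesis, the discordant pairs split 50/50, so the
    smaller count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 9 cases flip from correct to wrong, 1 flips the other way.
p_value = mcnemar_exact(9, 1)
print(p_value)
```

Only the discordant pairs enter the test. Cases that a model gets right (or wrong) under both conditions carry no information about the effect of the perturbation.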

43. 【2604.06845】HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

链接：https://arxiv.org/abs/2604.06845

作者：Yijie Zhong,Yunfan Gao,Haofen Wang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：critical for dialogue, dialogue systems, systems that support, support continuous

备注： Accepted by TheWebConf 2026

点击查看摘要

Abstract:Long-term memory is critical for dialogue systems that support continuous, sustainable, and personalized interactions. However, existing methods rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, leading to limited adaptability across query categories and high computational overhead. In this paper, we propose HingeMem, a boundary-guided long-term memory that operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any such element changes, HingeMem draws a boundary and writes the current segment, thereby reducing redundant operations and preserving salient context. To enable robust and efficient retrieval under diverse information needs, HingeMem introduces query-adaptive retrieval mechanisms that jointly decide (a) what to retrieve: determine the query-conditioned routing over the element-indexed memory; (b) how much to retrieve: control the retrieval depth based on the estimated query type. Extensive experiments across LLM scales (from 0.6B to production-tier models; e.g., Qwen3-0.6B to Qwen-Flash) on LOCOMO show that HingeMem achieves approximately 20% relative improvement over strong baselines without query categories specification, while reducing computational cost (68% lower question answering token cost compared to HippoRAG2). Beyond advancing memory modeling, HingeMem's adaptive retrieval makes it a strong fit for web applications requiring efficient and trustworthy memory over extended interactions.

44. 【2604.06834】On the Step Length Confounding in LLM Reasoning Data Selection

链接：https://arxiv.org/abs/2604.06834

作者：Bing Wang,Rui Miao,Chen Shen,Shaotian Yan,Kaiyuan Liu,Ximing Li,Xiaosong Yuan,Sinan Fan,Jun Zhang,Jieping Ye

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, recently demonstrated strong, demonstrated strong performance, Large reasoning models, capable Large Language

备注： Accepted by Findings of ACL 2026. 15 pages, 9 figures. Code: [this https URL](https://github.com/wangbing1416/ASLEC)

点击查看摘要

Abstract:Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
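
The dilution effect described in the abstract can be illustrated with a small numeric sketch. This is not the authors' ASLEC code: the function names, the toy log probabilities (-4.0 for the first token of each step, -0.5 elsewhere), and the sample shapes are assumptions for illustration. The second function mirrors only the idea stated for ASLEC-DROP, namely excluding first-token probabilities from the average.

```python
from typing import List

Steps = List[List[float]]  # one list of token log probabilities per reasoning step


def avg_logprob(steps: Steps) -> float:
    """Mean token log probability over every token of every step."""
    tokens = [lp for step in steps for lp in step]
    return sum(tokens) / len(tokens)


def avg_logprob_drop_first(steps: Steps) -> float:
    """Same mean, but skipping the first token of each step.

    Assumes at least one step has more than one token.
    """
    tokens = [lp for step in steps for lp in step[1:]]
    return sum(tokens) / len(tokens)


def make_sample(num_steps: int, tokens_per_step: int) -> Steps:
    """Toy sample: each step opens with a low-probability token."""
    return [[-4.0] + [-0.5] * (tokens_per_step - 1) for _ in range(num_steps)]


# Both samples have 40 tokens; only the step length differs.
short_steps = make_sample(num_steps=8, tokens_per_step=5)
long_steps = make_sample(num_steps=2, tokens_per_step=20)

print(avg_logprob(short_steps), avg_logprob(long_steps))
print(avg_logprob_drop_first(short_steps), avg_logprob_drop_first(long_steps))
```

With the plain average, the long-step sample scores higher even though every non-initial token is equally likely in both samples. Once first tokens are excluded, the two samples tie, which is the behavior a length-neutral selection score should show on this toy data.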

45. 【2604.06832】Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

链接：https://arxiv.org/abs/2604.06832

作者：Chengyue Wu,Shiyi Lan,Yonggan Fu,Sensen Gao,Jin Wang,Jincheng Yu,Jose M. Alvarez,Pavlo Molchanov,Ping Luo,Song Han,Ligeng Zhu,Enze Xie

类目：Computation and Language (cs.CL)

关键词：Vision-language models, limits inference throughput, fundamentally limits inference, predominantly rely, time and fundamentally

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.

46. 【2604.06829】WRAP++: Web discoveRy Amplified Pretraining

链接：https://arxiv.org/abs/2604.06829

作者：Jiang Zhou,Yunhao Wang,Xing Wu,Tinghao Yu,Feng Zhang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Synthetic data rephrasing, enhancing knowledge acquisition, large language model, Synthetic data, rephrasing has emerged

备注： Work in progress. Correspondence to ucaswu@tencent.com or wuxing@iie.ac.cn

点击查看摘要

Abstract:Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.

47. 【2604.06826】Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

链接：https://arxiv.org/abs/2604.06826

作者：Paula Dodig,Boshko Koloski,Katarina Sitar Šuštar,Senja Pollak,Matthew Purver

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：assessing corporate performance, considerations are increasingly, long-term sustainability, ESG, increasingly integral

备注： Accepted at the 7th Financial Narrative Processing Workshop at LREC '26

点击查看摘要

Abstract:Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.

48. 【2604.06817】SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization

链接：https://arxiv.org/abs/2604.06817

作者：Usman Naseem,Robert Geislinger,Juan Ren,Sarah Kohail,Rudy Garrido Veliz,P Sam Sahil,Yiran Zhang,Marco Antonio Stranisci,Idris Abdulmumin,Özge Alaçam,Cengiz Acartürk,Aisha Jabr,Saba Anwar,Abinew Ali Ayele,Elena Tutubalina,Aung Kyaw Htet,Xintong Wang,Surendrabikram Thapa,Tanmoy Chakraborty,Dheeraj Kodati,Sahar Moradizeyveh,Firoj Alam,Ye Kyaw Thu,Shantipriya Parida,Ihsan Ayyub Qazi,Lilian Wanzare,Nelson Odhiambo Onyango,Clemencia Siro,Ibrahim Said Ahmad,Adem Chanie Ali,Martin Semmann,Chris Biemann,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam

类目：Computation and Language (cs.CL)

关键词：online polarization detection, polarization, polarization manifestation, polarization detection, annotated instances

备注：

点击查看摘要

Abstract:We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three sub-tasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submissions on Codabench. We received final submissions from 67 teams and 73 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset of this task is publicly available.

49. 【2604.06812】AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation

链接：https://arxiv.org/abs/2604.06812

作者：Guanran Luo,Wentao Qiu,Wanru Zhao,Wenhan Lv,Zhongquan Jian,Meihong Wang,Qingqiang Wu

类目：Computation and Language (cs.CL)

关键词：Large Language Models, Large Language, demonstrated impressive capabilities, hallucination problem, Language Models

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in long-form generation, yet their application is hindered by the hallucination problem. While Uncertainty Quantification (UQ) is essential for assessing reliability, the complex structure of long-form text makes reliable aggregation across heterogeneous themes difficult. In addition, existing methods often overlook the nuance of neutral information and suffer from the high computational cost of fine-grained decomposition. To address these challenges, we propose AGSC (Adaptive Granularity and GMM-based Semantic Clustering), a UQ framework tailored for long-form generation. AGSC first uses NLI neutral probabilities as triggers to distinguish irrelevance from uncertainty, reducing unnecessary computation. It then applies Gaussian Mixture Model (GMM) soft clustering to model latent semantic themes and assign topic-aware weights for downstream aggregation. Experiments on BIO and LongFact show that AGSC achieves state-of-the-art correlation with factuality while reducing inference time by about 60% compared to full atomic decomposition.

50. 【2604.06805】Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

链接：https://arxiv.org/abs/2604.06805

作者：Jia-Chen Zhang,Zheng Zhou,Yu-Jie Xiong

类目：Computation and Language (cs.CL)

关键词：explicit reasoning steps, leveraging explicit reasoning, significantly advanced, capabilities of LLMs, LLMs by leveraging

备注：

点击查看摘要

Abstract:Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

51. 【2604.06799】Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

链接：https://arxiv.org/abs/2604.06799

作者：Parth Patil,Dhruv Kumar,Yash Sinha,Murari Mandal

类目：Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：informative stress tests, current benchmarks provide, large language models, informative stress, stress tests

备注： Under Review as a conference paper at COLM 2026

点击查看摘要

Abstract:Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluated seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model's algebraic reasoning capacity.

52. 【2604.06794】GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering

链接：https://arxiv.org/abs/2604.06794

作者：Guanran Luo,Wentao Qiu,Zhongquan Jian,Meihong Wang,Qingqiang Wu

类目：Computation and Language (cs.CL)

关键词：enhance large language, requires manually designed, large language models, manually designed prompts, enhance large

备注：

点击查看摘要

Abstract:Chain-of-Thought reasoning can enhance large language models, but it requires manually designed prompts to guide the model. Recently proposed CoT-decoding enables the model to generate CoT-style reasoning paths without prompts, but it is only applicable to problems with fixed answer sets. To address this limitation, we propose a general decoding strategy GCoT-decoding that extends applicability to a broader range of question-answering tasks. GCoT-decoding employs a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It then splits each path into a reasoning span and an answer span to accurately compute path confidence, and finally aggregates semantically similar paths to identify a consensus answer, replacing traditional majority voting. We conduct extensive experiments on six datasets covering both fixed and free QA tasks. Our method not only maintains strong performance on fixed QA but also achieves significant improvements on free QA, demonstrating its generality.

53. 【2604.06789】Video-guided Machine Translation with Global Video Context

链接：https://arxiv.org/abs/2604.06789

作者：Jian Chen,JinZe Lv,Zi Long,XiangHua Fu

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Video-guided Multimodal Translation, Video-guided Multimodal, recent years, globally video-guided multimodal, Multimodal Translation

备注：

点击查看摘要

Abstract:Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.

54. 【2604.06788】From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

链接：https://arxiv.org/abs/2604.06788

作者：Daniel N. Wilke

类目：Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

关键词：large language model, computational mechanics workflow, coordinated large language, complete computational mechanics, agents autonomously execute

备注： 32 pages, 8 figures, 5 tables

点击查看摘要

Abstract:We present a solver-agnostic framework in which coordinated large language model (LLM) agents autonomously execute the complete computational mechanics workflow, from perceptual data of an engineering component through geometry extraction, material inference, discretisation, solver execution, uncertainty quantification, and code-compliant assessment, to an engineering report with actionable recommendations. Agents are formalised as conditioned operators on a shared context space with quality gates that introduce conditional iteration between pipeline layers. We introduce a mathematical framework for extracting engineering information from perceptual data under uncertainty using interval bounds, probability densities, and fuzzy membership functions, and introduce task-dependent conservatism to resolve the ambiguity of what `conservative' means when different limit states are governed by opposing parameter trends. The framework is demonstrated through a finite element analysis pipeline applied to a photograph of a steel L-bracket, producing a 171,504-node tetrahedral mesh, seven analyses across three boundary condition hypotheses, and a code-compliant assessment revealing structural failure with a quantified redesign. All results are presented as generated in the first autonomous iteration without manual correction, reinforcing that a professional engineer must review and sign off on any such analysis.

55. 【2604.06787】When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

Link: https://arxiv.org/abs/2604.06787

Authors: Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang, Juntao Li, Min Zhang

Subjects: Computation and Language (cs.CL)

Keywords: inference-time scaling capability, powerful inference-time scaling, complex reasoning tasks, Large reasoning models, achieved remarkable performance

Comments: ACL 2026 Main Conference

Abstract:Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been generated, yet existing approaches mostly rely on handcrafted or empirical indicators that are unreliable and impractical. In this work, we introduce Dynamic Thought Sufficiency in Reasoning (DTSR), a novel framework for efficient reasoning that enables the model to dynamically assess the sufficiency of its chain-of-thought (CoT) and determine the optimal point for early exit. Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT is sufficient to derive the final answer. Experimental results on the Qwen3 models show that DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking. We further discuss overconfidence in LRMs and self-evaluation paradigms, providing valuable insights for early-exit reasoning.
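
The two-stage structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cue list `REFLECTION_CUES` is invented, and `is_sufficient` is a hypothetical callback standing in for whatever sufficiency check the model performs.

```python
from typing import Callable, List

# Hypothetical cue phrases; the abstract does not list the actual reflection signals.
REFLECTION_CUES = ("wait", "hmm", "let me double-check")


def early_exit_index(steps: List[str],
                     is_sufficient: Callable[[List[str]], bool]) -> int:
    """Return how many reasoning steps to keep before answering.

    Stage 1 scans for a reflection cue. Stage 2 runs only at those points and
    asks `is_sufficient` whether the chain so far already supports an answer.
    """
    for i, step in enumerate(steps):
        if any(cue in step.lower() for cue in REFLECTION_CUES):
            prefix = steps[:i]  # thoughts produced before the reflection began
            if prefix and is_sufficient(prefix):
                return i
    return len(steps)  # no early exit: keep the full chain
```

Running the sufficiency check only at reflection points keeps the number of extra checks small compared with testing after every step.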

56. 【2604.06784】Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation

Link: https://arxiv.org/abs/2604.06784

Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu

Subjects: Computation and Language (cs.CL)

Keywords: predominantly leveraged structural, leveraged structural information, structural information inherent, Previous research, multi-party dialogue generation

Comments: ACL 2026 Main Conference

Abstract:Previous research on multi-party dialogue generation has predominantly leveraged structural information inherent in dialogues to directly inform the generation process. However, the prevalence of colloquial expressions and incomplete utterances in dialogues often impedes comprehension and weakens the fidelity of dialogue structure representations, which is particularly pronounced in multi-party dialogues. In this work, we propose a novel framework DRCR (Discourse coherence and Response-guided Context Rewriting) to improve multi-party dialogue generation through dialogue context rewriting. Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation. Moreover, we propose a dynamic self-evolution learning method that allows the rewriter and responder to continuously enhance their capabilities through mutual interaction in an iterative training loop. Comprehensive experiments conducted on four multi-party dialogue datasets substantiate the effectiveness of DRCR.

57. 【2604.06771】Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

Link: https://arxiv.org/abs/2604.06771

Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: efficient conversational search, Conversational Query Rewriting, conversational search, efficient conversational, Conversational Query

Comments: ACL 2026 Findings

Abstract:Conversational Query Rewriting (CQR) aims to rewrite ambiguous queries to achieve more efficient conversational search. Early studies have predominantly focused on the rewriting in isolation, ignoring the feedback from query rewrite, passage retrieval and response generation in the rewriting process. To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR). Specifically, we first construct self-consistent preference alignment data from three dimensions (rewriting, retrieval, and response) to generate more diverse rewritten queries. Then we propose prefix guided multi-faceted direct preference optimization to learn preference information from three different dimensions. The experimental results show that our MSPA-CQR is effective in both in- and out-of-distribution scenarios.

58. 【2604.06767】Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models

Link: https://arxiv.org/abs/2604.06767

Authors: Marshall Brett

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: continuous vector spaces, vector spaces, representation manifold, Voronoi tessellation, Language models operate

Comments: 20 pages

Abstract:Language models operate on discrete tokens but compute in continuous vector spaces, inducing a Voronoi tessellation over the representation manifold. We study this tessellation empirically on Qwen3.5-4B-Base, making two contributions. First, using float32 margin recomputation to resolve bfloat16 quantization artifacts, we validate Mabrok's (2026) linear scaling law of the expressibility gap with $R^2$ = 0.9997 - the strongest confirmation to date - and identify a mid-layer geometric ambiguity regime where margin geometry is anti-correlated with cross-entropy (layers 24-28, $\rho$ = -0.29) before crystallizing into alignment at the final layer ($\rho$ = 0.836). Second, we show that the Voronoi tessellation of a converged model is reshapable through margin refinement procedures (MRP): short post-hoc optimization runs that widen token-decision margins without retraining. We compare direct margin maximization against Fisher information distance maximization across a dose-response sweep. Both methods find the same ceiling of ~16,300 correctable positions per 256K evaluated, but differ critically in collateral damage. Margin maximization damage escalates with intervention strength until corrections are overwhelmed. Fisher damage remains constant at ~5,300 positions across the validated range ($\lambda$ = 0.15-0.6), achieving +28% median margin improvement at $\lambda$ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the expressibility gap while preserving its scaling law. However, frequency and token-class audits reveal that gains concentrate in high-frequency structural tokens (84% of net corrections at $\lambda$ = 0.6), with content and entity-like contributions shrinking at higher $\lambda$. Fisher MRP is therefore a viable geometric polishing tool whose practical ceiling is set not by aggregate damage but by the uniformity of token-level benefit.
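
For readers unfamiliar with the term, a token-decision margin is often taken to be the gap between the winning logit and the runner-up. The sketch below uses that common definition as an assumption; the paper's exact margin (and its float32 recomputation) may differ, and `decision_margin` is a hypothetical helper name.

```python
from typing import Sequence


def decision_margin(logits: Sequence[float]) -> float:
    """Gap between the largest and second-largest logit (assumed definition)."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up
```

Under this reading, a small margin means the representation sits close to a cell boundary of the tessellation, so a small perturbation could flip the predicted token.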

59. 【2604.06765】TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks

Link: https://arxiv.org/abs/2604.06765

Authors: Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu, Chanjin Zheng

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: multi-Large Language Model, Language Model, multi-step contextualized tasks, multi-Large Language, contextualized tasks

Comments:

Abstract:Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at this https URL.

60. 【2604.06758】Multilingual Cognitive Impairment Detection in the Era of Foundation Models

Link: https://arxiv.org/abs/2604.06758

Authors: Damar Hoogland, Boshko Koloski, Jaya Caporusso, Tine Kolenik, Ana Zwitter Vitez, Senja Pollak, Christina Manouilidou, Matthew Purver

Subjects: Computation and Language (cs.CL)

Keywords: evaluate cognitive impairment, speech in English, cognitive impairment, evaluate cognitive, Slovene

Comments: Accepted as an oral at the RAPID workshop @ LREC 2026

Abstract:We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings -- transcript-only, linguistic-features-only, and combined -- with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable signals.

61. 【2604.06756】How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

Link: https://arxiv.org/abs/2604.06756

Authors: Minzhu Tu, Shiyu Ni, Keping Bi

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, Large language, human evaluation, surface-level biases, widely adopted

Comments: ACL 2026 Main

Abstract:Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.

62. 【2604.06753】Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

Link: https://arxiv.org/abs/2604.06753

Authors: Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin, Li Kang, Xiufeng Song, Rui Li, Songtao Huang, Ao Yu, Yuchen Fan, Yanxu Chen, Kaixin Xu, Xiaohong Liu, Yiran Qin, Philip Torr, Chen Zhang, Zhenfei Yin

Subjects: Computation and Language (cs.CL)

Keywords: LLM-based agent improves, LLM-based agent, reasoning paradigm wrapped, paradigm, agent improves

Comments:

Abstract:When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
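
An embedding-based router of the kind the abstract mentions could look something like the sketch below. It assumes a nearest-centroid rule over cosine similarity, which the abstract does not confirm; `route`, `cosine`, and the two-dimensional toy centroids are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(task_embedding: Sequence[float],
          centroids: Dict[str, Sequence[float]]) -> str:
    """Pick the paradigm whose centroid is most similar to the task embedding."""
    return max(centroids, key=lambda name: cosine(task_embedding, centroids[name]))


# Toy centroids; in practice these would be learned from tasks on which
# each paradigm performed best.
CENTROIDS = {"Direct": [1.0, 0.0], "ReAct": [0.0, 1.0]}
```

The reported figures are internally consistent: 53.1% for the router against 50.3% for the best fixed paradigm is the stated 2.8pp gain, and the gain over the 47.6% average is 5.5pp.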

63. 【2604.06746】StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

Link: https://arxiv.org/abs/2604.06746

Authors: Zhirui Chen, Peiyang Liu, Ling Shao

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, support context windows, context windows exceeding

Comments: Accepted to ACL 2026 Findings, 14 pages

Abstract:As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.
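
The first component, aggregating attention across depth, can be sketched as a plain sum of the attention each key position receives over all layers and queries. This is an assumed reading of "Global In-Degree Centrality"; the paper may weight layers or heads differently, and `global_in_degree` and `keep_top_k` are hypothetical names.

```python
from typing import List

Attention = List[List[List[float]]]  # indexed as [layer][query][key]


def global_in_degree(attn: Attention) -> List[float]:
    """Sum the attention mass each key position receives over all layers."""
    scores = [0.0] * len(attn[0][0])
    for layer in attn:
        for row in layer:
            for key, weight in enumerate(row):
                scores[key] += weight
    return scores


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return the k highest-scoring positions, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because scores are pooled across layers, a token that looks dormant in one layer can still be retained if other layers attend to it heavily, which is the failure mode of single-layer snapshots that the abstract describes.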

64. 【2604.06737】Luwen Technical Report

Link: https://arxiv.org/abs/2604.06737

Authors: Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou, Kun Kuang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, demonstrated remarkable capabilities, remains challenging due, complex reasoning requirements, natural language processing

Comments: 10 pages, 4 figures

Abstract:Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present Luwen, an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate Luwen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that Luwen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.

65. 【2604.06736】SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Link: https://arxiv.org/abs/2604.06736

Authors: Yixi Zhou, Fan Zhang, Zhiqiao Guo, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie

Subjects: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: LLM-generated SQL, LLM-generated SQL programs, LLM-generated SQL queries, strong performance, remains unclear

Comments: 17 pages, including figures and tables

Abstract:Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at this https URL.

66. 【2604.06734】TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Link: https://arxiv.org/abs/2604.06734

Authors: Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai

Subjects: Computation and Language (cs.CL)

Keywords: Artificial Intelligence, solve complex problems, capability for Artificial, real-world environments, fundamental strategy

Comments:

Abstract:Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users' complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.

67. 【2604.06714】Steering the Verifiability of Multimodal AI Hallucinations

Link: https://arxiv.org/abs/2604.06714

Authors: Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu, Xuanjing Huang, Yu-Gang Jiang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: pose considerable risks, human users, multimodal large language, large language models, large language

Comments:

Abstract:AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucination contents could be detected by human users (i.e., obvious hallucinations), while others are often missed or require more verification effort (i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet, little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model's verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the verifiability required for different scenarios.

68. 【2604.06711】Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation

Link: https://arxiv.org/abs/2604.06711

Authors: Jianing Zhang, Runan Li, Honglin Pang, Ding Xia, Zhou Zhu, Qian Zhang, Chuntao Li, Xi Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Oracle Bone Script, Chinese Oracle Bone, Deciphering ancient Chinese, ancient Chinese Oracle, Bone Script

Comments:

Abstract:Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.

69. 【2604.06699】Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Link: https://arxiv.org/abs/2604.06699

Authors: Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: obscuring credit assignment, iteratively edit monolithic, Automated prompt optimization, edit monolithic prompts, large language models

Comments:

Abstract:Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor's marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45-87% tokens on MultiArith while reaching peak validation in 1 step.
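
Factor-level scoring by intervention can be illustrated with a simple ablation loop. This is only one possible reading: the abstract does not say whether factors are removed or edited, so ablation is an assumption here, and `factor_contributions` and the `evaluate` callback are hypothetical.

```python
from typing import Callable, Dict


def factor_contributions(
    factors: Dict[str, str],
    evaluate: Callable[[Dict[str, str]], float],
) -> Dict[str, float]:
    """Estimate each factor's marginal contribution to validation score.

    Each factor is removed in turn (an assumed form of intervention) and the
    drop relative to the full prompt is recorded.
    """
    baseline = evaluate(factors)
    contributions = {}
    for name in factors:
        ablated = {k: v for k, v in factors.items() if k != name}
        contributions[name] = baseline - evaluate(ablated)
    return contributions
```

Changing one factor at a time is what makes the credit assignment readable: any change in validation score can be attributed to that factor alone.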

70. 【2604.06685】ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

Link: https://arxiv.org/abs/2604.06685

Authors: Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: demonstrated significant potential, direct visual question-answering, Large Language Models, demonstrated significant, significant potential

Comments: Accepted by ACL 2026 Findings, Preprint Version

Abstract:While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in "black-box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systemically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at this https URL.

71. 【2604.06674】Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry

Link: https://arxiv.org/abs/2604.06674

Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Meaning in Persian, historical and relational, Meaning, Persian poetry, terms

Comments:

Abstract:Meaning in Persian poetry is both historical and relational. Words persist through literary tradition while shifting their force through changing constellations of neighbors, rhetorical frames, and poetic voices. This study examines that process using aligned Word2Vec spaces combined with graph-based neighborhood analysis across centuries and major poets. Rather than modeling semantic change as vector displacement alone, it treats lexical history as the rewiring of local semantic graphs: the gain and loss of neighbors, shifts in bridge roles, and movement across communities. The analysis centers on twenty target words, anchored by five recurrent reference terms: Earth, Night, two wine terms, and Heart. Surrounding them are affective, courtly, elemental, and Sufi concepts such as Love, Sorrow, Dervish, King, Annihilation, and Truth. These words exhibit distinct patterns of change. Night is more time-sensitive, Earth more poet-sensitive, and Heart shows continuity despite graph-role mobility. The two wine terms highlight probe sensitivity: one is broad and semantically diffuse, while the other is narrower and more stable. A lexical audit confirms that the corpus contains historically driven terms, poet-specific usages, and sparsely attested mystical vocabulary requiring caution. Overall, semantic change in Persian poetry is better captured as neighborhood rewiring than as abstract drift. For Digital Humanities, this approach restores local structure to computational analysis and supports interpretations closer to literary practice: persistence, migration, mediation, and selective transformation.

72. 【2604.06666】A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

Link: https://arxiv.org/abs/2604.06666

Authors: Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, Yi Chang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: providing human-friendly explanations, providing human-friendly, Explainable fake, effective explainable fake, detection aims

Comments: Accepted by TOIS

Abstract:Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations. Existing methods incorporating investigative journalism are often inefficient and struggle with breaking news. Recent advances in large language models (LLMs) enable leveraging externally retrieved reports as evidence for detection and explanation generation, but unverified reports may introduce inaccuracies. Moreover, effective explainable fake news detection should provide a comprehensible explanation for all aspects of a claim to assist the public in verifying its accuracy. To address these challenges, we propose a graph-enhanced defense framework (G-Defense) that provides fine-grained explanations based solely on unverified reports. Specifically, we construct a claim-centered graph by decomposing the news claim into several sub-claims and modeling their dependency relationships. For each sub-claim, we use the retrieval-augmented generation (RAG) technique to retrieve salient evidence and generate competing explanations. We then introduce a defense-like inference module based on the graph to assess the overall veracity. Finally, we prompt an LLM to generate an intuitive explanation graph. Experimental results demonstrate that G-Defense achieves state-of-the-art performance in both veracity detection and the quality of its explanations.

73. 【2604.06650】A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP

Link: https://arxiv.org/abs/2604.06650

Authors: Cheng Peng, Mengxian Lyu, Ziyi Chen, Yonghui Wu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Existing prompt-based fine-tuning, imposing significant computing, prompt-based fine-tuning methods, fine-tuning methods typically, Existing prompt-based

Comments:

Abstract:Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5~1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1~6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.

74. 【2604.06647】Feedback Adaptation for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.06647

Authors: Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang, Juntae Lee, Sungha Choi

Categories: Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Generation, static assumptions, typically evaluated, evaluated under static, frequently corrected

Notes: Accepted at ACL 2026 Findings

Abstract:Retrieval-Augmented Generation (RAG) systems are typically evaluated under static assumptions, despite being frequently corrected through user or expert feedback in deployment. Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced. We introduce feedback adaptation as a problem setting for RAG systems, which asks how effectively and how quickly corrective feedback propagates to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which captures the delay between feedback provision and behavioral change, and post-feedback performance, which measures reliability on semantically related queries after feedback. Using these metrics, we show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a minimal inference-time instantiation that incorporates feedback without retraining, demonstrating immediate correction and strong post-feedback generalization under the proposed evaluation. Our results highlight feedback adaptation as a previously overlooked dimension of RAG system behavior in interactive settings.

75. 【2604.06636】SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

Link: https://arxiv.org/abs/2604.06636

Authors: Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu, Pinyan Lu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enhancing LLM reasoning, existing methods fail, distinguish meaningful progress, Process supervision, unresolved token inefficiency

Notes: ACL 2026 Main

Abstract:Process supervision has emerged as a promising approach for enhancing LLM reasoning, yet existing methods fail to distinguish meaningful progress from mere verbosity, leading to limited reasoning capabilities and unresolved token inefficiency. To address this, we propose Stage-aware Hierarchical Advantage via Potential Estimation (SHAPE), a framework that formalizes reasoning as a trajectory through a state space of empirical solvability. SHAPE introduces a hierarchical credit assignment mechanism: at the segment level, it employs a stage-aware advantage function to prioritize efficient breakthroughs in low-potential states; at the token level, it utilizes entropy-driven redistribution to sharpen execution signals. Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.

76. 【2604.06633】Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection

Link: https://arxiv.org/abs/2604.06633

Authors: Zi Liang, Qipeng Xie, Jun He, Bohuan Xue, Weizheng Wang, Yuandao Cai, Fei Luo, Boxian Zhang, Haibo Hu, Kaishun Wu

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Large Language Models, Application Security Testing, Static Application Security, Language Models, Security Testing

Notes:

Abstract:Recent advancements in Large Language Models (LLMs) have sparked interest in their application to Static Application Security Testing (SAST), primarily due to their superior contextual reasoning capabilities compared to traditional symbolic or rule-based methods. However, existing LLM-based approaches typically attempt to replace human experts directly without integrating effectively with existing SAST tools. This lack of integration results in ineffectiveness, including high rates of false positives, hallucinations, limited reasoning depth, and excessive token usage, making them impractical for industrial deployment. To overcome these limitations, we present a paradigm shift that reorchestrates the SAST workflow from current LLM-assisted structure to a new LLM-centered workflow. We introduce Argus (Agentic and Retrieval-Augmented Guarding System), the first multi-agent framework designed specifically for vulnerability detection. Argus incorporates three key novelties: comprehensive supply chain analysis, collaborative multi-agent workflows, and the integration of state-of-the-art techniques such as Retrieval-Augmented Generation (RAG) and ReAct to minimize hallucinations and enhance reasoning. Extensive empirical evaluation demonstrates that Argus significantly outperforms existing methods by detecting a higher volume of true vulnerabilities while simultaneously reducing false positives and operational costs. Notably, Argus has identified several critical zero-day vulnerabilities with CVE assignments.

77. 【2604.06627】DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

Link: https://arxiv.org/abs/2604.06627

Authors: Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj, Yassine Benajiba, Sujith Ravi, Eli Shlizerman, Dan Roth

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, large language, In-Context Learning, prompting improve reasoning

Notes:

Abstract:In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal which is computationally intensive. We present DiffuMask, a diffusion-based framework integrating hierarchical shot-level and token-level pruning signals, that enables rapid and parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process via masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.

78. 【2604.06613】The Detection–Extraction Gap: Models Know the Answer Before They Can Say It

Link: https://arxiv.org/abs/2604.06613

Authors: Hanyang Wang, Mingxuan Zhu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)

Keywords: Modern reasoning models, continue generating long, Modern reasoning, reasoning models continue, models continue generating

Notes:

Abstract:Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection–extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70–78% of serial generation while improving accuracy by 1–5 pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8 pp; a cost-optimized variant achieves 68–73% reduction at a median of 9 API calls. Code is available at this https URL.
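
For readers unfamiliar with the quantity named in the abstract above, the total-variation distance between two discrete distributions is half the sum of their absolute probability differences. The sketch below is illustrative only: the function name and the toy answer distributions are assumptions made for this note, not the paper's code or data.

```python
from typing import Dict


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total-variation distance between two discrete distributions.

    Outcomes missing from one distribution are treated as probability 0.
    """
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


# Hypothetical answer distributions (made-up numbers for illustration):
# "free" continuation concentrates on one answer, "forced" extraction spreads out.
free = {"42": 0.9, "41": 0.1}
forced = {"42": 0.5, "41": 0.3, "7": 0.2}

print(round(total_variation(free, forced), 3))  # 0.4
```

A value of 0 means the two distributions agree exactly, and 1 means they never overlap.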

79. 【2604.06603】Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs

Link: https://arxiv.org/abs/2604.06603

Authors: Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, task-solving capabilities, severe hallucination, reserves and task-solving

Notes:

Abstract:Large language models (LLMs) have shown strong knowledge reserves and task-solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize this highly condensed knowledge sufficiently through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrates subject-specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, we build an extensible framework to effectively constrain the model generation on domain tasks. Experiments on scientific tasks including industrial formulation design, clinical tumor diagnosis and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs in automatically inductively summarizing highly condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper can be obtained (this https URL).

80. 【2604.06573】Scoring Edit Impact in Grammatical Error Correction via Embedded Association Graphs

Link: https://arxiv.org/abs/2604.06573

Authors: Qiyuan Xiao, Xiaoman Wang, Yunshi Lan

Categories: Computation and Language (cs.CL)

Keywords: Grammatical Error Correction, Grammatical Error, Error Correction, produces a sequence, correct an erroneous

Notes:

Abstract:A Grammatical Error Correction (GEC) system produces a sequence of edits to correct an erroneous sentence. The quality of these edits is typically evaluated against human annotations. However, a sentence may admit multiple valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios. Recent meta-evaluation approaches rely on human judgments across multiple references, but they are difficult to scale to large datasets. In this paper, we propose a new task, Scoring Edit Impact in GEC, which aims to automatically estimate the importance of edits produced by a GEC system. To address this task, we introduce a scoring framework based on an embedded association graph. The graph captures latent dependencies among edits and syntactically related edits, grouping them into coherent groups. We then perform perplexity-based scoring to estimate each edit's contribution to sentence fluency. Experiments across 4 GEC datasets, 4 languages, and 4 GEC systems demonstrate that our method consistently outperforms a range of baselines. Further analysis shows that the embedded association graph effectively captures cross-linguistic structural dependencies among edits.

81. 【2604.06571】LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Link: https://arxiv.org/abs/2604.06571

Authors: Joshua Castillo, Ravi Mukkamala

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: including structured forms, narrative web profiles, child-safety investigations rely, Missing-person and child-safety, bulletin-style posters

Notes: 9 pages, 6 figures. Accepted at International Conference on Intelligent Digitization of Systems and Services (IDSS 2026)

Abstract:Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.

82. 【2604.06552】To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

Link: https://arxiv.org/abs/2604.06552

Authors: Zohaib Khan, Mustafa Dogan, Ifeoma Okoh, Pouya Sadeghi, Siddhartha Shrestha, Sergius Justus Nyah, Mahmoud O. Mokhiamar, Michael J. Ryan, Tarek Naous

Categories: Computation and Language (cs.CL)

Keywords: strong writing capabilities, disseminate false information, strong writing, writing capabilities, barrier for malicious

Notes: Accepted at ACL 2026 Main Conference

Abstract:Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: this https URL

83. 【2604.06551】CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

Link: https://arxiv.org/abs/2604.06551

Authors: Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang

Categories: Computation and Language (cs.CL)

Keywords: Cognitive Behavioral Therapy, simulating Cognitive Behavioral, Large language models, scalable mental-health support, Behavioral Therapy

Notes:

Abstract:Large language models show potential for scalable mental-health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy. We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient to information-asymmetric interaction, where the Therapist Agent must reason from inferred client states. We release CCDCHAT, a synthetic multi-turn CBT dataset generated under this framework. Evaluations with clinical scales and expert therapists show that models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement, with ablations confirming the necessity of dynamic CCD guidance and asymmetric agent design. Our work offers a new paradigm for building theory-grounded, clinically-plausible conversational agents.

84. 【2604.06543】The Illusion of Stochasticity in LLMs

Link: https://arxiv.org/abs/2604.06543

Authors: Xiangming Gu, Soham De, Michalis Titsias, Larisa Markeeva, Petar Veličković, Razvan Pascanu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, requirement for Large, reliable stochastic sampling, Language Models

Notes: Under review

Abstract:In this work, we demonstrate that reliable stochastic sampling is a fundamental yet unfulfilled requirement for Large Language Models (LLMs) operating as agents. Agentic systems are frequently required to sample from distributions, often inferred from observed data, a process which needs to be emulated by the LLM. This leads to a distinct failure point: while standard RL agents rely on external sampling mechanisms, LLMs fail to map their internal probability estimates to their stochastic outputs. Through rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions, we demonstrate the extent of this failure. Crucially, we show that while powerful frontier models can convert provided random seeds to target distributions, their ability to sample directly from specific distributions is fundamentally flawed.

85. 【2604.06542】Does a Global Perspective Help Prune Sparse MoEs Elegantly?

Link: https://arxiv.org/abs/2604.06542

Authors: Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu

Categories: Computation and Language (cs.CL)

Keywords: Empirical scaling laws, Empirical scaling, ever-larger LLMs, scaling laws, encouraged the development

Notes:

Abstract:Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.

86. 【2604.06507】Fine-tuning Whisper for Pashto ASR: strategies and scale

Link: https://arxiv.org/abs/2604.06507

Authors: Hanif Rahman

Categories: Computation and Language (cs.CL)

Keywords: sizes output Arabic, Whisper pre-training corpus, Whisper sizes output, largest language collections, output Arabic

Notes:

Abstract:Pashto is absent from Whisper's pre-training corpus despite being one of CommonVoice's largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
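
A word error rate above 100% can look like a mistake, but it follows from the standard definition: WER = (substitutions + deletions + insertions) / number of reference words, so a hypothesis with many inserted words can push the ratio past 1. The sketch below is a minimal illustration with toy strings; the function name and inputs are assumptions for this note, not the paper's evaluation script or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis (i deletions)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy example: 2 reference words, 4 unrelated hypothesis words
# -> 2 substitutions + 2 insertions = 4 errors over 2 reference words.
print(word_error_rate("a b", "x y z w"))  # 2.0, i.e. 200% WER
```

This is why a model that emits long output in the wrong script can score worse than one that emits nothing at all, which would sit at exactly 100%.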

87. 【2604.06505】MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Link: https://arxiv.org/abs/2604.06505

Authors: Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reasoning-intensive research tasks, Large language models, Large language, evidence remain limited, research tasks

Notes:

Abstract:Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: this https URL.

88. 【2604.06501】Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning

Link: https://arxiv.org/abs/2604.06501

Authors: Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Analogical reasoning, transferring knowledge, human-like analogical reasoning, analogical reasoning task, human analogical reasoning

Notes:

Abstract:Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another. Yet, developing artificial intelligence systems capable of robust human-like analogical reasoning has proven difficult. In this work, we train transformers using Meta-Learning for Compositionality (MLC) on an analogical reasoning task (letter-string analogies) and assess their generalization capabilities. We find that letter-string analogies become learnable when guiding the models to attend to the most informative problem elements induced by including copying tasks in the training data. Furthermore, generalization to new alphabets becomes better when models are trained with more heterogeneous datasets, where our 3-layer encoder-decoder model outperforms most frontier models. The MLC approach also enables some generalization to compositions of trained transformations, but not to completely novel transformations. To understand how the model operates, we identify an algorithm that approximates the model's computations. We verify this using interpretability analyses and show that the model can be steered precisely according to expectations derived from the algorithm. Finally, we discuss implications of our findings for generalization capabilities of larger models and parallels to human analogical reasoning.

89. 【2604.06487】Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Link: https://arxiv.org/abs/2604.06487

Authors: Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi, Shashi Kumar, Hasindri Watawana, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke

Categories: Computation and Language (cs.CL)

Keywords: automatic speech recognition, systems rely, Recent LLM-based ASR, LLM-based ASR architectures, ASR architectures connect

Notes: Submitted to Interspeech

Abstract:Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain (less than 4 hours) speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.

90. 【2604.06484】ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Link: https://arxiv.org/abs/2604.06484

Authors: Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

Categories: Computation and Language (cs.CL)

Keywords: everyday social practices, social practices, scenes and everyday, everyday social, Cultural

Notes: Preprint. Under review

Abstract:Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

91. 【2604.06474】DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Link: https://arxiv.org/abs/2604.06474

Authors: Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena, Monica S. Lam

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, multi-step information discovery, Deep research, Language Model, Large Language

Notes:

Abstract:Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

92. 【2604.06465】Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

Link: https://arxiv.org/abs/2604.06465

Authors: Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Iacopo Masi, Emanuele Rodolà

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: demonstrated remarkable capabilities, solving complex problems, leveraging long chains, chains of thought, demonstrated remarkable

Notes:

Abstract:Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.
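
The "Pareto front of merged models" mentioned in the abstract above refers to the candidates that no other candidate beats on both objectives at once (higher accuracy, fewer output tokens). The sketch below shows only that filtering step; the model names and numbers are hypothetical placeholders for this note, not results or code from the paper, and the evolutionary search itself is not reproduced.

```python
from typing import List, Tuple

# (name, accuracy, mean output tokens); all values are made up for illustration
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


merged_models = [
    ("merge_a", 0.80, 1000.0),
    ("merge_b", 0.78, 500.0),
    ("merge_c", 0.75, 700.0),   # dominated by merge_b: less accurate and longer
    ("merge_d", 0.82, 1500.0),
]

print([name for name, _, _ in pareto_front(merged_models)])
# ['merge_a', 'merge_b', 'merge_d']
```

Returning the whole front rather than a single scalarized optimum leaves the accuracy-versus-length choice to whoever deploys the model.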

93. 【2604.06456】Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

Link: https://arxiv.org/abs/2604.06456

Authors: Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed

Categories: Computation and Language (cs.CL)

Keywords: Current Machine Translation, Current Machine, Modern Standard Arabic, inputs into Modern, frequently homogenizing dialectal

Notes: 14 pages, 5 figures, 5 tables. Preprint under review

Abstract:Current Machine Translation (MT) systems for Arabic often struggle to account for dialectal diversity, frequently homogenizing dialectal inputs into Modern Standard Arabic (MSA) and offering limited user control over the target vernacular. In this work, we propose a context-aware and steerable framework for dialectal Arabic MT that explicitly models regional and sociolinguistic variation. Our primary technical contribution is a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset covering eight regional varieties (e.g., Egyptian, Levantine, and Gulf). By fine-tuning an mT5-base model conditioned on lightweight metadata tags, our approach enables controllable generation across dialects and social registers in the translation output. Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by defaulting toward the MSA mean, while exhibiting limited dialectal specificity. In contrast, our model achieves lower BLEU scores (8.19) but produces outputs that align more closely with the intended regional varieties. Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs. 1.0/5). These findings highlight the limitations of standard MT metrics for dialect-sensitive tasks and motivate the need for evaluation practices that better reflect linguistic diversity in Arabic MT.


94. 【2604.06452】Learning to Interrupt in Language-based Multi-agent Communication

Link: https://arxiv.org/abs/2604.06452

Authors: Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz, Ansong Ni

Categories: Computation and Language (cs.CL)

Keywords: large language models, demonstrated impressive capabilities, language models, systems using large, large language

Abstract:Multi-agent systems using large language models (LLMs) have demonstrated impressive capabilities across various domains. However, current agent communication suffers from verbose outputs that overload context and increase computational costs. Although existing approaches focus on compressing the message from the speaker side, they struggle to adapt to different listeners and identify relevant information. An effective strategy in human communication is to allow the listener to interrupt and express their opinion or ask for clarification. Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker. Through prompting experiments, we find that current LLMs are often overconfident and interrupt before receiving enough information. Therefore, we propose a learning method that predicts the appropriate interruption points based on the estimated future reward and cost. We evaluate our framework across various multi-agent scenarios, including 2-agent text pictionary games, 3-agent meeting scheduling, and 3-agent debate. The experimental results show that our HANDRAISER can reduce the communication cost by 32.2% compared to the baseline with comparable or superior task performance. This learned interruption behavior also generalizes to different agents and tasks.

95. 【2604.06427】The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

Link: https://arxiv.org/abs/2604.06427

Authors: Yi Xu, Philipp Jettkant, Laura Ruis

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: unable to reason, reason effectively, latent, latent planning, steps

Notes: 10 pages, 3 figures, 1 table (30 pages, 9 figures, 10 tables including references and appendices)

Abstract:The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.

96. 【2604.06424】Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking

Link: https://arxiv.org/abs/2604.06424

Authors: Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: named entity recognition, SympTEMIST named entity, paper presents, presents a transformer-based, transformer-based approach

Notes: 6 pages, 3 tables, Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, American Medical Informatics Association 2023 Annual Symposium

Abstract:This paper presents a transformer-based approach to solving the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. For NER, we fine-tune a RoBERTa-based (1) token-level classifier with BiLSTM and CRF layers on an augmented train set. Entity linking is performed by generating candidates using the cross-lingual SapBERT XLMR-Large (2), and calculating cosine similarity against a knowledge base. The choice of knowledge base proves to have the highest impact on model accuracy.
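
The entity-linking step above (candidate generation followed by cosine similarity against a knowledge base) can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the three-dimensional vectors stand in for SapBERT embeddings, and the `link` helper and the `CODE_*` identifiers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def link(mention_vec: List[float],
         kb: Dict[str, List[float]],
         top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank knowledge-base entries by similarity to the mention embedding."""
    scored = [(code, cosine(mention_vec, vec)) for code, vec in kb.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy knowledge base: concept code -> embedding (placeholder values).
kb = {
    "CODE_A": [1.0, 0.0, 0.0],
    "CODE_B": [0.0, 1.0, 0.0],
    "CODE_C": [0.7, 0.7, 0.0],
}
ranked = link([0.9, 0.1, 0.0], kb)
```

Because the ranking depends entirely on which entries the knowledge base contains, a sketch like this also makes it plausible why the abstract reports the choice of knowledge base as the most influential factor.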

97. 【2604.06422】When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Link: https://arxiv.org/abs/2604.06422

Authors: Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky, William Rudman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding when Vision-Language, behave unexpectedly, Color, Graded Color Attribution, reliably predict

Abstract:Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.
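
The faithfulness check described above compares a stated coverage threshold with later labeling decisions. A minimal sketch of such a check might look like this; the `violation_rate` helper, the trial format, and the numbers are hypothetical and are not taken from the GCA dataset.

```python
from typing import List, Tuple

def violation_rate(threshold_pct: float,
                   trials: List[Tuple[float, bool]]) -> float:
    """Fraction of trials whose label contradicts the stated rule.

    Each trial is (color_coverage_pct, labeled_with_that_color). The stated rule
    says the label should be given exactly when coverage >= threshold_pct.
    """
    violations = sum(
        1 for coverage, labeled in trials
        if labeled != (coverage >= threshold_pct)
    )
    return violations / len(trials)

# Hypothetical participant who states a 50% threshold.
trials = [(10.0, True), (30.0, True), (60.0, True), (80.0, False), (20.0, False)]
rate = violation_rate(50.0, trials)
```

In this toy example three of the five decisions contradict the stated rule, so `rate` is 0.6.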

98. 【2604.06421】State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Link: https://arxiv.org/abs/2604.06421

Authors: Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi, Amarendra Chaudhary, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Categories: Computation and Language (cs.CL)

Keywords: Arabic LLM Leaderboard, digital equity gap, entire Open Arabic, LLM Leaderboard, application-driven open-source Arabic

Abstract:This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically outperform the proprietary frontier system GPT-5.1 on the majority of benchmarks evaluating comprehensive language-specific tasks: the first such demonstration for Arabic LLMs. These findings indicate that much of Arabic's performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations, and that parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs. Arabic-DeepSeek-R1 establishes a validated and replicable framework for sovereign and domain-specific language technologies, demonstrating that strategic, culturally-grounded adaptation of sparse MoE backbones offers a viable and cost-effective pathway to achieving record-breaking performance across standardized benchmarks for low-resource languages.

99. 【2604.06416】Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries

Link: https://arxiv.org/abs/2604.06416

Authors: Rebecca M. M. Hicke, Sil Hamilton, David Mimno, Ross Deans Kristensen-McLachlan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: LLM context lengths, lengths have grown, context lengths, ability to integrate, integrate information

Abstract:Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment task, which indicates the complexity of summarization as a task. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.

100. 【2604.06409】Say Something Else: Rethinking Contextual Privacy as Information Sufficiency

Link: https://arxiv.org/abs/2604.06409

Authors: Yunze Xiao, Wenkai Li, Xiaoyuan Wu, Ningshan Ma, Yueqi Song, Weihao Xuan

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: users routinely overshare, agents increasingly draft, increasingly draft messages, LLM agents increasingly, routinely overshare sensitive

Abstract:LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private. Existing systems support only suppression (omitting sensitive information) and generalization (replacing information with an abstraction), and are typically evaluated on single isolated messages, leaving both the strategy space and evaluation setting incomplete. We formalize privacy-preserving LLM communication as an Information Sufficiency (IS) task, introduce free-text pseudonymization as a third strategy that replaces sensitive attributes with functionally equivalent alternatives, and propose a conversational evaluation protocol that assesses strategies under realistic multi-turn follow-up pressure. Across 792 scenarios spanning three power-relation types (institutional, peer, intimate) and three sensitivity categories (discrimination risk, social cost, boundary), we evaluate seven frontier LLMs on privacy at two granularities, covertness, and utility. Pseudonymization yields the strongest privacy-utility tradeoff overall, and single-message evaluation systematically underestimates leakage, with generalization losing up to 16.3 percentage points of privacy under follow-up.

101. 【2604.06403】FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

Link: https://arxiv.org/abs/2604.06403

Authors: Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Spanish clinical texts, toxic habits named, habits named entities, entities in Spanish, Spanish clinical

Notes: 8 pages, 1 figure, 6 tables, Challenge and Workshop BC9 Large Language Models for Clinical and Biomedical NLP, International Joint Conference on Artificial Intelligence IJCAI 2025

Abstract:The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them in four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1's few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.

102. 【2604.06393】ART: Attention Replacement Technique to Improve Factuality in LLMs

Link: https://arxiv.org/abs/2604.06393

Authors: Ziqin Luo, Yihao Quan, Xiaofeng Zhang, Xiaosong Yuan, Chen Shen

Categories: Computation and Language (cs.CL)

Keywords: large language models, attention, question answering, attention patterns, large language

Abstract:Hallucination in large language models (LLMs) continues to be a significant issue, particularly in tasks like question answering, where models often generate plausible yet incorrect or irrelevant information. Although various methods have been proposed to mitigate hallucinations, the relationship between attention patterns and hallucinations has not been fully explored. In this paper, we analyze the distribution of attention scores across each layer and attention head of LLMs, revealing a common and intriguing phenomenon: shallow layers of LLMs primarily rely on uniform attention patterns, where the model distributes its attention evenly across the entire sequence. This uniform attention pattern can lead to hallucinations, as the model fails to focus on the most relevant information. To mitigate this issue, we propose a training-free method called Attention Replacement Technique (ART), which replaces these uniform attention patterns in the shallow layers with local attention patterns. This change directs the model to focus more on the relevant contexts, thus reducing hallucinations. Through extensive experiments, ART demonstrates significant reductions in hallucinations across multiple LLM architectures, proving its effectiveness and generalizability without requiring fine-tuning or additional training data.
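
To make the idea of swapping near-uniform attention for local attention concrete, here is a rough sketch. It rests on several assumptions that the abstract does not confirm: uniformity is detected with normalized entropy, the threshold `tau` and the `window` size are invented, and the local pattern spreads weight evenly over the most recent positions. It should not be read as ART itself.

```python
import math
from typing import List

def normalized_entropy(row: List[float]) -> float:
    """Entropy of an attention row divided by its maximum, log(n); 1.0 means uniform."""
    n = len(row)
    if n <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in row if p > 0)
    return entropy / math.log(n)

def local_row(n: int, window: int) -> List[float]:
    """Attention row that spreads all weight evenly over the last `window` positions."""
    w = min(window, n)
    return [0.0] * (n - w) + [1.0 / w] * w

def replace_uniform_rows(attn: List[List[float]],
                         window: int = 2,
                         tau: float = 0.95) -> List[List[float]]:
    """attn[i] is the causal attention distribution of query i over keys 0..i."""
    return [
        local_row(len(row), window) if normalized_entropy(row) >= tau else list(row)
        for row in attn
    ]

attn = [[1.0], [0.5, 0.5], [0.05, 0.05, 0.9], [0.25, 0.25, 0.25, 0.25]]
patched = replace_uniform_rows(attn)
```

Only the last row is changed meaningfully here: it was perfectly uniform over four positions and is replaced by weight on the two most recent ones, while the peaked third row is left alone.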

103. 【2604.06385】Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Link: https://arxiv.org/abs/2604.06385

Authors: Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Categories: Computation and Language (cs.CL)

Keywords: subsequent SFT phase, combining reinforcement learning, extended reasoning rollouts, progressive difficulty training, synthesize high-quality training

Notes: * These authors contributed equally to this work and share first authorship

Abstract:We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.

104. 【2604.06374】The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models

Link: https://arxiv.org/abs/2604.06374

Authors: Michael Rizvi-Martel, Guillaume Rabusseau, Marius Mosbach

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: discrete CoT reasoning, promising alternative, alternative to discrete, Latent, latent thoughts

Notes: 9 pages

Abstract:Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: (i) pretraining on natural language data biases models to commit to a token in the last layers, and (ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.
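
The training-free regime builds latent thoughts as convex combinations of token embeddings. A toy version of that construction is sketched below; taking the weights from a next-token distribution is an assumption made for illustration, and the two-dimensional embeddings are placeholders.

```python
from typing import List

def convex_combination(weights: List[float],
                       embeddings: List[List[float]]) -> List[float]:
    """Weighted average of embeddings; weights must be non-negative and sum to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must form a probability distribution")
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

# Placeholder embeddings for three candidate tokens.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumed source of the weights: the model's next-token probabilities.
next_token_probs = [0.5, 0.25, 0.25]
latent = convex_combination(next_token_probs, token_embeddings)
```

The resulting vector `[0.75, 0.5]` lies between the three token embeddings, which is the sense in which a single latent thought could in principle keep several candidates "in superposition".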

105. 【2604.06365】A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation

Link: https://arxiv.org/abs/2604.06365

Authors: Ahmed Alansary, Molham Mohamed, Ali Hamdi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: access general health, general health guidance, Arabic medical text, medical text generation, users interpret symptoms

Notes: 10 pages, 2 figures, 2 tables, ICTIS2026

Abstract:Arabic medical text generation is increasingly needed to help users interpret symptoms and access general health guidance in their native language. Nevertheless, many existing methods assume uniform importance across training samples, overlooking differences in clinical severity. This simplification can hinder the model's ability to properly capture complex or high-risk cases. To overcome this issue, this work introduces a Severity-based Curriculum Learning Strategy for Arabic Medical Text Generation, where the training process is structured to move gradually from less severe to more critical medical conditions. The approach divides the dataset into ordered stages based on severity and incrementally exposes the model to more challenging cases during fine-tuning, allowing it to first learn basic medical patterns before addressing more complex scenarios. The proposed method is evaluated on a subset of the Medical Arabic Question Answering (MAQA) dataset, which includes Arabic medical questions describing symptoms alongside corresponding responses. In addition, the dataset is annotated with three severity levels (Mild, Moderate, and Critical) using a rule-based method developed in this study. The results demonstrate that incorporating severity-aware curriculum learning leads to consistent performance improvements across all tested models, with gains of around +4% to +7% over baseline models and +3% to +6% compared with conventional fine-tuning approaches.
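
A severity-ordered curriculum of this kind can be sketched as a function that turns labeled samples into training stages. Whether the stages are cumulative (as below) or disjoint is not stated in the abstract, so the cumulative design, the `curriculum_stages` helper, and the sample records are all assumptions.

```python
from typing import Dict, List

# Severity levels named in the abstract, ordered from least to most critical.
SEVERITY_ORDER = ["Mild", "Moderate", "Critical"]

def curriculum_stages(samples: List[Dict]) -> List[List[Dict]]:
    """Stage k contains every sample whose severity is among the first k+1 levels."""
    stages = []
    for k in range(len(SEVERITY_ORDER)):
        allowed = set(SEVERITY_ORDER[:k + 1])
        stages.append([s for s in samples if s["severity"] in allowed])
    return stages

# Hypothetical annotated samples.
samples = [
    {"id": 1, "severity": "Critical"},
    {"id": 2, "severity": "Mild"},
    {"id": 3, "severity": "Moderate"},
]
stages = curriculum_stages(samples)
stage_ids = [[s["id"] for s in stage] for stage in stages]
```

Fine-tuning would then iterate over `stages` in order, so the model sees mild cases first and the full severity range last.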

106. 【2604.06356】In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

Link: https://arxiv.org/abs/2604.06356

Authors: Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: text-only Language Models, Speech Language Models, remains largely unexplored, In-Context Learning, Language Models

Notes: Submitted to COLM 2026

Abstract:In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain. Here, we investigate how linguistic and acoustic features affect ICL in Speech Language Models. We focus on the Text-to-Speech (TTS) task, which allows us to analyze ICL from two angles: (1) how accurately the model infers the task from the demonstrations (i.e., generating the correct spoken content), and (2) to what extent the model mimics the acoustic characteristics of the demonstration speech in its output. We find that speaking rate strongly affects ICL performance and is also mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Finally, we investigate the role of induction heads in speech-based ICL and show that these heads play a causal role: ablating the top-k induction heads completely removes the model's ICL ability, mirroring findings from text-based ICL.

107. 【2604.06346】Severity-Aware Weighted Loss for Arabic Medical Text Generation

Link: https://arxiv.org/abs/2604.06346

Authors: Ahmed Alansary, Molham Mohamed, Ali Hamdi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: shown strong potential, medical text generation, traditional fine-tuning objectives, fine-tuning objectives treat, text generation

Notes: 10 pages, 1 figure, 2 tables, ICTIS2026

Abstract:Large language models have shown strong potential for Arabic medical text generation; however, traditional fine-tuning objectives treat all medical cases uniformly, ignoring differences in clinical severity. This limitation is particularly critical in healthcare settings, where errors in severe cases carry higher clinical risk. In this work, we propose a severity-aware weighted loss for fine-tuning Arabic language models on medical complaint-response data. The method relies on soft severity probabilities to dynamically scale token-level loss contributions during optimization, thereby prioritizing clinically critical interactions without modifying model architectures. Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses. Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier and incorporated exclusively at the loss level. The proposed approach is evaluated across ten Arabic large language models of varying architectures and parameter scales. While standard cross-entropy fine-tuning yields only modest improvements, severity-aware optimization consistently achieves larger gains. Using a balanced weighting configuration, performance improves from 54.04% to 66.14% for AraGPT2-Base, from 59.16% to 67.18% for AraGPT2-Medium, and from 57.83% to 66.86% for Qwen2.5-0.5B, with peak performance reaching 67.18%. Overall, severity-aware fine-tuning delivers improvements of up to 12.10% over non-fine-tuned baselines, demonstrating robust and architecture-consistent gains.
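
Scaling token-level loss by a soft severity probability can be illustrated with a few lines of plain Python. The weighting formula `1 + alpha * p_severe` and the `alpha` parameter are assumptions chosen for this sketch; the abstract does not give the exact form used in the paper.

```python
import math
from typing import List

def severity_weighted_loss(token_probs: List[float],
                           p_severe: float,
                           alpha: float = 1.0) -> float:
    """Mean token negative log-likelihood, scaled up for likely-severe cases.

    token_probs: probabilities the model assigns to the gold tokens of one response.
    p_severe:    soft probability (0..1) that the case is clinically severe.
    alpha:       hypothetical knob controlling how strongly severity is emphasized.
    """
    nll = [-math.log(p) for p in token_probs]
    weight = 1.0 + alpha * p_severe
    return weight * sum(nll) / len(nll)

mild = severity_weighted_loss([0.5, 0.5], p_severe=0.0)
critical = severity_weighted_loss([0.5, 0.5], p_severe=1.0)
```

With identical token probabilities, the critical case contributes twice the loss of the mild one, so gradient updates lean toward getting severe cases right. The model architecture is untouched, which matches the abstract's point that severity enters only at the loss level.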

108. 【2604.06330】STDec: Spatio-Temporal Stability Guided Decoding for dLLMs

Link: https://arxiv.org/abs/2604.06330

Authors: Yuzhe Chen, Jiale Cao, Xuyang Liu, Jin Xie, Aiping Yang, Yanwei Pang

Categories: Computation and Language (cs.CL)

Keywords: Diffusion Large Language, Large Language Models, Diffusion Large, Large Language, achieved rapid progress

Notes: Homepage: https://yzchen02.github.io/STDec

Abstract:Diffusion Large Language Models (dLLMs) have achieved rapid progress and are viewed as a promising alternative to the autoregressive paradigm. However, most dLLM decoders still adopt a global confidence threshold, and do not explicitly model local context from neighboring decoded states or temporal consistency of predicted token IDs across steps. To address this issue, we propose a simple spatio-temporal stability guided decoding approach, named STDec. We observe strong spatio-temporal stability in dLLM decoding: newly decoded tokens tend to lie near decoded neighbors, and their predicted IDs often remain consistent across several denoising steps. Inspired by this stability, our STDec includes spatial-aware decoding and temporal-aware decoding. The spatial-aware decoding dynamically generates the token-adaptive threshold by aggregating the decoded states of nearby tokens. The temporal-aware decoding relaxes the decoding thresholds for tokens whose predicted token IDs remain consistent over denoising steps. Our STDec is training-free and remains compatible with cache-based acceleration methods. Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance. Notably, on MBPP with LLaDA, STDec achieves up to 14.17x speedup with a comparable score.
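
The two ingredients described above, a threshold lowered by decoded neighbors and relaxed further for temporally stable predictions, might be combined as in the sketch below. Every constant (`base`, `radius`, `spatial_relax`, `temporal_relax`, `min_stable_steps`) and the linear form of the adjustment are assumptions for illustration, not the STDec rules.

```python
from typing import List

def adaptive_threshold(pos: int,
                       decoded: List[bool],
                       history: List[int],
                       base: float = 0.9,
                       radius: int = 1,
                       spatial_relax: float = 0.2,
                       temporal_relax: float = 0.1,
                       min_stable_steps: int = 3) -> float:
    """Confidence threshold for decoding the token at `pos` in the current step.

    decoded: which positions are already decoded.
    history: token IDs predicted for `pos` in previous denoising steps.
    """
    # Spatial part: the more decoded neighbors, the lower the threshold.
    lo, hi = max(0, pos - radius), min(len(decoded), pos + radius + 1)
    neighbors = [decoded[i] for i in range(lo, hi) if i != pos]
    frac_decoded = sum(neighbors) / len(neighbors) if neighbors else 0.0
    threshold = base - spatial_relax * frac_decoded

    # Temporal part: relax further if the prediction has not changed recently.
    recent = history[-min_stable_steps:]
    if len(recent) == min_stable_steps and len(set(recent)) == 1:
        threshold -= temporal_relax
    return threshold

decoded = [True, False, True, False, False]
stable = adaptive_threshold(1, decoded, history=[7, 7, 7])
isolated = adaptive_threshold(4, decoded, history=[1, 2])
```

Position 1 sits between two decoded tokens and has a stable prediction, so its threshold drops to about 0.6; position 4 has no decoded neighbor and too short a history, so it keeps the base threshold of 0.9.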

109. 【2604.06277】Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Link: https://arxiv.org/abs/2604.06277

Authors: Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Existing hallucination detection, requiring gold answers, auxiliary judge models, inference time, Existing hallucination

Notes: 20 pages, 6 figures, 6 tables. Introduces a 15k-sample representation-level hallucination dataset with full transformer hidden states and multi-signal weak supervision. Evaluates 5 probing architectures and demonstrates internal hallucination detection without external inference-time signals. Includes held-out test evaluation and deployment benchmarks

Abstract:Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be distilled into the model's own representations during training, enabling hallucination detection from internal activations alone at inference time. We introduce a weak supervision framework that combines three complementary grounding signals (substring matching, sentence embedding similarity, and an LLM-as-a-judge verdict) to label generated responses as grounded or hallucinated without human annotation. Using this framework, we construct a 15000-sample dataset from SQuAD v2 (10500 train/development samples and a separate 5000-sample test set), where each example pairs a LLaMA-2-7B generated answer with its full per-layer hidden states and structured hallucination labels. We then train five probing classifiers, ProbeMLP (M0), LayerWiseMLP (M1), CrossLayerTransformer (M2), HierarchicalTransformer (M3), and CrossLayerAttentionTransformerV2 (M4), directly on these hidden states, treating external grounding signals as training-time supervision only. Our central hypothesis is that hallucination detection signals can be distilled into transformer representations, enabling internal detection without any external verification at inference time. Results support this hypothesis. Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation. We also benchmark inference efficiency: probe latency ranges from 0.15 to 5.62 ms (batched) and 1.55 to 6.66 ms (single sample), while end-to-end generation plus probe throughput remains approximately 0.231 queries per second, indicating negligible practical overhead.
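
Combining the three grounding signals into one weak label could look like the following. The majority-vote rule, the `sim_threshold` value, and the `weak_label` helper are assumptions for this sketch; the abstract names the signals but not how they are aggregated. The similarity score and judge verdict are passed in as precomputed values rather than produced by real models.

```python
from typing import List

def weak_label(answer: str,
               gold_answers: List[str],
               embedding_similarity: float,
               judge_says_grounded: bool,
               sim_threshold: float = 0.8) -> str:
    """Label a generated answer as 'grounded' or 'hallucinated' by majority vote."""
    # Signal 1: case-insensitive substring match against any gold answer.
    substring_hit = any(g.lower() in answer.lower() for g in gold_answers)
    # Signal 2: precomputed sentence-embedding similarity above a cutoff.
    similar = embedding_similarity >= sim_threshold
    # Signal 3: precomputed verdict from a judge model.
    votes = [substring_hit, similar, judge_says_grounded]
    return "grounded" if sum(votes) >= 2 else "hallucinated"

label_a = weak_label("It was built in 1889.", ["1889"], 0.55, True)
label_b = weak_label("It was built in 1901.", ["1889"], 0.85, False)
```

Labels produced this way would serve only as training targets for the probes; at inference time the probe reads hidden states and none of the three signals is needed.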

Submission history: from Shoaib Sadiq Salehmohamed, [v1] Tue, 7 Apr 2026 08:14:48 UTC (28,449 KB)

110. 【2604.06231】Automating Database-Native Function Code Synthesis with LLMs

Link: https://arxiv.org/abs/2604.06231

Authors: Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, Fan Wu

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)

Keywords: database native functions, database native, Database systems incorporate, business migration, database native function

Comments: Please visit our homepage at: [this https URL](https://code4db.github.io/hi-opencook/). The code is available at: [this https URL](https://github.com/weAIDB/OpenCook)

Click to view abstract

Abstract:Database systems incorporate an ever-growing number of functions in their kernels (a.k.a., database native functions) for scenarios like new application support and business migration. This growth causes an urgent demand for automatic database native function synthesis. While recent advances in LLM-based code generation (e.g., Claude Code) show promise, they are too generic for database-specific development. They often hallucinate or overlook critical context because database function synthesis is inherently complex and error-prone, where synthesizing a single function may involve registering multiple function units, linking internal references, and implementing logic correctly. To this end, we propose DBCooker, an LLM-based system for automatically synthesizing database native functions. It consists of three components. First, the function characterization module aggregates multi-source declarations, identifies function units that require specialized coding, and traces cross-unit dependencies. Second, we design operations to address the main synthesis challenges: (1) a pseudo-code-based coding plan generator that constructs structured implementation skeletons by identifying key elements such as reusable referenced functions; (2) a hybrid fill-in-the-blank model guided by probabilistic priors and component awareness to integrate core logic with reusable routines; and (3) three-level progressive validation, including syntax checking, standards compliance, and LLM-guided semantic verification. Finally, an adaptive orchestration strategy unifies these operations with existing tools and dynamically sequences them via the orchestration history of similar functions. Results show that DBCooker outperforms other methods on SQLite, PostgreSQL, and DuckDB (34.55% higher accuracy on average), and can synthesize new functions absent in the latest SQLite (v3.50).

111. 【2604.06228】Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

Link: https://arxiv.org/abs/2604.06228

Authors: Gregory Magarshak

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Information Theory (cs.IT)

Keywords: prefix structure implicitly, structure implicitly defined, introduce probabilistic language, makes explicit, explicit the prefix

Comments: 24 pages, 2 figures

Click to view abstract

Abstract:We introduce probabilistic language tries (PLTs), a unified representation that makes explicit the prefix structure implicitly defined by any generative model over sequences. By assigning to each outgoing edge the conditional probability of the corresponding token or action, a PLT simultaneously serves as: (i) an optimal lossless compressor via frequency-weighted interval encoding, generalizing arithmetic coding to model-conditioned distributions; (ii) a policy representation for sequential decision problems including games, search, and robotic control; and (iii) a memoization index that lets repeated inference queries be answered by structured retrieval rather than full model execution. The central technical result is a prior-guided caching theorem: under a stationary generative distribution, a PLT-guided cache achieves strictly lower expected inference cost than any empirical-frequency cache for all query counts below a threshold that grows with the concentration of the prior. This converts O(n^2) transformer attention cost into an expected cost of p_r * O(log N) + (1 - p_r) * O(n^2), where p_r is the prior-estimated reuse probability and N is the artifact store size. We further introduce a hybrid compression architecture decomposing any dataset into a PLT-covered majority and a sparse residual store, connecting arithmetic coding with Kolmogorov-style program representations and rate-distortion theory. We instantiate the framework across chess, web search, robotics, organizational workflows, and LLM inference, demonstrating that compression, decision making, and computational reuse are all derived from a single probability measure on sequence space.
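
To make the expected-cost expression in the abstract concrete, here is a small numeric sketch. It assumes all hidden constants in the O(·) terms equal 1 and uses a base-2 logarithm. Both are simplifications for illustration, and the function name is hypothetical.

```python
import math


def expected_inference_cost(p_r, n, store_size):
    """Evaluate p_r * log2(N) + (1 - p_r) * n^2 with unit constants.

    p_r        -- prior-estimated reuse probability (cache hit rate)
    n          -- sequence length driving the quadratic attention cost
    store_size -- N, the number of artifacts in the store
    """
    hit_cost = math.log2(store_size)   # structured retrieval on a cache hit
    miss_cost = n ** 2                 # full model execution on a miss
    return p_r * hit_cost + (1 - p_r) * miss_cost


# With n = 1000 and N = 1024, half of the queries being reusable
# roughly halves the expected cost compared to never reusing.
no_reuse = expected_inference_cost(0.0, 1000, 1024)
half_reuse = expected_inference_cost(0.5, 1000, 1024)
```

Under these assumptions the miss term dominates, so the saving scales almost linearly with the reuse probability p_r.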


112. 【2604.06216】Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses

Link: https://arxiv.org/abs/2604.06216

Authors: Khizar Hussain, Bradley A. Malin, Zhijun Yin, Susannah Leigh Rose, Murat Kantarcioglu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: mental health, mental health services

Comments:

Click to view abstract

Abstract:As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs' inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.

Submission history: from Khizar Hussain, [v1] Tue, 17 Mar 2026 21:13:19 UTC (2,139 KB)

Information Retrieval

1. 【2604.07220】HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

Link: https://arxiv.org/abs/2604.07220

Authors: Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah, Hyun-Soo Kang

Categories: Information Retrieval (cs.IR)

Keywords: identify relevant documents, relevant documents, retrieval models fail, fail on reasoning-intensive

Comments: Accepted at CVPR 2026 Workshop GRAIL-V

Click to view abstract

Abstract:Multimodal retrieval models fail on reasoning-intensive queries where images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents -- the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce HIVE (Hypothesis-driven Iterative Visual Evidence Retrieval), a plug-and-play framework that injects explicit visual-text reasoning into a retriever via LLMs. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates visual and logical gaps observed in top-k candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803 real-world queries across 29 technical domains), HIVE achieves a new state-of-the-art aggregated nDCG@10 of 41.7 -- a +9.5 point gain over the best text-only model (DiVeR: 32.2) and +14.1 over the best multimodal model (Nomic-Vision: 27.6), where our reasoning-enhanced base retriever contributes 33.2 and the HIVE framework adds a further +8.5 points -- with particularly strong results in visually demanding domains (Gaming: 68.2, Chemistry: 42.5, Sustainability: 49.4). Compatible with both standard and reasoning-enhanced retrievers, HIVE demonstrates that LLM-mediated visual hypothesis generation and verification can substantially close the multimodal reasoning gap in retrieval.
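
The headline metric here and in the related MM-BRIGHT entries is nDCG@10. For readers unfamiliar with it, a minimal sketch follows. It assumes the common linear-gain formulation and, as a simplification, that the ideal ranking is built only from the documents present in the ranked list. The benchmark's own evaluation script may differ, and the scores in the abstract appear to be reported on a 0 to 100 scale.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(rank + 2)          # rank is 0-based, so +2
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances lists the relevance grade of each retrieved document in
    the order the system returned them. Returns 0.0 when nothing is relevant.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A relevant document at rank 1 yields a score of 1.0, while the same document at rank 2 is discounted by 1/log2(3), about 0.63.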

2. 【2604.07201】BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

Link: https://arxiv.org/abs/2604.07201

Authors: Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Abdelrahman Abdallah, Hyun-Soo Kang

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: underperforming strong text-only, underperforming strong, resolve image-text queries, retrieval systems struggle

Comments: Accepted at CVPR 2026 Workshop GRAIL-V

Click to view abstract

Abstract:Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query -- raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10 -- exceeding the best text-only retriever (32.2) -- demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval.

3. 【2604.07090】Leveraging Artist Catalogs for Cold-Start Music Recommendation

Link: https://arxiv.org/abs/2604.07090

Authors: Yan-Martin Tamm, Gregor Meehan, Vojtěch Nekl, Vojtěch Vančura, Rodrigo Alves, Johan Pauwels, Anna Aljanaki

Categories: Information Retrieval (cs.IR)

Keywords: newly added tracks, added tracks lack, newly added, cold-start problem poses, challenge for music

Comments: Accepted at UMAP 2026

Click to view abstract

Abstract:The item cold-start problem poses a fundamental challenge for music recommendation: newly added tracks lack the interaction history that collaborative filtering (CF) requires. Existing approaches often address this problem by learning mappings from content features such as audio, text, and metadata to the CF latent space. However, previous works either omit artist information or treat it as just another input modality, missing the fundamental hierarchy of artists and items. Since most new tracks come from artists with previous history available, we frame cold-start track recommendation as 'semi-cold' by leveraging the rich collaborative signal that exists at the artist level. We show that artist-aware methods can more than double Recall and NDCG compared to content-only baselines, and propose ACARec, an attention-based architecture that generates CF embeddings for new tracks by attending over the artist's existing catalog. We show that our approach has notable advantages in predicting user preferences for new tracks, especially for new artist discovery and more accurate estimation of cold item popularity.

4. 【2604.07079】MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

Link: https://arxiv.org/abs/2604.07079

Authors: Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Mahmoud Abdalla, Abdelrahman Abdallah, Hyun-Soo Kang

Categories: Information Retrieval (cs.IR)

Keywords: underperforming strong text-only, strong text-only systems, text corpora remains, multimodal retrieval benchmark

Comments:

Click to view abstract

Abstract:Multimodal retrieval over text corpora remains a fundamental challenge: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, a reasoning-intensive multimodal retrieval benchmark, underperforming strong text-only systems. We argue that effective multimodal retrieval requires three tightly integrated capabilities that existing approaches address only in isolation: expanding the query's latent intent, retrieving with a model trained for complex reasoning, and reranking via explicit step-by-step reasoning over candidates. We introduce MARVEL (Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL), a unified pipeline that combines LLM-driven query expansion, MARVEL-Retriever -- a reasoning-enhanced dense retriever fine-tuned for complex multimodal queries -- and GPT-4o-based chain-of-thought reranking with optional multi-pass reciprocal rank fusion. Evaluated on MM-BRIGHT across 29 technical domains, MARVEL achieves 37.9 nDCG@10, surpassing the best multimodal encoder by +10.3 points, outperforming all single-stage baselines in 27 of 29 domains, and matching or approaching the best baseline in the remaining two highly specialized domains (Crypto, Quantum Computing). This demonstrates that reasoning-intensive multimodal retrieval is best addressed through a unified expand-retrieve-rerank framework.
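
The abstract mentions optional multi-pass reciprocal rank fusion (RRF) without giving details. The following is a minimal sketch of standard RRF, assuming the conventional smoothing constant k = 60. Whether MARVEL uses this constant, or how it generates the individual passes, is not stated in the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three hypothetical reranking passes over the same candidate pool.
passes = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(passes)
```

Here document "b" comes out first because it is ranked first in two of the three passes, even though "a" leads one pass.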

5. 【2604.07041】AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views

Link: https://arxiv.org/abs/2604.07041

Authors: Minh Tam Pham, Trinh Pham, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: enabling non-expert users, access structured data, writing SQL manually, translating natural language, natural language queries

Comments:

Click to view abstract

Abstract:Text-to-SQL is the task of translating natural language queries into executable SQL for a given database, enabling non-expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real-world settings, where database schemas are large and questions require multi-step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one-shot generation frequently produces non-executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV-SQL, a framework that decomposes complex Text-to-SQL into a pipeline of specialized LLM agents. Central to AV-SQL is the concept of agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV-SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV-SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state-of-the-art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at this https URL.
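
To illustrate what a CTE-based "view" looks like in practice, here is a toy sketch using Python's built-in `sqlite3` module. The schema, data, and CTE names are invented for illustration. In AV-SQL the CTEs would be produced by the view generator agent, which this sketch does not model.

```python
import sqlite3

# Hypothetical two-table schema standing in for a much larger database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ana', 'PT'), (2, 'Bo', 'SE'), (3, 'Cy', 'PT');
INSERT INTO orders VALUES (1, 1, 40.0), (2, 1, 60.0), (3, 2, 25.0), (4, 3, 10.0);
""")

# Each CTE encapsulates one intermediate step and exposes only the columns
# needed downstream; the final SELECT composes the steps.
query = """
WITH pt_customers AS (
    SELECT id, name FROM customers WHERE country = 'PT'
),
customer_totals AS (
    SELECT customer_id, SUM(total) AS spent FROM orders GROUP BY customer_id
)
SELECT c.name, t.spent
FROM pt_customers AS c
JOIN customer_totals AS t ON t.customer_id = c.id
ORDER BY t.spent DESC;
"""
rows = conn.execute(query).fetchall()
```

Because each CTE names a small, self-contained result, a later step can work with those names instead of the full schema.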

6. 【2604.06928】Leveraging LLMs and Heterogeneous Knowledge Graphs for Persona-Driven Session-Based Recommendation

Link: https://arxiv.org/abs/2604.06928

Authors: Muskan Gupta, Suraj Thapa, Jyotsana Khatri

Categories: Information Retrieval (cs.IR)

Keywords: Session-based recommendation systems, Session-based recommendation, aim to capture, personalized information, user

Comments:

Click to view abstract

Abstract:Session-based recommendation systems (SBRS) aim to capture users' short-term intent from interaction sequences. However, the common assumption of anonymous sessions limits personalization, particularly under sparse or cold-start conditions. Recent advances in LLM-augmented recommendation have shown that LLMs can generate rich item representations, but modeling user personas with LLMs remains challenging due to anonymous sessions. In this work, we propose a persona-driven SBRS framework that explicitly models latent user personas inferred from a heterogeneous knowledge graph (KG) and integrates them into data-driven recommendation. Our framework adopts a two-stage architecture consisting of personalized information extraction and personalized information utilization, inspired by recent chain-of-thought recommendation approaches. In the personalized information extraction stage, we construct a heterogeneous KG that integrates time-independent user-item, item-item, and item-feature associations with metadata from DBpedia. We then learn latent user personas in an unsupervised manner using a Heterogeneous Deep Graph Infomax (HDGI) objective over a KG initialized with LLM-derived item embeddings. In the personalized information utilization stage, the learned persona representations together with LLM-derived item embeddings are incorporated into a modified architecture of data-driven SBRS to generate a candidate set of relevant items, followed by reranking using the base sequential model to emphasize short-term session intent. Unlike prior approaches that rely solely on sequence modeling or text-based user representations, our method grounds user persona modeling in structured relational signals derived from a KG. Experiments on Amazon Books and Amazon Movies TV demonstrate that our approach consistently improves over sequential models with user embeddings derived using session history.

7. 【2604.06718】CASE: Cadence-Aware Set Encoding for Large-Scale Next Basket Repurchase Recommendation

Link: https://arxiv.org/abs/2604.06718

Authors: Yanan Cao, Ashish Ranjan, Sinduja Subramaniam, Evren Korpeoglu, Kaushiki Nag, Kannan Achan

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: basket repurchase recommendation, large-scale retail recommendation, frequent replenishment, timing follows stable, basket repurchase

Comments: Accepted at SIGIR 2026 Industry Track

Click to view abstract

Abstract:Repurchase behavior is a primary signal in large-scale retail recommendation, particularly in categories with frequent replenishment: many items in a user's next basket were previously purchased and their timing follows stable, item-specific cadences. Yet most next basket repurchase recommendation models represent history as a sequence of discrete basket events indexed by visit order, which cannot explicitly model elapsed calendar time or update item rankings as days pass between purchases. We present CASE (Cadence-Aware Set Encoding for next basket repurchase recommendation), which decouples item-level cadence learning from cross-item interaction, enabling explicit calendar-time modeling while remaining production-scalable. CASE represents each item's purchase history as a calendar-time signal over a fixed horizon, applies shared multi-scale temporal convolutions to capture recurring rhythms, and uses induced set attention to model cross-item dependencies with sub-quadratic complexity, allowing efficient batch inference at scale. Across three public benchmarks and a proprietary dataset, CASE consistently improves Precision, Recall, and NDCG at multiple cutoffs compared to strong next basket prediction baselines. In a production-scale evaluation with tens of millions of users and a large item catalog, CASE achieves up to 8.6% relative Precision and 9.9% Recall lift at top-5, demonstrating that scalable cadence-aware modeling yields measurable gains in both benchmark and industrial settings.
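
The key representational idea, a per-item calendar-time signal over a fixed horizon, can be sketched as follows. This assumes a daily binary encoding with the most recent day last. The actual resolution, horizon length, and value encoding used by CASE are not given in the abstract, and the function name is hypothetical.

```python
from datetime import date


def calendar_signal(purchase_dates, as_of, horizon_days=28):
    """Encode one item's purchase history as a fixed-length daily signal.

    Position horizon_days - 1 is the as_of day, position 0 is the oldest day
    in the window. Purchases outside the window are ignored. Unlike a
    visit-indexed sequence, the gaps between purchases stay visible.
    """
    signal = [0] * horizon_days
    for purchased_on in purchase_dates:
        age = (as_of - purchased_on).days
        if 0 <= age < horizon_days:
            signal[horizon_days - 1 - age] = 1
    return signal


# An item bought 6 days ago and yesterday, viewed over a 7-day horizon.
weekly = calendar_signal(
    [date(2026, 4, 1), date(2026, 4, 2), date(2026, 4, 7)],
    as_of=date(2026, 4, 8),
    horizon_days=7,
)
```

Re-evaluating the same history on a later `as_of` date shifts the signal, which is what lets a model update item rankings as days pass without a new basket event.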

8. 【2604.06710】ATANT: An Evaluation Framework for AI Continuity

Link: https://arxiv.org/abs/2604.06710

Authors: Samuel Sameer Tanguturi

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: reconstruct meaningful context, Automated Test, Narrative Truth, open evaluation framework, ability to persist

Comments: 7 pages, 8 tables. Framework and evaluation protocol available at [this https URL](https://github.com/Kenotic-Labs/ATANT)

Click to view abstract

Abstract:We present ATANT (Automated Test for Acceptance of Narrative Truth), an open evaluation framework for measuring continuity in AI systems: the ability to persist, update, disambiguate, and reconstruct meaningful context across time. While the AI industry has produced memory components (RAG pipelines, vector databases, long context windows, profile layers), no published framework formally defines or measures whether these components produce genuine continuity. We define continuity as a system property with 7 required properties, introduce a 10-checkpoint evaluation methodology that operates without an LLM in the evaluation loop, and present a narrative test corpus of 250 stories comprising 1,835 verification questions across 6 life domains. We evaluate a reference implementation across 5 test suite iterations, progressing from 58% (legacy architecture) to 100% in isolated mode (250 stories) and 100% in 50-story cumulative mode, with 96% at 250-story cumulative scale. The cumulative result is the primary measure: when 250 distinct life narratives coexist in the same database, the system must retrieve the correct fact for the correct context without cross-contamination. ATANT is system-agnostic, model-independent, and designed as a sequenced methodology for building and validating continuity systems. The framework specification, example stories, and evaluation protocol are available at this https URL. The full 250-story corpus will be released incrementally.

9. 【2604.06616】CubeGraph: Efficient Retrieval-Augmented Generation for Spatial and Temporal Data

Link: https://arxiv.org/abs/2604.06616

Authors: Mingyu Yang, Wentao Li, Wei Wang

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: modern retrieval-augmented generation, queries combining high-dimensional, Hybrid queries combining, combining high-dimensional vector, high-dimensional vector similarity

Comments: Technical Report

Click to view abstract

Abstract:Hybrid queries combining high-dimensional vector similarity search with spatio-temporal filters are increasingly critical for modern retrieval-augmented generation (RAG) systems. Existing systems typically handle these workloads by nesting vector indices within low-dimensional spatial structures, such as R-trees. However, this decoupled architecture fragments the vector space, forcing the query engine to invoke multiple disjoint sub-indices per query. This fragmentation destroys graph routing connectivity, incurs severe traversal overhead, and struggles to optimize for complex spatial boundaries. In this paper, we propose CubeGraph, a novel indexing framework designed to natively integrate vector search with arbitrary spatial constraints. CubeGraph partitions the spatial domain using a hierarchical grid, maintaining modular vector graphs within each cell. During query execution, CubeGraph dynamically stitches together adjacent cube-level indices on the fly whenever their spatial cells intersect with the query filter. This dynamic graph integration restores global connectivity, enabling a unified, single-pass nearest-neighbor traversal that eliminates the overhead of fragmented sub-index invocations. Extensive evaluations on real-world datasets demonstrate that CubeGraph significantly outperforms state-of-the-art baselines, offering superior query execution performance, scalability, and flexibility for complex hybrid workloads.

10. 【2604.06571】LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Link: https://arxiv.org/abs/2604.06571

Authors: Joshua Castillo, Ravi Mukkamala

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: including structured forms, narrative web profiles, child-safety investigations rely, Missing-person and child-safety, bulletin-style posters

Comments: 9 pages, 6 figures. Accepted at International Conference on Intelligent Digitization of Systems and Services (IDSS 2026)

Click to view abstract

11. 【2604.06420】The Unreasonable Effectiveness of Data for Recommender Systems

Link: https://arxiv.org/abs/2604.06420

Authors: Youssef Abdou

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: providing meaningful gains, stops providing meaningful, processing large-scale interaction, additional data stops, data stops providing

Comments: 5 pages, 6 figures. Poster paper

Click to view abstract

Abstract:In recommender systems, collecting, storing, and processing large-scale interaction data is increasingly costly in terms of time, energy, and computation, yet it remains unclear when additional data stops providing meaningful gains. This paper investigates how offline recommendation performance evolves as the size of the training dataset increases and whether a saturation point can be observed. We implemented a reproducible Python evaluation workflow with two established toolkits, LensKit and RecBole, included 11 large public datasets with at least 7 million interactions, and evaluated 10 tool-algorithm combinations. Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. Overall, raw NDCG usually increased with sample size, with no observable saturation point. To make result groups comparable, we applied min-max normalization within each group, revealing a clear positive trend in which around 75% of the points at the largest completed sample size also achieved the group's best observed performance. A late-stage slope analysis over the final 10-30% of each group further supported this upward trend: the interquartile range remained entirely non-negative with a median near 1.0. In summary, for traditional recommender systems on typical user-item interaction data, incorporating more training data remains primarily beneficial, while weaker scaling behavior is concentrated in atypical dataset cases and in the algorithmic outlier RecBole BPR under our setup.
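
The per-group min-max normalization mentioned in the abstract can be sketched as below. The grouping key (a dataset and algorithm pair) and the handling of constant groups are assumptions made for illustration, and the NDCG values are invented.

```python
def min_max_normalize(values):
    """Rescale a group's scores to [0, 1]; a constant group maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical NDCG@10 values at increasing sample sizes for two result groups.
raw_ndcg_by_group = {
    ("dataset_a", "algo_x"): [0.25, 0.5, 0.75],
    ("dataset_b", "algo_y"): [0.01, 0.02, 0.03, 0.05],
}
normalized = {group: min_max_normalize(scores)
              for group, scores in raw_ndcg_by_group.items()}

# A group "peaks at the largest sample size" when its last point equals 1.0.
peaks_at_largest = {group: scores[-1] == 1.0
                    for group, scores in normalized.items()}
```

Normalizing within each group removes differences in absolute NDCG between datasets, so only the shape of each scaling curve is compared.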

12. 【2604.06263】Incentive-Aware Multi-Fidelity Optimization for Generative Advertising in Large Language Models

Link: https://arxiv.org/abs/2604.06263

Authors: Jiayuan Liu, Barry Wang, Jiarui Gan, Tonghan Wang, Leon Xie, Mingyu Guo, Vincent Conitzer

Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: large language model, responses requires optimizing, requires optimizing sponsorship, optimizing sponsorship configurations, language model

Comments:

Click to view abstract

Abstract:Generative advertising in large language model (LLM) responses requires optimizing sponsorship configurations under two strict constraints: the strategic behavior of advertisers and the high cost of stochastic generations. To address this, we propose the Incentive-Aware Multi-Fidelity Mechanism (IAMFM), a unified framework coupling Vickrey-Clarke-Groves (VCG) incentives with Multi-Fidelity Optimization to maximize expected social welfare. We compare two algorithmic instantiations (elimination-based and model-based), revealing their budget-dependent performance trade-offs. Crucially, to make VCG computationally feasible, we introduce Active Counterfactual Optimization, a "warm-start" approach that reuses optimization data for efficient payment calculation. We provide formal guarantees for approximate strategy-proofness and individual rationality, establishing a general approach for incentive-aligned, budget-constrained generative processes. Experiments demonstrate that IAMFM outperforms single-fidelity baselines across diverse budgets.
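
For background on the VCG component, here is a minimal sketch of textbook VCG over a small finite set of outcomes with exactly known valuations. IAMFM, by contrast, has to estimate welfare from costly stochastic generations under a budget, which this sketch does not attempt. The outcome names and values are invented.

```python
def vcg(valuations):
    """Pick the welfare-maximizing outcome and compute VCG payments.

    valuations[i][o] is advertiser i's value for outcome o. Each advertiser
    pays the externality it imposes: the best welfare the others could get
    without it, minus the others' welfare at the chosen outcome.
    """
    outcomes = list(valuations[0].keys())

    def welfare(outcome, exclude=None):
        return sum(v[outcome] for j, v in enumerate(valuations) if j != exclude)

    chosen = max(outcomes, key=welfare)
    payments = []
    for i in range(len(valuations)):
        best_without_i = max(welfare(o, exclude=i) for o in outcomes)
        payments.append(best_without_i - welfare(chosen, exclude=i))
    return chosen, payments


# Two advertisers, two candidate sponsorship configurations.
chosen, payments = vcg([
    {"config_a": 10, "config_b": 0},   # advertiser 0
    {"config_a": 0, "config_b": 6},    # advertiser 1
])
```

Advertiser 0 wins with config_a and pays 6, the value advertiser 1 loses as a result. Computing each "best without i" term naively requires re-solving the optimization, which is the cost the abstract's warm-start approach is meant to reduce.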

13. 【2604.06232】What Do Humanities Scholars Need? A User Model for Recommendation in Digital Archives

Link: https://arxiv.org/abs/2604.06232

Authors: Florian Atzenhofer-Baumgartner, Dominik Kowald

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: typically assume stable, assume stable preferences, high-volume consumer contexts, recommender systems, typically assume

Comments: To be presented at the 34th ACM Conference on User Modeling, Adaptation and Personalization (UMAP'26), June 08-11, 2026, Gothenburg, Sweden

Click to view abstract

Abstract:User models for recommender systems (RecSys) typically assume stable preferences, similarity-based relevance, and session-bounded interactions -- assumptions derived from high-volume consumer contexts. This paper investigates these assumptions for humanities scholars working with digital archives. Following a human-centered design approach, we conducted focus groups and analyzed interview data from 18 researchers. Our analysis identifies four dimensions where scholarly information-seeking diverges from common RecSys user modeling: (1) context volatility -- preferences shift with research tasks and domain expertise; (2) epistemic trust -- relevance depends on verifiable provenance; (3) contrastive seeking -- researchers seek items that challenge their current direction; and (4) strand continuity -- research spans long-term threads rather than discrete sessions. We discuss implications for user modeling and outline how these dimensions relate to collaborative filtering, content-based, and session-based recommendation. We propose these dimensions as a diagnostic framework applicable beyond archives to similar application domains where typical user modeling assumptions may not hold.

14. 【2604.06231】Automating Database-Native Function Code Synthesis with LLMs

Link: https://arxiv.org/abs/2604.06231

Authors: Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, Fan Wu

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)

Keywords: database native functions, database native, Database systems incorporate, business migration, database native function

Comments: Please visit our homepage at: [this https URL](https://code4db.github.io/hi-opencook/). The code is available at: [this https URL](https://github.com/weAIDB/OpenCook)

15. 【2604.06228】Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

Link: https://arxiv.org/abs/2604.06228

Authors: Gregory Magarshak

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Information Theory (cs.IT)

Keywords: prefix structure implicitly, structure implicitly defined, introduce probabilistic language, makes explicit, explicit the prefix

Comments: 24 pages, 2 figures

MSC classes: 94A29, 68P30, 68T50

ACM classes: E.4; I.2.7; H.3.3

16. 【2604.06179】ARIA: Adaptive Retrieval Intelligence Assistant -- A Multimodal RAG Framework for Domain-Specific Engineering Education

Link: https://arxiv.org/abs/2604.06179

Authors: Yue Luo, Dibakar Roy Sarkar, Rachel Herring Sangree, Somdatta Goswami

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Developing effective, educational support systems, support systems, systems is central, central to advancing

Comments:

17. 【2604.06177】WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search

Link: https://arxiv.org/abs/2604.06177

Authors: Yuelin Hu, Zhengxue Cheng, Ronghua Wu, Qunshan Gu, Hongwei Hu, Wei Liu, Qiao Liang, Li Song

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Specialized web tasks, missing domain priors, pharmaceuticals remain challenging, remain challenging due, Specialized web

Comments: Accepted by ICASSP 2026

18. 【2604.06176】Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in Qwen3-Embedding Model

Link: https://arxiv.org/abs/2604.06176

Authors: Weishu Chen, Zhouhui Hou, Mingjie Zhan, Zhicheng Zhao, Fei Su

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: realistic conversational settings, structured conversational artifacts, queries are short, present an empirical, empirical study

Comments:

19. 【2604.06173】Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

Link: https://arxiv.org/abs/2604.06173

Authors: Kyubyung Chae, Jewon Yeom, Jeongjae Park, Seunghyun Bae, Ijun Jang, Hyunbin Jin, Jinkwan Jang, Taesup Kim

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: overlooking the unique, predominantly focused, unique challenges, statute-centric regulatory reasoning, case law

Comments:

Abstract:Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.

20. 【2604.06172】EviSnap: Faithful Evidence-Cited Explanations for Cold-Start Cross-Domain Recommendation

Link: https://arxiv.org/abs/2604.06172

Authors: Yingjun Dai, Ahmed El-Roby

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Cold-start cross-domain recommender, existing CDR models, Cold-start cross-domain, cross-domain recommender, systems predict

Comments: 8 pages

Abstract:Cold-start cross-domain recommender (CDR) systems predict a user's preferences in a target domain using only their source-domain behavior, yet existing CDR models either map opaque embeddings or rely on post-hoc or LLM-generated rationales that are hard to audit. We introduce EviSnap, a lightweight CDR framework whose predictions are explained by construction with evidence-cited, faithful rationales. EviSnap distills noisy reviews into compact facet cards using an LLM offline, pairing each facet with verbatim supporting sentences. It then induces a shared, domain-agnostic concept bank by clustering facet embeddings and computes user-positive, user-negative, and item-presence concept activations via evidence-weighted pooling. A single linear concept-to-concept map transfers users across domains, and a linear scoring head yields per-concept additive contributions, enabling exact score decompositions and counterfactual 'what-if' edits grounded in the cited sentences. Experiments on the Amazon Reviews dataset across six transfers among Books, Movies, and Music show that EviSnap consistently outperforms strong mapping and review-text baselines while passing deletion- and sufficiency-based tests for explanation faithfulness.
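
The "exact score decomposition" property of a linear scoring head can be illustrated with a short sketch. The feature construction here (signed user activation times item-presence activation per concept) is an assumption made for the example, as are the function name and all numbers; the abstract does not give EviSnap's actual features.

```python
def score_with_contributions(user_act, item_act, weights, bias=0.0):
    """Linear head over per-concept features; returns (score, per-concept contributions)."""
    contributions = [w * u * i for w, u, i in zip(weights, user_act, item_act)]
    return bias + sum(contributions), contributions

# Hypothetical activations for three concepts.
user_act = [0.8, 0.1, -0.5]   # assumed signed value: positive minus negative activation
item_act = [1.0, 0.0, 1.0]    # assumed item-presence activation
weights = [2.0, 1.0, 1.0]

score, contrib = score_with_contributions(user_act, item_act, weights)

# What-if edit: remove the third concept from the item and rescore.
edited_score, _ = score_with_contributions(user_act, [1.0, 0.0, 0.0], weights)
```

Because the head is linear, the score change from the what-if edit equals the removed concept's contribution, so no approximation is involved in the explanation.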

21. 【2511.10354】Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Link: https://arxiv.org/abs/2511.10354

Authors: Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: structured Knowledge Graphs, Cultural Heritage, Large Language Model-based, Cultural Heritage texts, Language Model-based Knowledge

Comments: 46 pages

22. 【2604.06222】The Geometry of Forgetting

Link: https://arxiv.org/abs/2604.06222

Authors: Sambartha Ray Barman, Andrey Starenky, Sophia Bodnar, Nikhil Narasimhan, Ashwin Gopinath

Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)

Keywords: Abstract, forget, approx, human, sim

Comments:

Abstract:Why do we forget? Why do we remember things that never happened? The conventional answer points to biological hardware. We propose a different one: geometry. Here we show that high-dimensional embedding spaces, subjected to noise, interference, and temporal degradation, reproduce quantitative signatures of human memory with no phenomenon-specific engineering. Power-law forgetting ($b = 0.460 \pm 0.183$, human $b \approx 0.5$) arises from interference among competing memories, not from decay. The identical decay function without competitors yields $b \approx 0.009$, fifty times smaller. Time alone does not produce forgetting in this system. Competition does. Production embedding models (nominally 384 to 1,024 dimensions) concentrate their variance in only $\sim 16$ effective dimensions, placing them deep in the interference-vulnerable regime. False memories require no engineering at all: cosine similarity on unmodified pre-trained embeddings reproduces the Deese-Roediger-McDermott false alarm rate ($0.583$ versus human $\sim 0.55$) with zero parameter tuning and no boundary conditions. We did not build a false memory system. We found one already present in the raw geometry of semantic space. These results suggest that core memory phenomena are not bugs of biological implementation but features of any system that organizes information by meaning and retrieves it by proximity.
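
The exponent $b$ quoted above is the slope of a power-law forgetting curve. A common way to estimate it, shown here as a sketch, is a least-squares line in log-log space for the assumed model $R(t) = a\,t^{-b}$. The function name is hypothetical and the data below are synthetic, generated with $b = 0.5$, not taken from the paper.

```python
import math

def fit_power_law(times, retention):
    """Least-squares fit of retention = a * t ** (-b) in log-log space; returns (a, b)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(r) for r in retention]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic retention values generated with b = 0.5 (not data from the paper).
times = [1, 2, 4, 8, 16, 32]
retention = [0.9 * t ** -0.5 for t in times]
a, b = fit_power_law(times, retention)
```

On noisy measurements the same fit gives an estimate with an uncertainty, which is how a value such as $0.460 \pm 0.183$ would typically be reported.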

Computer Vision

1. 【2604.07350】Fast Spatial Memory with Elastic Test-Time Training

Link: https://arxiv.org/abs/2604.07350

Authors: Ziqiao Ma, Xueyang Yu, Haoyu Zhen, Yuncong Yang, Joyce Chai, Chuang Gan

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: shown strong performance, fully plastic inference-time, inference-time updates remain, updates remain vulnerable, plastic inference-time updates

Comments: Project Page: [this https URL](https://fast-spatial-memory.github.io/)

Abstract:Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training, inspired by elastic weight consolidation, which stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-trained FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.
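
The combination of a Fisher-weighted elastic prior and an exponential-moving-average anchor can be sketched on plain lists of weights. This assumes the standard elastic weight consolidation penalty $\frac{\lambda}{2}\sum_i F_i (w_i - \bar{w}_i)^2$; the abstract does not state the exact objective, so the function name, the hyperparameters, and the update order are illustrative assumptions.

```python
def elastic_step(weights, grads, anchor, fisher, lr=0.1, lam=1.0, ema=0.9):
    """One fast-weight update with a Fisher-weighted pull toward the anchor.

    Takes one gradient step on task_loss + (lam / 2) * sum(F_i * (w_i - anchor_i) ** 2),
    then moves the anchor toward the new weights by an exponential moving average.
    """
    new_weights = [
        w - lr * (g + lam * f * (w - a))
        for w, g, a, f in zip(weights, grads, anchor, fisher)
    ]
    new_anchor = [ema * a + (1.0 - ema) * w for a, w in zip(anchor, new_weights)]
    return new_weights, new_anchor

# Hypothetical two-parameter example with zero task gradient:
# only the parameter with nonzero Fisher weight is pulled back toward the anchor.
w, a = elastic_step(
    weights=[1.0, 1.0], grads=[0.0, 0.0],
    anchor=[0.0, 0.0], fisher=[1.0, 0.0], lr=0.5,
)
```

Parameters with a large Fisher weight are held near the anchor (stability), while the rest stay free to adapt to the current chunk (plasticity).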

2. 【2604.07348】MoRight: Motion Control Done Right

Link: https://arxiv.org/abs/2604.07348

Authors: Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta, Shenlong Wang, Sanja Fidler, Jun Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Generating motion-controlled videos, trigger coherent reactions, Generating motion-controlled, user-specified actions drive, actions drive physically

Comments: Project Page: [this https URL](https://research.nvidia.com/labs/sil/projects/moright)

Abstract:Generating motion-controlled videos--where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints--demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motion. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static-view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and MoRight predicts consequences (forward reasoning), or specify desired passive outcomes and MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

3. 【2604.07340】TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Link: https://arxiv.org/abs/2604.07340

Authors: Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: latent representation collapse, deep compression autoencoders, propose TC-AE, representation collapse, compression autoencoders

Comments:

Abstract:We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generation-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.

4. 【2604.07338】Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Link: https://arxiv.org/abs/2604.07338

Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: improved image captioning, Recent advances, advances in vision-language, improved image, image captioning

Comments:

5. 【2604.07337】From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians

Link: https://arxiv.org/abs/2604.07337

Authors: Diego Gomez, Antoine Guédon, Nissim Maruani, Bingchen Gong, Maks Ovsjanikov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extraction fundamentally difficult, opacity-based formulation makes, Signed Distance Fields, makes surface extraction, surface extraction fundamentally

Comments: Our project page is available at [this http URL](http://diego1401.github.io/BlobsToSpokesWebsite/index.html)

Abstract:3D Gaussian Splatting (3DGS) has revolutionized fast novel view synthesis, yet its opacity-based formulation makes surface extraction fundamentally difficult. Unlike implicit methods built on Signed Distance Fields or occupancy, 3DGS lacks a global geometric field, forcing existing approaches to resort to heuristics such as TSDF fusion of blended depth maps. Inspired by the Objects as Volumes framework, we derive a principled occupancy field for Gaussian Splatting and show how it can be used to extract highly accurate watertight meshes of complex scenes. Our key contribution is to introduce a learnable oriented normal at each Gaussian element and to define an adapted attenuation formulation, which leads to closed-form expressions for both the normal and occupancy fields at arbitrary locations in space. We further introduce a novel consistency loss and a dedicated densification strategy to force Gaussians to wrap the entire surface by closing geometric holes, ensuring a complete shell of oriented primitives. We modify the differentiable rasterizer to output depth as an isosurface of our continuous model, and introduce Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution. We additionally expose fundamental biases in standard surface evaluation protocols and propose two more rigorous alternatives. Overall, our method, Gaussian Wrapping, sets a new state of the art on DTU and Tanks and Temples, producing complete, watertight meshes at a fraction of the size of concurrent work and recovering thin structures such as the notoriously elusive bicycle spokes.

6. 【2604.07331】RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

Link: https://arxiv.org/abs/2604.07331

Authors: Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scaling up robot, require human data, require human, Scaling, long-horizon interactions

Comments: 8 pages, 4 figures. *Equal contribution by first three authors. Project webpage: [this https URL](https://roshi-mocap.github.io/)

Abstract:Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning. For videos, data and more, visit the project webpage: https://roshi-mocap.github.io/

7. 【2604.07329】Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling

Link: https://arxiv.org/abs/2604.07329

Authors: Junqi Liu, Xinze Zhou, Wenxuan Li, Scott Ye, Arkadiusz Sitek, Xiaofeng Yang, Yucheng Tang, Daguang Xu, Kai Ding, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: higher spatial resolution, lower noise compared, availability restricts large-scale, restricts large-scale research, clinical availability restricts

Comments:

Abstract:Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale. As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.

8. 【2604.07306】Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

Link: https://arxiv.org/abs/2604.07306

Authors: Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Kai Wang, Zheng Wang, Peng Hu, Xi Peng, Hongyuan Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Existing dynamic data, Existing dynamic, noisy-label settings, methods often fail, fail under noisy-label

Comments: Published in CVPR 2026 Findings

Abstract:Existing dynamic data pruning methods often fail under noisy-label settings, as they typically rely on per-sample loss as the ranking criterion. This could mistakenly lead to preserving noisy samples due to their high loss values, resulting in a significant performance drop. To address this, we propose AlignPrune, a noise-robust module designed to enhance the reliability of dynamic pruning under label noise. Specifically, AlignPrune introduces the Dynamic Alignment Score (DAS), which is a loss-trajectory-based criterion that enables more accurate identification of noisy samples, thereby improving pruning effectiveness. As a simple yet effective plug-and-play module, AlignPrune can be seamlessly integrated into state-of-the-art dynamic pruning frameworks, consistently outperforming them without modifying either the model architecture or the training pipeline. Extensive experiments on five widely-used benchmarks across various noise types and pruning ratios demonstrate the effectiveness of AlignPrune, boosting accuracy by up to 6.3% over state-of-the-art baselines. Our results offer a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real-world scenarios. Code is available at: this https URL.

9. 【2604.07298】Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Link: https://arxiv.org/abs/2604.07298

Authors: Xin Tian, Jiuliu Lu, Ephraim Tsalik, Bart Wanders, Colleen Knoth, Julian Knight

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Multiple Instance Learning, gigapixel whole-slide image, Multiple Instance, Instance Learning, whole-slide image

Comments: 10 pages, 2 figures, 2 tables

Abstract:Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive with strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches external AUC 0.845 ± 0.019.
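
Mechanism (i), routing as entropic optimal transport with capacity marginals, can be sketched with plain Sinkhorn iterations. This is a generic textbook version, not ROAM itself: the graph-regularised diffusion of mechanism (ii) is omitted, and the function name, the cost values, the uniform region mass, and the `eps` and `n_iters` settings are assumptions for illustration.

```python
import math

def sinkhorn_route(cost, capacity, eps=0.5, n_iters=200):
    """Entropic OT plan between regions (rows) and experts (columns).

    cost[r][e] is the cost of sending region r to expert e. Each region carries
    mass 1 / n_regions, and capacity[e] is the share of total mass expert e may
    receive (capacities must sum to 1).
    """
    n_regions, n_experts = len(cost), len(cost[0])
    row_mass = [1.0 / n_regions] * n_regions
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n_regions
    v = [1.0] * n_experts
    for _ in range(n_iters):
        # Rescale rows to match region mass, then columns to match expert capacity.
        u = [row_mass[r] / sum(kernel[r][e] * v[e] for e in range(n_experts))
             for r in range(n_regions)]
        v = [capacity[e] / sum(kernel[r][e] * u[r] for r in range(n_regions))
             for e in range(n_experts)]
    return [[u[r] * kernel[r][e] * v[e] for e in range(n_experts)]
            for r in range(n_regions)]

# Hypothetical costs: all four regions prefer expert 0, which softmax routing would overload.
cost = [[0.1, 1.0], [0.2, 1.0], [0.1, 0.9], [0.3, 1.0]]
plan = sinkhorn_route(cost, capacity=[0.5, 0.5])
expert_load = [sum(row[e] for row in plan) for e in range(2)]
```

Even though every region prefers the same expert, the column constraint forces each expert to receive its stated share, which is how balance is obtained without an auxiliary load-balancing loss.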

10. 【2604.07282】Are Face Embeddings Compatible Across Deep Neural Network Models?

Link: https://arxiv.org/abs/2604.07282

Authors: Fizza Rubab, Yiying Tong, Arun Ross

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: deep neural network, made rapid strides, past decade due, Automated face recognition, neural network

Comments:

Abstract:Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models--both domain-specific and foundation models--encode facial identity in similar ways, despite being trained on different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces imputed by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.

11. 【2604.07279】Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

Link: https://arxiv.org/abs/2604.07279

Authors: Changkun Liu, Jiezhi Yang, Zeman Li, Yuan Deng, Jiancong Guo, Luca Ballan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long visual streams, augmented reality, efficiently and consistently, suited to robotics, robotics and augmented

Comments: Project page: [this https URL](https://lck666666.github.io/Mem3R/)

Abstract:Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/

12. 【2604.07273】GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

Link: https://arxiv.org/abs/2604.07273

Authors: Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, Junxuan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion-based generative model, photorealistic full-body avatars, diffusion-based generative, text and image, editing photorealistic full-body

Comments:

Abstract:We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at this https URL.

13. 【2604.07263】BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving

Link: https://arxiv.org/abs/2604.07263

Authors: Yuhang Wang, Yiyao Xu, Chaoyun Yang, Lingyao Li, Jingran Sun, Hao Zhou

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: remain continuously attentive, production vehicles rely, Existing driving automation, ready to intervene, rely on human

Comments:

Abstract:Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware HMI, yet existing datasets rarely capture the multimodal context, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks: driving action understanding, handover prediction, and takeover prediction, and evaluate baselines spanning sequence models, classical classifiers, and zero-shot VLMs. Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find that takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.

14. 【2604.07254】Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

Link: https://arxiv.org/abs/2604.07254

Authors: Icaro Re Depaolini, Uri Hasson

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: rely on human-like, human-like information, information or reveal, reveal the cues, cues underlying

Comments:

Abstract:Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.

15. 【2604.07250】Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Link: https://arxiv.org/abs/2604.07250

Authors: Yatong Lan, Rongkui Tang, Lei He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reduce camera-rig dependency, standardized virtual views, generating standardized virtual, Extrapolative novel view, heterogeneous sensors

Comments:

Abstract:Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.

16. 【2604.07230】PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

Link: https://arxiv.org/abs/2604.07230

Authors: Ruihang Xu, Dewei Zhou, Xiaolong Shen, Fan Ma, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Achieving physically accurate, Achieving physically, interactive world models, physically accurate object, potential applications

Comments:

Abstract:Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

17. 【2604.07210】VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

Link: https://arxiv.org/abs/2604.07210

Authors: Jian Yu, Fei Shen, Cong Wang, Yi Xin, Si Shen, Xiaoyu Du, Jinhui Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: driven remarkable advancements, real-world fashion workflows, treat garment generation, fashion image generation, separate problems

Comments:

Abstract:Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO). Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.

18. 【2604.07209】INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

Link: https://arxiv.org/abs/2604.07209

Authors: InSpatio Team (alphabetical order): Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time interactivity remains, Building world models, Building world, computer vision, interactivity remains

Comments:

Abstract:Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

19. 【2604.07201】BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

Link: https://arxiv.org/abs/2604.07201

Authors: Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Abdelrahman Abdallah, Hyun-Soo Kang

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: underperforming strong text-only, underperforming strong, resolve image-text queries, retrieval systems struggle

Comments: Accepted at CVPR 2026 Workshop GRAIL-V

20. 【2604.07182】TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification

Link: https://arxiv.org/abs/2604.07182

Authors: Rafi Ahamed, Sidratul Moon Nafsin, Md Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Abu Raihan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: global economic force, beverage after water, scale and influence, consumed beverage, cultural staple

Comments:

Abstract:As the world's second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. The precise identification and detection of tea leaf disease is therefore crucial. With this goal, we evaluated several Convolutional Neural Network (CNN) models on the teaLeafBD dataset, three of which showed notable performance: DenseNet201, MobileNetV2, and InceptionV3. The teaLeafBD dataset contains seven classes (six disease classes and one healthy class), collected under various field conditions reflecting real-world challenges. Among the CNN models, DenseNet201 achieved the highest test accuracy of 99%. To enhance model reliability and interpretability, we implemented Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, together with adversarial training to increase the noise resistance of the model. Finally, we developed a prototype to apply the model's capabilities to real-life agriculture. This paper illustrates the capability of deep learning models to classify diseases in real-life tea leaf disease detection and management.

21. 【2604.07180】Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

Link: https://arxiv.org/abs/2604.07180

Authors: Kartikay Tehlan, Lukas Förner, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: longitudinal multi-parametric MRI, multi-parametric MRI analysis, MRI analysis based, energy, multi-parametric MRI

Comments: The code is available at https://github.com/tkartikay/EnFold-MRI

Abstract:We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_{\theta}(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.

22. 【2604.07175】Multiple Domain Generalization Using Category Information Independent of Domain Differences

Link: https://arxiv.org/abs/2604.07175

Authors: Reiji Saito, Kazuhiro Hotta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: maintain high accuracy, Domain, domain differences, technique aimed, aimed at enabling

Comments:

Abstract:Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. First, we propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Second, because extracting information independent of domain differences cannot completely bridge the domain gap between training and test data, we absorb the remaining gap using the quantum vectors in Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our method improved the accuracy compared to conventional methods.

23. 【2604.07166】DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

Link: https://arxiv.org/abs/2604.07166

Authors: Robert Zimmermann, Thomas Norrenbrock, Bodo Rosenhahn

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: create substantial hurdles, high-dimensional representations create, representations create substantial, Quadratic Programming Enhanced, visual foundation models

Comments: Accepted to the 5th Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2026

Abstract:Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.

24. 【2604.07154】Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations

Link: https://arxiv.org/abs/2604.07154

Authors: Sonja Adomeit, Kartikay Tehlan, Lukas Förner, Katharina Weisser, Helen Scholtiseek, David Kaufmann, Julie Steinestel, Constantin Lapa, Thomas Kröncke, Thomas Wendler

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shared versus modality-specific, approaches rarely define, Multimodal imaging analysis, joint latent representations, MRI feature manifold

Comments: The code is available at https://github.com/SonjaA14/inrmri2pet

Abstract:Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality-specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity-based, non-spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection-based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue-level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.
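
The projection-based penalty described in this abstract can be pictured as measuring how much of a residual vector lies inside the span of the MRI feature columns. The sketch below is an illustration under stated assumptions, not the authors' implementation: Gram-Schmidt is swapped in for the SVD the abstract mentions (so that only the standard library is needed), the squared-norm form of the penalty is assumed, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def orthonormal_basis(columns: Sequence[Sequence[float]],
                      tol: float = 1e-12) -> List[List[float]]:
    """Gram-Schmidt: orthonormal basis for the span of the given columns."""
    basis: List[List[float]] = []
    for col in columns:
        v = list(col)
        for b in basis:
            coeff = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > tol:  # skip columns that add no new direction
            basis.append([vi / norm for vi in v])
    return basis


def in_span_penalty(feature_columns: Sequence[Sequence[float]],
                    residual: Sequence[float]) -> float:
    """Squared norm of the residual's projection onto the feature span."""
    basis = orthonormal_basis(feature_columns)
    return sum(sum(r * b for r, b in zip(residual, vec)) ** 2 for vec in basis)


# Two feature columns spanning the x-y plane of a 3-sample space.
features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(in_span_penalty(features, [0.0, 0.0, 2.0]))  # orthogonal residual
print(in_span_penalty(features, [3.0, 4.0, 0.0]))  # residual inside the span
```

A residual orthogonal to every feature column incurs no penalty, while one that could have been explained by the features is penalised by its full squared norm.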

25. 【2604.07151】An RTK-SLAM Dataset for Absolute Accuracy Evaluation in GNSS-Degraded Environments

Link: https://arxiv.org/abs/2604.07151

Authors: Wei Zhang, Vincent Ress, David Skuddis, Uwe Soergel, Norbert Haala

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficient georeferenced surveying, integrate simultaneous localization, globally referenced coordinates, systems integrate simultaneous, RTK-SLAM systems integrate

Comments: Accepted by ISPRS Congress 2026

Abstract:RTK-SLAM systems integrate simultaneous localization and mapping (SLAM) with real-time kinematic (RTK) GNSS positioning, promising both relative consistency and globally referenced coordinates for efficient georeferenced surveying. A critical and underappreciated issue is that the standard evaluation metric, Absolute Trajectory Error (ATE), first fits an optimal rigid-body transformation between the estimated trajectory and reference before computing errors. This so-called SE(3) alignment absorbs global drift and systematic errors, making trajectories appear more accurate than they are in practice, and is unsuitable for evaluating the global accuracy of RTK-SLAM. We present a geodetically referenced dataset and evaluation methodology that expose this gap. A key design principle is that the RTK receiver is used solely as a system input, while ground truth is established independently via a geodetic total station. This separation is absent from all existing datasets, where GNSS typically serves as (part of) the ground truth. The dataset is collected with a handheld RTK-SLAM device, comprising two scenes. We evaluate LiDAR-inertial, visual-inertial, and LiDAR-visual-inertial RTK-SLAM systems alongside standalone RTK, reporting direct global accuracy and SE(3)-aligned relative accuracy to make the gap explicit. Results show that SE(3) alignment can underestimate absolute positioning error by up to 76%. RTK-SLAM achieves centimeter-level absolute accuracy in open-sky conditions and maintains decimeter-level global accuracy indoors, where standalone RTK degrades to tens of meters. The dataset, calibration files, and evaluation scripts are publicly available at this https URL.
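
The point this abstract makes about alignment hiding global error can be seen in a toy example. The sketch below is a planar (SE(2)) analogue of the SE(3) alignment discussed above, chosen because the 2D least-squares rigid fit has a closed form; it is not the paper's evaluation script, and the function names and trajectory values are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rmse(est: Sequence[Point], ref: Sequence[Point]) -> float:
    """Root-mean-square position error between paired points."""
    total = sum((ex - rx) ** 2 + (ey - ry) ** 2
                for (ex, ey), (rx, ry) in zip(est, ref))
    return math.sqrt(total / len(ref))


def align_rigid_2d(est: Sequence[Point], ref: Sequence[Point]) -> List[Point]:
    """Least-squares rotation + translation mapping est onto ref (no scale)."""
    n = len(est)
    ecx, ecy = sum(x for x, _ in est) / n, sum(y for _, y in est) / n
    rcx, rcy = sum(x for x, _ in ref) / n, sum(y for _, y in ref) / n
    s_dot = s_cross = 0.0
    for (ex, ey), (rx, ry) in zip(est, ref):
        px, py = ex - ecx, ey - ecy
        qx, qy = rx - rcx, ry - rcy
        s_dot += px * qx + py * qy
        s_cross += px * qy - py * qx
    theta = math.atan2(s_cross, s_dot)  # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (ex - ecx) - s * (ey - ecy) + rcx,
             s * (ex - ecx) + c * (ey - ecy) + rcy) for ex, ey in est]


# Estimated trajectory = reference plus a constant global offset of (3, -2).
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
estimate = [(x + 3.0, y - 2.0) for x, y in reference]

print("direct error :", rmse(estimate, reference))
print("aligned error:", rmse(align_rigid_2d(estimate, reference), reference))
```

The direct error reports the full offset, while the aligned error is essentially zero because the fitted transformation absorbs it. That is the gap between global and relative accuracy the abstract describes.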

26. 【2604.07146】Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

Link: https://arxiv.org/abs/2604.07146

Authors: Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao, Zhifang Liu, Xiangwen Deng, Pen Jiao, Haoqian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Knowledge-based visual question, Knowledge-based visual, visual question answering, requires vision-language models, requires vision-language

Comments:

Abstract:Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, and Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.

27. 【2604.07141】USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

Link: https://arxiv.org/abs/2604.07141

Authors: Changmiao Wang, Songqi Zhang, Yongquan Zhang, Yifei Wang, Liya Liu, Nannan Li, Xingzhi Li, Jiexin Pan, Yi Jiang, Xiang Wan, Hai Wang, Ahmed Elazab

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: creating personalized treatment, personalized treatment plans, stone disease ranks, Kidney stone disease, conditions in urology

Comments: Accepted by IEEE Journal of Biomedical and Health Informatics. Early Access

Abstract:Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: this https URL.

28. 【2604.07132】CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research

Link: https://arxiv.org/abs/2604.07132

Authors: Carlos Caetano, Camila Laranjeira, Clara Ernesto, Artur Barros, João Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Sexual Abuse Imagery, Child Sexual Abuse, Abuse Imagery, Sexual Abuse, Child Sexual

Comments: Conference on Computer Vision and Pattern Recognition (CVPR 2026), in the Workshop on Computer Vision for Children (CV4CHL)

Abstract:Child Sexual Abuse Imagery (CSAI) classification is an important yet challenging problem for computer vision research due to the strict legal and ethical restrictions that prevent the public sharing of CSAI datasets. This limitation hinders reproducibility and slows progress in developing automated methods. In this work, we introduce CSA-Graphs, a privacy-preserving structural dataset. Instead of releasing the original images, we provide structural representations that remove explicit visual content while preserving contextual information. CSA-Graphs includes two complementary graph-based modalities: scene graphs describing object relationships and skeleton graphs encoding human pose. Experiments show that both representations retain useful information for classifying CSAI, and that combining them further improves performance. This dataset enables broader research on computer vision methods for child safety while respecting legal and ethical constraints.

29. 【2604.07128】A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

Link: https://arxiv.org/abs/2604.07128

Authors: Chenhao Liu, Zelin Wen, Yan Tong, Junjie Zhu, Xinyu Tian, Yuchi Liu, Ashu Gupta, Syed M. S. Islam, Tom Gedeon, Yue Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: developing robust medical, medical AI systems, critical for developing, developing robust, robust medical

Comments:

Abstract:Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.

30. 【2604.07122】Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator

链接：https://arxiv.org/abs/2604.07122

作者：Takahiro Mano,Reiji Saito,Kazuhiro Hotta

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：incurs significant costs, unlabeled images, training data incurs, data incurs significant, images

备注：

点击查看摘要

Abstract:In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.
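
The first proposal, pasting ground-truth regions from a labeled image onto an unlabeled image and its pseudo-label, can be sketched with NumPy as follows. The function name `supervised_classmix` and the explicit `classes` argument are assumptions for illustration; the abstract does not say how the pasted classes are chosen, and the feature discriminator is not shown.

```python
import numpy as np

def supervised_classmix(labeled_img, labeled_mask, unlabeled_img,
                        pseudo_mask, classes):
    """Paste pixels of the selected classes from a labeled image onto
    an unlabeled image, and the matching ground-truth labels onto its
    pseudo-label map.

    Hypothetical sketch: images are (H, W, C) arrays, masks are (H, W)
    integer class maps, and `classes` lists the class ids to paste.
    """
    # Boolean (H, W) map of pixels that belong to the pasted classes.
    paste = np.isin(labeled_mask, classes)
    # Broadcast the pixel mask over the channel axis for the images.
    mixed_img = np.where(paste[..., None], labeled_img, unlabeled_img)
    mixed_mask = np.where(paste, labeled_mask, pseudo_mask)
    return mixed_img, mixed_mask

# Tiny 2x2 example with made-up values.
labeled_img = np.ones((2, 2, 3))
unlabeled_img = np.zeros((2, 2, 3))
labeled_mask = np.array([[1, 0], [2, 0]])
pseudo_mask = np.full((2, 2), 3)
mixed_img, mixed_mask = supervised_classmix(
    labeled_img, labeled_mask, unlabeled_img, pseudo_mask, classes=[1]
)
print(mixed_mask)
```

Because the pasted labels come from ground truth rather than predictions, the pasted region of the mixed label map carries no pseudo-label noise.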

31. 【2604.07120】Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment

链接：https://arxiv.org/abs/2604.07120

作者：Parampuneet Kaur Thind,Charles Mwangi,Giovanni Varetto,Lorenzo Sarti,Andrea Papa,Andrea Taramelli

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)

关键词：European Forest Fire, Copernicus Land Monitoring, Forest Fire Information, Land Monitoring Service, European Forest

备注：

点击查看摘要

Abstract:Current operational Earth Observation (EO) services, including the Copernicus Emergency Management Service (CEMS), the European Forest Fire Information System (EFFIS), and the Copernicus Land Monitoring Service (CLMS), rely primarily on ground-based processing pipelines. While these systems provide mature large-scale information products, they remain constrained by downlink latency, bandwidth limitations, and limited capability for autonomous observation prioritisation. The International Report for an Innovative Defence of Earth (IRIDE) programme is a national Earth observation initiative led by the Italian government to support public authorities through timely, objective information derived from spaceborne data. Rather than a single constellation, IRIDE is designed as a constellation of constellations, integrating heterogeneous sensing technologies within a unified service-oriented architecture. Within this framework, Hawk for Earth Observation (HEO) enables onboard generation of data products, allowing information extraction earlier in the processing chain. This paper examines the limitations of ground-only architectures and evaluates the added value of onboard processing at the operational service level. The IRIDE burnt-area mapping service is used as a representative case study to demonstrate how onboard intelligence can support higher spatial detail (sub-three-metre ground sampling distance), smaller detectable events (minimum mapping unit of three hectares), and improved system responsiveness. Rather than replacing existing Copernicus services, the IRIDE HEO capability is positioned as a complementary layer providing image-driven pre-classification to support downstream emergency and land-management workflows. This work highlights the operational value of onboard intelligence for emerging low-latency EO service architectures.

32. 【2604.07101】SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

链接：https://arxiv.org/abs/2604.07101

作者：Qizhou Wang,Guansong Pang,Christopher Leckie

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

关键词：Image Test Range, Test Range, Forgery Image Test, falsifying visual evidence, open-access image generation

备注：

点击查看摘要

Abstract:We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

33. 【2604.07097】Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples

链接：https://arxiv.org/abs/2604.07097

作者：Reiji Saito,Satoshi Kamiya,Kazuhiro Hotta

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：training data consist, conventional anomaly detection, training data, data consist, normal samples

备注： Accepted by CVPR 2026 Workshop

点击查看摘要

Abstract:In conventional anomaly detection, training data consist of only normal samples. However, in real-world scenarios, the definition of a normal sample is often ambiguous. For example, there are cases where a sample has small scratches or stains but is still acceptable for practical usage. On the other hand, higher precision is required when manufacturing equipment is upgraded. In such cases, normal samples may include small scratches, tiny dust particles, or a foreign object that we would prefer to classify as an anomaly. Such cases frequently occur in industrial settings, yet they have not been discussed until now. Thus, we propose novel scenarios and an evaluation metric to accommodate specification changes in real-world applications. Furthermore, to address the ambiguity of normal samples, we propose RePaste, which enhances learning by re-pasting regions with high anomaly scores from the previous step into the input for the next step. On our scenarios using the MVTec AD benchmark, RePaste achieved state-of-the-art performance with respect to the proposed evaluation metric, while maintaining high AUROC and PRO scores. Code: this https URL

34. 【2604.07092】Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data

链接：https://arxiv.org/abs/2604.07092

作者：Mojgan Madadikhaljan,Jonathan Prexl,Isabelle Wittmann,Conrad M Albrecht,Michael Schmitt

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：spaceborne Earth observation, multi-temporal spaceborne Earth, Earth observation, spatiotemporal neural field, models multi-temporal spaceborne

备注：

点击查看摘要

Abstract:In this work, we present LIANet (Location Is All You Need Network), a coordinate-based neural representation that models multi-temporal spaceborne Earth observation (EO) data for a given region of interest as a continuous spatiotemporal neural field. Given only spatial and temporal coordinates, LIANet reconstructs the corresponding satellite imagery. Once pretrained, this neural representation can be adapted to various EO downstream tasks, such as semantic segmentation or pixel-wise regression, importantly, without requiring access to the original satellite data. LIANet intends to serve as a user-friendly alternative to Geospatial Foundation Models (GFMs) by eliminating the overhead of data access and preprocessing for end-users and enabling fine-tuning solely based on labels. We demonstrate the pretraining of LIANet across target areas of varying sizes and show that fine-tuning it for downstream tasks achieves competitive performance compared to training from scratch or using established GFMs. The source code and datasets are publicly available at this https URL.

35. 【2604.07053】AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors

链接：https://arxiv.org/abs/2604.07053

作者：Xiaoxue Zhang,Xiaoxu Zheng,Yixuan Yin,Tiao Zhao,Kaihua Tang,Michael Bi Mi,Zhan Xu,Dave Zhenyu Chen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent feed-forward Gaussian, reconstruction models adopt, entangling Gaussian representations, Recent feed-forward, Gaussian representations tightly

备注：

点击查看摘要

Abstract:Recent feed-forward Gaussian reconstruction models adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, entangling Gaussian representations tightly with the input images. In this paper, we propose AnchorSplat, a novel feed-forward 3DGS framework for scene-level reconstruction that represents the scene directly in 3D space. AnchorSplat introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling more geometry-aware renderable 3D Gaussians that are independent of image resolution and number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. Beyond the anchor-aligned design, we utilize a Gaussian Refiner to adjust the intermediate Gaussians via merely a few forward passes. Experiments on the ScanNet++ v2 NVS benchmark demonstrate SOTA performance, outperforming previous methods with more view-consistent results and substantially fewer Gaussian primitives.

36. 【2604.07048】PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing

链接：https://arxiv.org/abs/2604.07048

作者：Chengyu Fang,Chunming He,Yuelin Zhang,Chubin Chen,Chenyang Zhu,Longxiang Tang,Xiu Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Real-world image dehazing, remove haze induced, haze induced degradation, image dehazing, aims to remove

备注： 24 Pages, 7 Figures

点击查看摘要

Abstract:Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying illumination from multiple light sources, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattered Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, thereby improving reliability in complex regions and mixed-light conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Extensive experiments on real-world benchmarks demonstrate that PRISM achieves state-of-the-art performance on RID tasks.

37. 【2604.07034】KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

链接：https://arxiv.org/abs/2604.07034

作者：Mehdi Hosseinzadeh,King Hang Wong,Feras Dayoub

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：converts long robot-execution, long robot-execution videos, interpretable tokenized evidence, videos into compact, converts long

备注： ICRA 2026; Project page: [this https URL](https://m80hz.github.io/kite/)

点击查看摘要

Abstract:We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: this https URL

38. 【2604.07026】Not all tokens contribute equally to diffusion learning

链接：https://arxiv.org/abs/2604.07026

作者：Guoqing Zhang,Lu Shi,Wanru Xu,Linna Zhang,Sen Wang,Fangfang Wang,Yigang Cen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：semantically important tokens, semantically important, rapid development, important tokens, spatial

备注：

点击查看摘要

Abstract:With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.

39. 【2604.07021】ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation

链接：https://arxiv.org/abs/2604.07021

作者：Qingze He,Fagui Liu,Dengke Zhang,Qingmao Wei,Quan Tang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieve pixel-level predictions, image-level labels, Weakly supervised semantic, pixel-level predictions, predictions using image-level

备注：

点击查看摘要

Abstract:Weakly supervised semantic segmentation aims to achieve pixel-level predictions using image-level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo-label noise and often relying on time-consuming multi-stage retraining or unstable end-to-end joint optimization. To address the above challenges, we present ModuSeg, a training-free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non-parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft-masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high-quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at this https URL.

40. 【2604.07010】Synthetic Dataset Generation for Partially Observed Indoor Objects

链接：https://arxiv.org/abs/2604.07010

作者：Jelle Vermandere,Maarten Bassier,Maarten Vergauwen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：completion require large, require large datasets, require large, object completion require, partial scans paired

备注：

点击查看摘要

Abstract:Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the V-Scan dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.

41. 【2604.07000】IQ-LUT: interpolated and quantized LUT for efficient image super-resolution

链接：https://arxiv.org/abs/2604.07000

作者：Yuxuan Zhang,Zhikai Dong,Xinning Chai,Xiangyun Zhou,Yi Xu,Zhengxue Cheng,Li Song

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Lookup table, methods demonstrate considerable, image super-resolution inference, demonstrate considerable potential, accelerating image super-resolution

备注：

点击查看摘要

Abstract:Lookup table (LUT) methods demonstrate considerable potential in accelerating image super-resolution inference. However, pursuing higher image quality through larger receptive fields and bit-depth triggers exponential growth in the LUT's index space, creating a storage bottleneck that limits deployment on resource-constrained devices. We introduce IQ-LUT, which achieves a reduction in LUT size while simultaneously enhancing super-resolution quality. First, we integrate interpolation and quantization into the single-input, multiple-output ECNN, which dramatically reduces the index space and thereby the overall LUT size. Second, the integration of residual learning mitigates the dependence on LUT bit-depth, which facilitates training stability and prioritizes the reconstruction of fine-grained details for superior visual quality. Finally, guided by knowledge distillation, our non-uniform quantization process optimizes the quantization levels, thereby reducing storage while also compensating for quantization loss. Extensive benchmarking demonstrates our approach substantially reduces storage costs (by up to 50x compared to ECNN) while achieving superior super-resolution quality.
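
The exponential growth of the index space mentioned above follows from simple counting: a table indexed by n input pixels at b bits each has (2^b)^n entries. A small sketch of that arithmetic follows. The helper names and the example configuration (4 input pixels, 16 one-byte outputs per entry) are illustrative assumptions, not IQ-LUT's or ECNN's actual settings.

```python
def lut_entries(num_input_pixels: int, bits_per_pixel: int) -> int:
    """Number of index combinations for a LUT addressed by
    `num_input_pixels` values of `bits_per_pixel` bits each."""
    return (2 ** bits_per_pixel) ** num_input_pixels

def lut_bytes(num_input_pixels: int, bits_per_pixel: int,
              outputs_per_entry: int, bytes_per_output: int = 1) -> int:
    """Total storage if every entry holds `outputs_per_entry` values."""
    entries = lut_entries(num_input_pixels, bits_per_pixel)
    return entries * outputs_per_entry * bytes_per_output

# Illustrative configuration (an assumption, not taken from the paper):
# 4 input pixels, 16 one-byte outputs per entry.
full_depth = lut_bytes(4, 8, 16)    # 8-bit indices
reduced_depth = lut_bytes(4, 4, 16) # 4-bit (quantized) indices

print(full_depth / 2**30, "GiB")    # 64.0 GiB
print(reduced_depth / 2**20, "MiB") # 1.0 MiB
```

Halving the index bit-depth in this toy setting shrinks the table by a factor of 2^16, which shows why quantizing the index space is such an effective lever on storage.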

42. 【2604.06989】Generative Photomosaic with Structure-Aligned and Personalized Diffusion

链接：https://arxiv.org/abs/2604.06989

作者：Jaeyoung Chung,Hyunjin Son,Kyoung Mu Lee

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：photomosaic creation, Abstract, photomosaic, images, generative approach

备注： Project page: [this https URL](https://robot0321.github.io/GenerativePhotomosaic/index.html)

点击查看摘要

Abstract:We present the first generative approach to photomosaic creation. Traditional photomosaic methods rely on a large number of tile images and color-based matching, which limits both diversity and structural consistency. Our generative photomosaic framework synthesizes tile images using diffusion-based generation conditioned on reference images. A low-frequency conditioned diffusion mechanism aligns global structure while preserving prompt-driven details. This generative formulation enables photomosaic composition that is both semantically expressive and structurally coherent, effectively overcoming the fundamental limitations of matching-based approaches. By leveraging few-shot personalized diffusion, our model is able to produce user-specific or stylistically consistent tiles without requiring an extensive collection of images.

43. 【2604.06988】Canopy Tree Height Estimation Using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing

链接：https://arxiv.org/abs/2604.06988

作者：Karsten Schrödter,Jan Pauls,Fabian Gieseke

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Accurate tree height, tree height estimation, Accurate tree, tree height, height estimation

备注： Accepted to AISTATS 2026

点击查看摘要

Abstract:Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches for tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that, with minor modifications of a given prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.
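
Quantile regression is commonly trained with the pinball (quantile) loss, and calibration is often checked via the empirical coverage of the predicted intervals. The sketch below shows both. It is a generic NumPy illustration; the abstract does not specify the loss, quantile levels, or prediction-head changes, so the function names and example values are assumptions.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball loss for one quantile level in (0, 1).

    Under-prediction is weighted by `quantile` and over-prediction by
    `1 - quantile`, so the minimizer is the conditional quantile.
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))

def interval_coverage(y_true, lower, upper):
    """Fraction of targets falling inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(np.mean(inside))

# Made-up tree heights in metres, for illustration only.
under = pinball_loss([10.0], [8.0], quantile=0.9)   # under-prediction by 2 m
over = pinball_loss([10.0], [12.0], quantile=0.9)   # over-prediction by 2 m
coverage = interval_coverage([1.0, 2.0, 3.0, 4.0],
                             lower=[0.0] * 4, upper=[2.0] * 4)
print(under, over, coverage)
```

For a well-calibrated model, an interval between the 5% and 95% quantile predictions should cover roughly 90% of held-out targets.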

44. 【2604.06987】CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models

链接：https://arxiv.org/abs/2604.06987

作者：Renyang Liu,Jiale Li,Jie Zhang,Cong Wu,Xiaojun Jia,Shuxin Li,Wei Zhou,Kwok-Yan Lam,See-kiong Ng

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词：including access control, Palmprint recognition, palmprint recognition systems, deep palmprint recognition, security-critical applications

备注：

点击查看摘要

Abstract:Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital setting and do not adequately account for the texture-dominant nature of palmprint recognition or the distortions introduced during physical acquisition. To address this gap, we propose CAAP, a capture-aware adversarial patch framework for palmprint recognition. CAAP learns a universal patch that can be reused across inputs while remaining effective under realistic acquisition variation. To match the structural characteristics of palmprints, the framework adopts a cross-shaped patch topology, which enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long-range texture continuity. CAAP further integrates three modules: ASIT for input-conditioned patch rendering, RaS for stochastic capture-aware simulation, and MS-DIFE for feature-level identity-disruptive guidance. We evaluate CAAP on the Tongji, IITD, and AISEC datasets against generic CNN backbones and palmprint-specific recognition models. Experiments show that CAAP achieves strong untargeted and targeted attack performance with favorable cross-model and cross-dataset transferability. The results further show that, although adversarial training can partially reduce the attack success rate, substantial residual vulnerability remains. These findings indicate that deep palmprint recognition systems remain vulnerable to physically realizable, capture-aware adversarial patch attacks, underscoring the need for more effective defenses in practice. Code available at this https URL.

45. 【2604.06966】MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

链接：https://arxiv.org/abs/2604.06966

作者：Xiaoxiao Ma,Jiachen Lei,Tianfei Ren,Jie Huang,Siming Fu,Aiming Hao,Jiahong Wu,Xiangxiang Chu,Feng Zhao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Reinforcement learning, successfully applied, Reinforcement, masked autoregressive models, models

备注：

点击查看摘要

Abstract:Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: this https URL.
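
One possible reading of the multi-trajectory idea is sketched below: per-token estimates are averaged across several diffusion trajectories, but only for the tokens whose estimates disagree most. This is an interpretation, not the authors' implementation. Using variance as the uncertainty measure, falling back to the first trajectory for the remaining tokens, and the name `multi_trajectory_estimate` are all assumptions.

```python
import numpy as np

def multi_trajectory_estimate(per_trajectory, top_k_percent):
    """Combine per-token estimates from several trajectories.

    `per_trajectory` has shape (num_trajectories, num_tokens). Tokens
    in the top-k% by cross-trajectory variance use the trajectory
    mean; all other tokens keep the first trajectory's value.
    (Assumed design, chosen for illustration.)
    """
    per_trajectory = np.asarray(per_trajectory, dtype=float)
    num_tokens = per_trajectory.shape[1]
    # Token-wise uncertainty: variance across trajectories.
    uncertainty = per_trajectory.var(axis=0)
    k = max(1, int(np.ceil(num_tokens * top_k_percent / 100.0)))
    uncertain_idx = np.argsort(-uncertainty)[:k]
    estimate = per_trajectory[0].copy()
    estimate[uncertain_idx] = per_trajectory[:, uncertain_idx].mean(axis=0)
    return estimate, np.sort(uncertain_idx)

# Three trajectories, four tokens; only token 1 disagrees.
samples = [[1.0, 0.0, 5.0, 2.0],
           [1.0, 4.0, 5.0, 2.0],
           [1.0, 2.0, 5.0, 2.0]]
estimate, selected = multi_trajectory_estimate(samples, top_k_percent=25)
print(estimate, selected)
```

Restricting the averaging to the most uncertain tokens matches the stated goal of reducing diffusion-induced noise without over-smoothing the rest.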

46. 【2604.06961】Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

链接：https://arxiv.org/abs/2604.06961

作者：Pablo Parte,Roberto Valle,José M. Buenaposada,Luis Baumela

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：interpret human behavior, human-robot interaction critically, interaction critically depends, human behavior, human-robot interaction

备注：

点击查看摘要

Abstract:Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher biases observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

47. 【2604.06954】Compression as an Adversarial Amplifier Through Decision Space Reduction

链接：https://arxiv.org/abs/2604.06954

作者：Lewis Evans,Harkrishan Jandu,Zihan Ye,Yang Lu,Shreyank N Gowda

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：modern visual pipelines, social media platforms, resource-constrained systems prior, visual pipelines, prior to inference

备注：

点击查看摘要

Abstract:Image compression is a ubiquitous component of modern visual pipelines, routinely applied by social media platforms and resource-constrained systems prior to inference. Despite its prevalence, the impact of compression on adversarial robustness remains poorly understood. We study a previously unexplored adversarial setting in which attacks are applied directly in compressed representations, and show that compression can act as an adversarial amplifier for deep image classifiers. Under identical nominal perturbation budgets, compression-aware attacks are substantially more effective than their pixel-space counterparts. We attribute this effect to decision space reduction, whereby compression induces a non-invertible, information-losing transformation that contracts classification margins and increases sensitivity to perturbations. Extensive experiments across standard benchmarks and architectures support our analysis and reveal a critical vulnerability in compression-in-the-loop deployment settings. Code will be released.

48. 【2604.06950】Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

链接：https://arxiv.org/abs/2604.06950

作者：Zhiheng Li,Zongyang Ma,Yuntong Pan,Ziqi Zhang,Xiaolei Lv,Bo Li,Jun Gao,Jianing Zhang,Chunfeng Yuan,Bing Li,Weiming Hu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

备注： Accepted to ACL 2026. 19 pages, 6 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at this https URL.

49. 【2604.06945】NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results

Link: https://arxiv.org/abs/2604.06945

Authors: Wenbin Zou,Tianyi Li,Kejun Wu,Huiping Zhuang,Zongwei Wu,Zhuyun Zhou,Radu Timofte,Kim-Hui Yap,Lap-Pui Chau,Yi Wang,Shiqi Zhou,Xiaodi Shi,Yuxiang Chen,Yilian Zhong,Shibo Yin,Yushun Fang,Xilei Zhu,Yahui Wang,Chen Lu,Zhitao Wang,Lifa Ha,Hengyu Man,Xiaopeng Fan,Priyansh Singh,Sidharth,Krrish Dev,Soham Kakkar,Vinit Jakhetiya,Ovais Iqbal Shah,Wei Zhou,Linfeng Li,Qi Xu,Zhenyang Liu,Kepeng Xu,Tong Qiao,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper reports, Bitstream-Corrupted Video Restoration, Bitstream-Corrupted Video, Video Restoration, BSCVR

Comments: 15 pages, 8 figures, 1 table, CVPRW2026 NTIRE Challenge Report

Click to view abstract

Abstract:This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.

50. 【2604.06939】Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

Link: https://arxiv.org/abs/2604.06939

Authors: Jintao Chen,Chengyu Bai,Junjun Hu,Xinda Xue,Mu Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interactive instruction switching, Autoregressive video synthesis, Autoregressive video, intertwined challenges, context limitations

Comments:

Click to view abstract

Abstract:Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.

51. 【2604.06938】POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Link: https://arxiv.org/abs/2604.06938

Authors: Jiyun Won,Heemin Yang,Woohyeok Kim,Jungseul Ok,Sunghyun Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image signal processing, explored optimizing image, optimizing image signal, Recent work, composing predefined modules

Comments:

Click to view abstract

Abstract:Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at this https URL

52. 【2604.06934】Multi-modal user interface control detection using cross-attention

Link: https://arxiv.org/abs/2604.06934

Authors: Milad Moradi,Ke Yan,David Colwell,Matthias Samwald,Rhona Asgari

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains challenging due, Detecting user interface, design variability, user interface, pixel-only approaches

Comments:

Click to view abstract

Abstract:Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
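The three fusion strategies named in this abstract can be illustrated with a minimal sketch. It assumes the visual and text features have already been aligned to the same length at one spatial position; the function names, the `alpha` mixing weight, and the weight matrix are hypothetical and are not taken from the paper, and the "convolutional" variant is approximated here as a 1x1 convolution over the concatenated channels.

```python
from typing import List

Vector = List[float]


def fuse_add(visual: Vector, text: Vector) -> Vector:
    """Element-wise addition of two equal-length feature vectors."""
    return [v + t for v, t in zip(visual, text)]


def fuse_weighted(visual: Vector, text: Vector, alpha: float = 0.7) -> Vector:
    """Weighted sum; alpha is a hypothetical weight on the visual branch."""
    return [alpha * v + (1.0 - alpha) * t for v, t in zip(visual, text)]


def fuse_conv1x1(visual: Vector, text: Vector,
                 weight: List[Vector], bias: Vector) -> Vector:
    """1x1-convolution-style fusion at a single position.

    The two feature vectors are concatenated along the channel axis and
    projected by a (C_out x 2C) weight matrix plus a bias (assumed form).
    """
    stacked = visual + text  # list concatenation: length 2C
    return [sum(w * x for w, x in zip(row, stacked)) + b
            for row, b in zip(weight, bias)]
```

With an identity-like weight matrix the 1x1 projection reduces to element-wise addition, which is one way to see that it is the most expressive of the three options.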

53. 【2604.06916】FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Link: https://arxiv.org/abs/2604.06916

Authors: Yitong Li,Junsong Chen,Shuchen Xue,Pengcuo Zeren,Siyuan Fu,Dinghao Yang,Yangyang Tang,Junjie Bai,Ping Luo,Song Han,Enze Xie

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: post-training has recently, paradigm for aligning, human preferences, recently emerged, promising paradigm

Comments:

Click to view abstract

Abstract:Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
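The two-stage idea in this abstract (cheap rollouts to explore, precise rollouts only for a selected subset) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the "highly contrastive subset" is chosen, so the sketch assumes it means the lowest- and highest-reward candidates, and `cheap_rollout`, `precise_rollout`, and `reward` are hypothetical callables standing in for FP4 sampling, BF16 sampling, and a reward model.

```python
from typing import Any, Callable, List, Sequence, Tuple


def select_contrastive(rewards: Sequence[float], k: int) -> List[int]:
    """Return indices of the k lowest- and k highest-reward candidates.

    Assumed selection rule: keep both extremes of the reward distribution.
    """
    if k < 1 or 2 * k > len(rewards):
        raise ValueError("k must satisfy 1 <= k <= len(rewards) // 2")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k] + order[-k:]


def two_stage_step(seeds: Sequence[Any],
                   cheap_rollout: Callable[[Any], Any],
                   precise_rollout: Callable[[Any], Any],
                   reward: Callable[[Any], float],
                   k: int) -> List[Tuple[Any, Any]]:
    """Explore with cheap rollouts, then regenerate the chosen seeds precisely."""
    # Stage 1: score every candidate produced by the low-precision sampler.
    cheap_rewards = [reward(cheap_rollout(s)) for s in seeds]
    chosen = select_contrastive(cheap_rewards, k)
    # Stage 2: only the selected seeds are regenerated at high precision;
    # these samples are what a policy update would consume.
    return [(seeds[i], precise_rollout(seeds[i])) for i in chosen]
```

The point of the split is that the expensive sampler runs on only 2k seeds regardless of how large the exploration pool is.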

54. 【2604.06912】Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.06912

Authors: Yuheng Shi,Xiaohuan Pei,Linfeng Wen,Minjing Dong,Chang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: MLLMs require high-resolution, require high-resolution visual, high-resolution visual inputs, MLLMs require, visual inputs

Comments: 16 pages, 9 figures

Click to view abstract

Abstract:MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at this https URL.

55. 【2604.06901】XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI

Link: https://arxiv.org/abs/2604.06901

Authors: N.D. Tantaroudas,A.J. McCracken,I. Karachalios,E. Papatheou,V. Pastrikakis

Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET)

Keywords: Conventional career guidance, guidance platforms rely, career guidance platforms, career guidance, Conventional career

Comments: 21

Click to view abstract

Abstract:Conventional career guidance platforms rely on static, text-driven interfaces that struggle to engage users or deliver personalised, evidence-based insights. Although Computer-Assisted Career Guidance Systems have evolved since the 1960s, they remain limited in interactivity and pay little attention to the narrative dimensions of career development. We introduce XR-CareerAssist, a platform that unifies Extended Reality (XR) with several Artificial Intelligence (AI) modules to deliver immersive, multilingual career guidance. The system integrates Automatic Speech Recognition for voice-driven interaction, Neural Machine Translation across English, Greek, French, and Italian, a Langchain-based conversational Training Assistant for personalised dialogue, a BLIP-based Vision-Language model for career visualisations, and AWS Polly Text-to-Speech delivered through an interactive 3D avatar. Career trajectories are rendered as dynamic Sankey diagrams derived from a repository of more than 100,000 anonymised professional profiles. The application was built in Unity for Meta Quest 3, with backend services hosted on AWS. A pilot evaluation at the University of Exeter with 23 participants returned 95.6% speech recognition accuracy, 78.3% overall user satisfaction, and 91.3% favourable ratings for system responsiveness, with feedback informing subsequent improvements to motion comfort, audio clarity, and text legibility. XR-CareerAssist demonstrates how the fusion of XR and AI can produce more engaging, accessible, and effective career development tools, with the integration of five AI modules within a single immersive environment yielding a multimodal interaction experience that distinguishes it from existing career guidance platforms.

56. 【2604.06893】Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

Link: https://arxiv.org/abs/2604.06893

Authors: Tom Devynck,Bilal Faye,Djamel Bouchaffra,Nadjib Lazaar,Hanane Azzag,Mustapha Lebbah

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: spurious background correlations, Deep convolutional neural, achieve remarkable performance, exhaustively processing dense, brute-force strategy introduces

Comments:

Click to view abstract

Abstract:Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.

57. 【2604.06885】Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

Link: https://arxiv.org/abs/2604.06885

Authors: Sambit Tarai,Ashish Chauhan,Elin Lundström,Johan Öfverstedt,Therese Sjöholm,Veronica Sanchez Rodriguez,Håkan Ahlström,Joel Kullberg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: personalized treatment planning, Cell Lung Cancer, improving patient prognostics, Automated medical image-based, medical image-based prediction

Comments: Under review

Click to view abstract

Abstract:Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provided an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.

Submission history: From Ashish Chauhan, [v1] Wed, 8 Apr 2026 09:43:30 UTC (2,246 KB)

58. 【2604.06883】SCT-MOT: Enhancing Air-to-Air Multiple UAVs Tracking with Swarm-Coupled Motion and Trajectory Guidance

Link: https://arxiv.org/abs/2604.06883

Authors: Zhaochen Chu,Tao Song,Ren Jin,Shaoming He,Defu Lin,Siqing Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: UAVs presents significant, presents significant challenges, significant challenges due, swarm UAVs presents, detection failures

Comments: 17 pages, 7 figures. Under review at IEEE Transactions on Aerospace and Electronic Systems (TAES). This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Air-to-air tracking of swarm UAVs presents significant challenges due to the complex nonlinear group motion and weak visual cues for small objects, which often cause detection failures, trajectory fragmentation, and identity switches. Although existing methods have attempted to improve performance by incorporating trajectory prediction, they model each object independently, neglecting the swarm-level motion dependencies. Their limited integration between motion prediction and appearance representation also weakens the spatio-temporal consistency required for tracking in visually ambiguous and cluttered environments, making it difficult to maintain coherent trajectories and reliable associations. To address these challenges, we propose SCT-MOT, a tracking framework that integrates Swarm-Coupled motion modeling and Trajectory-guided feature fusion. First, we develop a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, enabling more accurate forecasting of the nonlinear, coupled group trajectories. Second, we design a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current frame features, enhancing temporal consistency and spatial discriminability for weak objects. Extensive experiments on three public air-to-air swarm UAV tracking datasets, including AIRMOT, MOT-FLY, and UAVSwarm, demonstrate that SMTP achieves more accurate trajectory forecasts and yields a 1.21% IDF1 improvement over the state-of-the-art trajectory prediction module EqMotion when integrated into the same MOT framework. Overall, our SCT-MOT consistently achieves superior accuracy and robustness compared to state-of-the-art trackers across multiple metrics under complex swarm scenarios.

59. 【2604.06870】RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Link: https://arxiv.org/abs/2604.06870

Authors: Dewei Zhou,You Li,Zongxin Yang,Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pixels strictly unchanged, non-edited pixels strictly, restore fine-grained details, dedicated problem setting, region-specific image refinement

Comments: 18 pages

Click to view abstract

Abstract:We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: this https URL.

60. 【2604.06865】Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible-Infrared Evasion

Link: https://arxiv.org/abs/2604.06865

Authors: Miguel A. DelaCruz,Patricia Mae Santos,Rafael T. Navarro

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: resemble deployed surveillance, increasingly studied, resemble deployed, Physical adversarial attacks, deployed surveillance systems

Comments:

Click to view abstract

Abstract:Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible-infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible-infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.

61. 【2604.06849】Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI

Link: https://arxiv.org/abs/2604.06849

Authors: Fangmao Ju,Yuzhu He,Zhiwen Xue,Chunfeng Lian,Jianhua Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic Resonance Imaging, Magnetic Resonance, long acquisition times, Resonance Imaging, acquisition times

Comments:

Click to view abstract

Abstract:Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific $k$-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.

62. 【2604.06844】CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery

Link: https://arxiv.org/abs/2604.06844

Authors: Jiajun Yang,Keyan Chen,Zhengxia Zou,Zhenwei Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly challenging problem, remote sensing imagery, Cloud detection, challenging problem, remote sensing

Comments:

Click to view abstract

Abstract:Cloud detection in remote sensing imagery is a fundamental, critical, and highly challenging problem. Existing deep learning-based cloud detection methods generally formulate it as a single-stage pixel-wise binary segmentation task with one forward pass. However, such single-stage approaches exhibit ambiguity and uncertainty in thin-cloud regions and struggle to accurately handle fragmented clouds and boundary details. In this paper, we propose a novel deep learning framework termed CloudMamba. To address the ambiguity in thin-cloud regions, we introduce an uncertainty-guided two-stage cloud detection strategy. An embedded uncertainty estimation module is proposed to automatically quantify the confidence of thin-cloud segmentation, and a second-stage refinement segmentation is introduced to improve the accuracy in low-confidence hard regions. To better handle fragmented clouds and fine-grained boundary details, we design a dual-scale Mamba network based on a CNN-Mamba hybrid architecture. Compared with Transformer-based models with quadratic computational complexity, the proposed method maintains linear computational complexity while effectively capturing both large-scale structural characteristics and small-scale boundary details of clouds, enabling accurate delineation of overall cloud morphology and precise boundary segmentation. Extensive experiments conducted on the GF1_WHU and Levir_CS public datasets demonstrate that the proposed method outperforms existing approaches across multiple segmentation accuracy metrics, while offering high efficiency and process transparency. Our code is available at this https URL.
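The uncertainty-guided two-stage strategy in this abstract can be illustrated with a small sketch of how low-confidence pixels might be picked out for a second pass. The abstract does not describe the form of the uncertainty estimation module, so this sketch assumes binary entropy of the stage-one cloud probability as the uncertainty measure; the function names and the 0.8 threshold are hypothetical.

```python
import math
from typing import List


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 0 for fully confident p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertain_mask(probs: List[List[float]],
                   threshold: float = 0.8) -> List[List[bool]]:
    """Mark pixels whose stage-one cloud probability is too ambiguous.

    Pixels flagged True would be handed to a second-stage refinement pass;
    the rest keep their stage-one label.
    """
    return [[binary_entropy(p) > threshold for p in row] for row in probs]
```

For example, a pixel with cloud probability 0.5 has entropy 1 bit and is flagged, while a pixel at 0.99 is left to the first-stage prediction.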

63. 【2604.06830】VGGT-SLAM++

Link: https://arxiv.org/abs/2604.06830

Authors: Avilasha Mandal,Rajesh Kumar,Sudarshan Sunil Harithas,Chetan Arora

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Geometry Grounded Transformer, Visual Geometry Grounded, Geometry Grounded, Digital Elevation Map, complete visual SLAM

Comments: 8 pages (main paper) + supplementary material. Accepted at CVPR 2026 Workshop (VOCVALC)

Click to view abstract

Abstract:We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end that jointly enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.

64. 【2604.06825】RePL: Pseudo-label Refinement for Semi-supervised LiDAR Semantic Segmentation

Link: https://arxiv.org/abs/2604.06825

Authors: Donghyeon Kwon,Taegyu Park,Suha Kwak

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: confirmation bias caused, Semi-supervised learning, LiDAR semantic segmentation, propagation and confirmation, confirmation bias

Comments:

Click to view abstract

Abstract:Semi-supervised learning for LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo-labels. To tackle this chronic issue, we introduce RePL, a novel framework that enhances pseudo-label quality by identifying and correcting potential errors in pseudo-labels through masked reconstruction, along with a dedicated training strategy. We also provide a theoretical analysis demonstrating the condition under which the pseudo-label refinement is beneficial, and empirically confirm that the condition is mild and clearly met by RePL. Extensive evaluations on the nuScenes-lidarseg and SemanticKITTI datasets show that RePL substantially improves pseudo-label quality and, as a result, achieves the state of the art in LiDAR semantic segmentation.

65. 【2604.06824】Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Link: https://arxiv.org/abs/2604.06824

Authors: Subin Park, Jung Uk Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localization task aims, Sound source localization, source localization task, visual modalities, Large Language Models

Comments: Accepted to CVPR 2026

Abstract:The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at this https URL.

66. 【2604.06795】FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift

Link: https://arxiv.org/abs/2604.06795

Authors: Huy Q. Le, Loc X. Nguyen, Yu Qiao, Seong Tae Kim, Eui-Nam Huh, Choong Seon Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: exposing private data, making it ideal, privacy-sensitive applications, exposing private, ideal for privacy-sensitive

Comments: Accepted at CVPR 2026

Abstract:Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning, which leverages class-wise feature representations, has emerged as a promising solution. Yet, existing prototype-based FL methods face two key limitations: (1) They typically construct a $\textit{single global prototype}$ per class by aggregating local prototypes from all clients without preserving domain information. (2) Their feature-prototype alignment is $\textit{domain-agnostic}$, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets (DomainNet, Office-10, and PACS) to demonstrate the effectiveness of our proposed framework in addressing domain shift. The code is available at this https URL.
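
As an illustration of the similarity-weighted fusion idea in the FedDAP abstract, here is a minimal sketch in plain Python. The abstract does not give the weighting rule, so this version assumes cosine similarity to the unweighted mean prototype followed by a softmax. The names `fuse_domain_prototypes` and `temperature` are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def fuse_domain_prototypes(local_protos, temperature=1.0):
    """Fuse the local prototypes of one class from clients in the same domain.

    Assumption (not from the paper): each weight is a softmax over the
    prototype's cosine similarity to the unweighted mean prototype.
    """
    dim = len(local_protos[0])
    mean = [sum(p[i] for p in local_protos) / len(local_protos) for i in range(dim)]
    sims = [cosine(p, mean) / temperature for p in local_protos]
    top = max(sims)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * p[i] for w, p in zip(weights, local_protos)) for i in range(dim)]
    return fused, weights

# Hypothetical data: three clients in the "sketch" domain report a prototype for class 0.
local = {("sketch", 0): [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]}

# One global prototype per (domain, class) pair instead of one per class.
global_protos = {key: fuse_domain_prototypes(protos)[0] for key, protos in local.items()}
fused, weights = fuse_domain_prototypes(local[("sketch", 0)])
```

In this toy example the third client's prototype points away from the other two, so it receives the smallest weight. Keying the result by `(domain, class)` is what keeps the domain information that a single global prototype per class would discard.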

67. 【2604.06789】Video-guided Machine Translation with Global Video Context

Link: https://arxiv.org/abs/2604.06789

Authors: Jian Chen, JinZe Lv, Zi Long, XiangHua Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Video-guided Multimodal Translation, Video-guided Multimodal, recent years, globally video-guided multimodal, Multimodal Translation

Comments:

68. 【2604.06783】Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Link: https://arxiv.org/abs/2604.06783

Authors: Bohao Xing, Deng Li, Rong Gao, Xin Liu, Heikki Kälviäinen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, video tasks, made significant, significant progress, vision tasks

Comments:

Abstract:Recently, Transformers have made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works rely heavily on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at this https URL.

69. 【2604.06782】EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling

Link: https://arxiv.org/abs/2604.06782

Authors: Qingguo Meng, Xingbo Dong, Zhe Jin, Massimo Tistarelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: promising sensing modality, event-based face recognition, face recognition, Event cameras offer, face recognition due

Comments:

Abstract:Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
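
The Low-Rank Adaptation (LoRA) step mentioned in the EventFace abstract can be illustrated with the standard LoRA update, W' = W + (alpha / r) · B · A, where only the small matrices A and B are trained. The sketch below shows that general formula in plain Python. Which layers EventFace adapts and which rank it uses are not stated in the abstract, so the shapes and values here are purely illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight after a LoRA update.

    W: d_out x d_in frozen pretrained weight
    A: r x d_in trainable down-projection
    B: d_out x r trainable up-projection
    """
    r = len(A)            # the rank is the number of rows of A
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update with the same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Illustrative values only: a 2 x 2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [0.0]]
adapted = lora_weight(W, A, B, alpha=2.0)

# With B initialised to zeros, the adapted weight equals the pretrained one.
unchanged = lora_weight(W, A, [[0.0], [0.0]], alpha=2.0)
```

Because B usually starts at zero, fine-tuning begins from the pretrained RGB model's behavior and only gradually departs from it, which is one reason this kind of adapter suits small datasets.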

70. 【2604.06777】Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Link: https://arxiv.org/abs/2604.06777

Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuchen Zhou, Xiaobo Xia, Yuanyu Wan, Lijun Zhang, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Recent advancements, actively invoking visual

Comments:

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.

71. 【2604.06770】FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts

Link: https://arxiv.org/abs/2604.06770

Authors: Guillermo Gil de Avalle, Laura Maruster, Eric Sloot, Christos Emmanouilidis

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Maintenance procedures, procedures in manufacturing, manufacturing facilities, static PDFs, PDFs or scanned

Comments:

Abstract:Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection performance and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.

72. 【2604.06757】FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Link: https://arxiv.org/abs/2604.06757

Authors: Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: language dictates vision, long been dominated, dominated by text-driven, language dictates, dictates vision

Comments:

Abstract:Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.

73. 【2604.06750】How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Link: https://arxiv.org/abs/2604.06750

Authors: Roberto Brusnicki, Mattia Piccinini, Johannes Betz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains poorly characterized, sequential driving scenes, scenes remains poorly, autonomous driving tasks, sequential driving

Comments: 8 pages, 5 figures

Abstract:Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal that even top models achieve only 57% accuracy, falling short of human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel at static object detection but struggle to understand vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations (resolution, frame count, temporal intervals, spatial layouts, and presentation modes) affect performance on sequential driving scenes. Supplementary material is available at this https URL

74. 【2604.06748】From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Link: https://arxiv.org/abs/2604.06748

Authors: Carlos Schmidt, Simon Reiß

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual in-context learning, enabling rapid generalization, Visual in-context, in-context learning models, in-context learning

Comments:

Abstract:Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.

75. 【2604.06740】LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

Link: https://arxiv.org/abs/2604.06740

Authors: Pedro Quesado, Erkut Akdag, Yasaman Kashefbahrami, Willem Menu, Egor Bondarev

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: View Synthesis, range of applications, remains an open, open challenge, wide range

Comments:

Abstract:Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations ($\approx 2.67$s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of $0.07$s per frame at $1024 \times 768$ resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: this https URL

76. 【2604.06739】DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting

Link: https://arxiv.org/abs/2604.06739

Authors: Hantang Li, Qiang Zhu, Xiandong Meng, Debin Zhao, Xiaopeng Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamentally ill-posed due, Gaussian Splatting, insufficient geometric supervision, translucent haze-like artifacts, fundamentally ill-posed

Comments: 10 pages, 5 figures

Abstract:Sparse-view reconstruction with 3D Gaussian Splatting (3DGS) is fundamentally ill-posed due to insufficient geometric supervision, often leading to severe overfitting and the emergence of structural distortions and translucent haze-like artifacts. While existing approaches attempt to alleviate this issue via dropout-based regularization, they are largely heuristic and lack a unified understanding of artifact formation. In this paper, we revisit sparse-view 3DGS reconstruction from a new perspective and identify the core challenge as the unobservability of Gaussian primitive reliability. Unreliable Gaussians are insufficiently constrained during optimization and accumulate as haze-like degradations in rendered images. Motivated by this observation, we propose a unified Dual-domain Observation and Calibration (DOC-GS) framework that models and corrects Gaussian reliability through the synergy of optimization-domain inductive bias and observation-domain evidence. Specifically, in the optimization domain, we characterize Gaussian reliability by the degree to which each primitive is constrained during training, and instantiate this signal via a Continuous Depth-Guided Dropout (CDGD) strategy, where the dropout probability serves as an explicit proxy for primitive reliability. This imposes a smooth depth-aware inductive bias to suppress weakly constrained Gaussians and improve optimization stability. In the observation domain, we establish a connection between floater artifacts and atmospheric scattering, and leverage the Dark Channel Prior (DCP) as a structural consistency cue to identify and accumulate anomalous regions. Based on cross-view aggregated evidence, we further design a reliability-driven geometric pruning strategy to remove low-confidence Gaussians.
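
The Dark Channel Prior (DCP) referenced in the DOC-GS abstract has a simple classical definition: take the minimum over the color channels at each pixel, then the minimum over a local patch. Haze-free regions tend to have a dark channel near zero, while hazy regions do not. The sketch below computes it in plain Python. The thresholding step and the `threshold` value are illustrative assumptions; how DOC-GS aggregates this cue across views is not described in the abstract.

```python
def dark_channel(image, patch=3):
    """Compute the dark channel of an image.

    image: H x W nested list of (r, g, b) tuples with values in [0, 1].
    Returns an H x W nested list of floats.
    """
    h, w = len(image), len(image[0])
    r = patch // 2
    # Step 1: minimum over the color channels at every pixel.
    per_pixel_min = [[min(px) for px in row] for row in image]
    # Step 2: minimum over a patch x patch window, clipped at the borders.
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = min(
                per_pixel_min[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out

# Toy 1 x 6 image: three saturated pixels followed by three grayish "haze" pixels.
image = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
          (0.7, 0.7, 0.8), (0.8, 0.7, 0.7), (0.7, 0.8, 0.7)]]
dc = dark_channel(image, patch=3)

# Hypothetical threshold: flag pixels whose dark channel stays high as haze-like.
threshold = 0.5
haze_mask = [[value > threshold for value in row] for row in dc]
```

Only the last two pixels are flagged here: the fourth pixel is hazy, but its 3-pixel window still contains a saturated neighbor, which pulls its dark channel down to zero.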

77. 【2604.06728】URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Link: https://arxiv.org/abs/2604.06728

Authors: Zhenyu Wang, Weichen Cheng, Weijia Li, Junjie Mou, Zongyou Zhao, Guoying Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: identify sarcastic intent, Multimodal sarcasm detection, sarcasm detection, aims to identify, identify sarcastic

Comments:

Abstract:Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.
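
One way to picture uncertainty-regulated fusion as described in the URMF abstract: each modality is summarized by a Gaussian with a mean vector and a log-variance vector, and modalities with larger variance contribute less to the fused representation. The sketch below uses normalized inverse-variance weights. That rule is an assumption chosen for illustration, not the paper's stated formula, and `fuse_modalities` is a hypothetical name.

```python
import math

def fuse_modalities(modalities):
    """Fuse modality means, down-weighting uncertain modalities.

    modalities: dict mapping a name to (mu, log_var), two equal-length lists.
    Assumption (not from the paper): a modality's weight is proportional to
    the inverse of its average variance.
    """
    precisions = {}
    for name, (mu, log_var) in modalities.items():
        avg_var = sum(math.exp(lv) for lv in log_var) / len(log_var)
        precisions[name] = 1.0 / avg_var
    total = sum(precisions.values())
    weights = {name: p / total for name, p in precisions.items()}
    dim = len(next(iter(modalities.values()))[0])
    fused = [sum(weights[n] * modalities[n][0][i] for n in modalities)
             for i in range(dim)]
    return fused, weights

# Illustrative input: the image branch is three times as uncertain as the text branch.
modalities = {
    "text": ([1.0, 0.0], [0.0, 0.0]),                    # variance 1
    "image": ([0.0, 1.0], [math.log(3.0), math.log(3.0)]),  # variance 3
}
fused, weights = fuse_modalities(modalities)
```

With these inputs the text branch receives a weight of about 0.75 and the image branch about 0.25, so a weakly relevant image shifts the fused representation less than it would under equal-weight fusion.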

78. 【2604.06725】Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

Link: https://arxiv.org/abs/2604.06725

Authors: Jiahua Chen, Qihong Tang, Weinong Wang, Qi Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, achieved remarkable progress, Multimodal Large, Large Language

Comments:

Abstract:Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.

79. 【2604.06720】Exploring 6D Object Pose Estimation with Deformation

Link: https://arxiv.org/abs/2604.06720

Authors: Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object pose, object pose methods, present DeSOPE, https URL, pose

Comments:

Abstract:We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications. The project page and dataset are available at this https URL.

80. 【2604.06715】HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2604.06715

Authors: Md Aminur Hossain, Ayush V. Patel, Siddhant Gole, Sanjay K. Singh, Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: jointly capture fine, capture fine spatial, fine spatial details, segmentation requires models, high-level semantic context

Comments: 17 pages

Abstract:Remote sensing semantic segmentation requires models that can jointly capture fine spatial details and high-level semantic context across complex scenes. While classical encoder-decoder architectures such as U-Net remain strong baselines, they often struggle to fully exploit global semantics and structured feature interactions. In this work, we propose HQF-Net, a hybrid quantum-classical multi-scale fusion network for remote sensing image segmentation. HQF-Net integrates multi-scale semantic guidance from a frozen DINOv3 ViT-L/16 backbone with a customized U-Net architecture through a Deformable Multiscale Cross-Attention Fusion (DMCAF) module. To enhance feature refinement, the framework further introduces quantum-enhanced skip connections (QSkip) and a Quantum bottleneck with Mixture-of-Experts (QMoE), which combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism. Experiments on three remote sensing benchmarks show consistent improvements with the proposed design. HQF-Net achieves 0.8568 mIoU and 96.87% overall accuracy on this http URL, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. An architectural ablation study further confirms the contribution of each major component. These results show that structured hybrid quantum-classical feature processing is a promising direction for improving remote sensing semantic segmentation under near-term quantum constraints.

81. 【2604.06714】Steering the Verifiability of Multimodal AI Hallucinations

Link: https://arxiv.org/abs/2604.06714

Authors: Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu, Xuanjing Huang, Yu-Gang Jiang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: pose considerable risks, human users, multimodal large language, large language models, large language

Comments:

82. 【2604.06713】Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency

Link: https://arxiv.org/abs/2604.06713

Authors: Ke Jin, Jiming Chen, Qi Ye

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent semi-dense image, achieved remarkable success, Recent semi-dense, semi-dense image matching, remarkable success

Comments:

Abstract:Recent semi-dense image matching methods have achieved remarkable success, but two long-standing issues still impair their performance. At the coarse stage, the over-exclusion issue of their mutual nearest neighbor (MNN) matching layer makes them struggle to handle cases with scale difference between images. To this end, we comprehensively revisit the matching mechanism and make a key observation that the hint concealed in the score matrix can be exploited to indicate the scale ratio. Based on this, we propose a scale-aware matching module which is exceptionally effective but introduces negligible overhead. At the fine stage, we point out that existing methods neglect the local consistency of final matches, which undermines their robustness. To this end, rather than independently predicting the correspondence for each source pixel, we reformulate the fine stage as a cascaded flow refinement problem and introduce a novel gradient loss to encourage local consistency of the flow field. Extensive experiments demonstrate that our novel matching pipeline, with these proposed modifications, achieves robust and accurate matching performance on downstream tasks.
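
The over-exclusion issue of mutual nearest neighbor (MNN) matching described above can be seen in a few lines of plain Python. The sketch below implements standard MNN selection on a score matrix. The toy scores are invented for illustration, and the paper's scale-aware module is not reproduced here.

```python
def mutual_nearest_neighbors(score):
    """Return (i, j) pairs that are each other's best match.

    score[i][j] is the similarity between coarse feature i of image A
    and coarse feature j of image B.
    """
    n_rows, n_cols = len(score), len(score[0])
    # Best column for every row, and best row for every column.
    row_best = [max(range(n_cols), key=row.__getitem__) for row in score]
    col_best = [max(range(n_rows), key=lambda i, j=j: score[i][j])
                for j in range(n_cols)]
    # Keep a pair only when both directions agree.
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]

# Invented scores: image B is zoomed in, so patch 0 of A covers patches 0 and 1 of B.
score = [
    [0.9, 0.8, 0.1],
    [0.2, 0.3, 0.7],
]
matches = mutual_nearest_neighbors(score)
```

Patch 1 of B clearly prefers patch 0 of A (score 0.8), yet the pair is dropped because patch 0 of A slightly prefers patch 0 of B. Under a scale difference this one-to-one constraint discards valid one-to-many correspondences, which is the behavior the abstract refers to as over-exclusion.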

83. 【2604.06711】Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation

Link: https://arxiv.org/abs/2604.06711

Authors: Jianing Zhang, Runan Li, Honglin Pang, Ding Xia, Zhou Zhu, Qian Zhang, Chuntao Li, Xi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Oracle Bone Script, Chinese Oracle Bone, Deciphering ancient Chinese, ancient Chinese Oracle, Bone Script

Comments:

84. 【2604.06687】RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

Link: https://arxiv.org/abs/2604.06687

Authors: Hui Li, Peien Ding, Jun Li, Guoqi Ma, Zhanyu Liu, Ge Xu, Junfeng Yao, Jinsong Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial research direction, online information, crucial research, research direction, direction for maintaining

Comments: 10 pages, 5 figures

Abstract:Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or utilizing pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) lacking cross-instance global semantic correlations, making it difficult to effectively utilize historical associative evidence to verify the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, lacking the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves the overall detection accuracy by up to 0.93%.

85. 【2604.06665】VDPP: Video Depth Post-Processing for Speed and Scalability

Link: https://arxiv.org/abs/2604.06665

Authors: Daewon Yoon, Injun Baek, Sangyu Han, Yearim Kim, Nojun Kwak

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video depth, Video depth estimation, video depth models, essential for providing, mixed reality

Comments: 8 pages, 6 figures. Accepted to CVPR 2024 Workshop. Project page: [this https URL](https://github.com/injun-baek/VDPP)

Click to view abstract

Abstract:Video depth estimation is essential for providing 3D scene structure in applications ranging from autonomous driving to mixed reality. Although current end-to-end (E2E) video depth models have achieved state-of-the-art performance, they function as tightly coupled systems that suffer from a significant adaptation lag whenever superior single-image depth estimators are released. To mitigate this issue, post-processing methods such as NVDS offer a modular plug-and-play alternative to incorporate any evolving image depth model without retraining. However, existing post-processing methods still struggle to match the efficiency and practicality of E2E systems due to limited speed, accuracy, and RGB reliance. In this work, we revitalize the role of post-processing by proposing VDPP (Video Depth Post-Processing), a framework that improves the speed and accuracy of post-processing methods for video depth estimation. By shifting the paradigm from computationally expensive scene reconstruction to targeted geometric refinement, VDPP operates purely on geometric refinements in low-resolution space. This design achieves exceptional speed (43.5 FPS on NVIDIA Jetson Orin Nano) while matching the temporal coherence of E2E systems, with dense residual learning driving geometric representations rather than full reconstructions. Furthermore, our VDPP's RGB-free architecture ensures true scalability, enabling immediate integration with any evolving image depth model. Our results demonstrate that VDPP provides a superior balance of speed, accuracy, and memory efficiency, making it the most practical solution for real-time edge deployment. Our project page is at this https URL

86. 【2604.06662】Towards Robust Content Watermarking Against Removal and Forgery Attacks

Link: https://arxiv.org/abs/2604.06662

Authors: Yifan Zhu, Yihan Wang, Xiao-Shan Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Generated contents, image provenance, copyright protection, credit attribution, raised serious concerns

Comments: 14 pages, 5 figures, CVPR 2026 Findings

Click to view abstract

Abstract:Generated contents have raised serious concerns about copyright protection, image provenance, and credit attribution. A potential solution for these problems is watermarking. Recently, content watermarking for text-to-image diffusion models has been studied extensively for its effective detection utility and robustness. However, these watermarking techniques are vulnerable to potential adversarial attacks, such as removal attacks and forgery attacks. In this paper, we build a novel watermarking paradigm called Instance-Specific watermarking with Two-Sided detection (ISTS) to resist removal and forgery attacks. Specifically, we introduce a strategy that dynamically controls the injection time and watermarking patterns based on the semantics of users' prompts. Furthermore, we propose a new two-sided detection approach to enhance robustness in watermark detection. Experiments have demonstrated the superiority of our watermarking against removal and forgery attacks.

87. 【2604.06658】GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation

Link: https://arxiv.org/abs/2604.06658

Authors: Chung-Ming Lo, I-Yun Liu, Wei-Yang Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning, medical image segmentation, widely applied, medical image, image segmentation

Comments:

Click to view abstract

Abstract:Deep learning has been widely applied to 3D medical image segmentation tasks. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains a challenge. This study proposed GPAFormer, a lightweight network architecture specifically designed for 3D medical image segmentation, emphasizing efficiency while keeping high accuracy. GPAFormer incorporated two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA utilized three parallel paths with different receptive fields, combined through planar aggregation, to enhance the network's capability in handling structures of varying sizes. MPGA employed a graph-guided approach to dynamically aggregate regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both internal and boundary structures of organs. Experiments were performed on public whole-body CT and MRI datasets including BTCV, Synapse, ACDC, and BraTS. Compared to existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the highest overall DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). On a consumer-level GPU, inference for one BTCV validation case took less than one second. The results demonstrated that GPAFormer balanced accuracy and efficiency in multi-organ, multi-modality 3D segmentation tasks across various clinical scenarios, especially in resource-constrained and time-sensitive clinical environments.

88. 【2604.06655】Controllable Generative Video Compression

Link: https://arxiv.org/abs/2604.06655

Authors: Ding Ding, Daowen Li, Ying Chen, Yixin Gao, Ruixiao Dong, Kai Li, Li Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: compression adopts generative, generative video modeling, adopts generative video, Generative Video Compression, faithfully reproduce visual

Comments:

Click to view abstract

Abstract:Perceptual video compression adopts generative video modeling to improve perceptual realism but frequently sacrifices signal fidelity, diverging from the goal of video compression to faithfully reproduce visual signal. To alleviate the dilemma between perception and fidelity, in this paper we propose Controllable Generative Video Compression (CGVC) paradigm to faithfully generate details guided by multiple visual conditions. Under the paradigm, representative keyframes of the scene are coded and used to provide structural priors for non-keyframe generation. Dense per-frame control prior is additionally coded to better preserve finer structure and semantics of each non-keyframe. Guided by these priors, non-keyframes are reconstructed by controllable video generation model with temporal and content consistency. Furthermore, to accurately recover color information of the video, we develop a color-distance-guided keyframe selection algorithm to adaptively choose keyframes. Experimental results show CGVC outperforms previous perceptual video compression method in terms of both signal fidelity and perceptual quality.

89. 【2604.06644】Variational Feature Compression for Model-Specific Representations

Link: https://arxiv.org/abs/2604.06644

Authors: Zinan Guo, Zihan Wang, Chuan Yan, Liuhuo Wan, Ethan Ma, Guangdong Bai

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: deep learning inference, deep learning, increasingly deployed, deployed in shared, shared and cloud-based

Comments:

Click to view abstract

Abstract:As deep learning inference is increasingly deployed in shared and cloud-based settings, a growing concern is input repurposing, in which data submitted for one task is reused by unauthorized models for another. Existing privacy defenses largely focus on restricting data access, but provide limited control over what downstream uses a released representation can still support. We propose a feature extraction framework that suppresses cross-model transfer while preserving accuracy for a designated classifier. The framework employs a variational latent bottleneck, trained with a task-driven cross-entropy objective and KL regularization, but without any pixel-level reconstruction loss, to encode inputs into a compact latent space. A dynamic binary mask, computed from per-dimension KL divergence and gradient-based saliency with respect to the frozen target model, suppresses latent dimensions that are uninformative for the intended task. Because saliency computation requires gradient access, the encoder is trained in a white-box setting, whereas inference requires only a forward pass through the frozen target model. On CIFAR-100, the processed representations retain strong utility for the designated classifier while reducing the accuracy of all unintended classifiers to below 2%, yielding a suppression ratio exceeding 45 times relative to unintended models. Preliminary experiments on CIFAR-10, Tiny ImageNet, and Pascal VOC provide exploratory evidence that the approach extends across task settings, although further evaluation is needed to assess robustness against adaptive adversaries.
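
The per-dimension KL term and the masking step mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes a diagonal Gaussian posterior with a standard normal prior, for which the closed-form KL is standard. How KL and saliency are combined is not stated in the abstract, so the product score and the `keep_ratio` parameter below are hypothetical choices.

```python
import math


def kl_per_dim(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ) for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, logvar)]


def binary_mask(mu, logvar, saliency, keep_ratio=0.5):
    """Keep the latent dimensions with the largest KL * |saliency| score.

    Assumption: the product score and keep_ratio are illustrative only;
    the abstract does not specify how the two signals are combined.
    """
    kl = kl_per_dim(mu, logvar)
    scores = [k * abs(s) for k, s in zip(kl, saliency)]
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


# Dimensions 1 and 3 carry the most information under this score.
mask = binary_mask(mu=[0.0, 2.0, 0.1, 1.0],
                   logvar=[0.0, 0.0, 0.0, 0.0],
                   saliency=[1.0, 0.5, 3.0, 1.0])
```

A dimension whose posterior matches the prior (zero mean, unit variance) has zero KL and is always among the first to be masked under this score, whatever its saliency.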

90. 【2604.06631】SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport

Link: https://arxiv.org/abs/2604.06631

Authors: Zheng Jiang, Nan He, Yiming Chen, Lifeng Sun

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Federated Learning, preserving data privacy, enables collaborative model, enables collaborative, statistical heterogeneity

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel's deviation from the global model, with the penalty's strength scaled by the client's pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.

91. 【2604.06623】WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression

Link: https://arxiv.org/abs/2604.06623

Authors: Weikai Qu, Sijun Liang, Cheng Pan, Zikuan Yang, Guanchi Zhou, Xianjun Fu, Bo Liu, Changmiao Wang, Ahmed Elazab

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low brightness due, adverse weather conditions, suffer from blurriness, interference from rain, low brightness

Comments: Accepted by IEEE Transactions on Artificial Intelligence

Click to view abstract

Abstract:Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at this https URL.

92. 【2604.06622】Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction

Link: https://arxiv.org/abs/2604.06622

Authors: Weikai Qu, Sijun Liang, Xianfeng Li, Cheng Pan, An Yan, Ahmed Elazab, Shanzhou Niu, Dong Zeng, Xiang Wan, Changmiao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computed tomography imaging, hinder diagnostic accuracy, implants frequently generate, frequently generate severe, metal implants frequently

Comments: Accepted by IEEE Transactions on Radiation and Plasma Medical Sciences

Click to view abstract

Abstract:In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, which effectively eliminates artifacts caused by metals of different sizes while maintaining the integrity of the original anatomical structures of the image. Furthermore, this model only focuses on CT images affected by metal artifacts, thus negating the requirement for additional input data. The model is a streamlined UNet architecture, which incorporates multi-scale Mamba (MS-Mamba) as its core module. Within MS-Mamba, a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations. Subsequently, the average maximum feed-forward network integrates critical features with average features to suppress the artifacts. This combination allows MARMamba to reduce artifacts efficiently. The experimental results demonstrate that our model excels in reducing metal artifacts, offering distinct advantages over other models. It also strikes an optimal balance between computational demands, memory usage, and the number of parameters, highlighting its practical utility in the real world. The code of the presented model is available at: this https URL.

93. 【2604.06614】Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels

Link: https://arxiv.org/abs/2604.06614

Authors: Yaqi Zhao, Haoliang Sun, Yating Wang, Yongshun Gong, Yilong Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: gained significant attention, large pre-trained vision-language, pre-trained vision-language models, adapting large pre-trained, downstream tasks

Comments:

Click to view abstract

Abstract:Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the top frequent labels from the nearest neighbors' candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Those results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
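
The local density-based filter described in this abstract might look roughly like the sketch below. It is an interpretation under stated assumptions, not the HopS code: the Euclidean distance, the parameters `k` and `top_m`, the restriction of votes to the sample's own candidate set, and the fallback when no neighbour votes are all hypothetical details that the abstract does not specify.

```python
import math
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_label_select(idx, features, candidate_sets, logits, k=3, top_m=2):
    """Pick one label for sample `idx` from its partial-label candidate set."""
    def dist(j):
        return sum((a - b) ** 2 for a, b in zip(features[idx], features[j]))

    # k nearest neighbours in feature space, excluding the sample itself.
    others = [j for j in range(len(features)) if j != idx]
    neighbours = sorted(others, key=dist)[:k]

    # Count how often each of the sample's candidates occurs among
    # the neighbours' candidate sets.
    votes = Counter()
    for j in neighbours:
        for label in candidate_sets[j]:
            if label in candidate_sets[idx]:
                votes[label] += 1

    # Shortlist the most frequent labels (assumed fallback: all candidates).
    shortlist = [label for label, _ in votes.most_common(top_m)]
    if not shortlist:
        shortlist = sorted(candidate_sets[idx])

    # Among the shortlist, take the label with the highest softmax score.
    probs = softmax(logits[idx])
    return max(shortlist, key=lambda label: probs[label])


features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
candidate_sets = [{0, 1, 2}, {0, 1}, {1}, {2}]
logits = [[2.0, 1.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
chosen = local_label_select(0, features, candidate_sets, logits, k=2)
```

In this toy example label 2 has the highest score for sample 0, but none of its two nearest neighbours lists it as a candidate, so the filter shortlists labels 1 and 0 and the softmax step selects label 0.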

94. 【2604.06583】VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography

Link: https://arxiv.org/abs/2604.06583

Authors: Ilerioluwakiiye Abolade, Prince Mireku, Kelechi Chibundu, Peace Ododo, Emmanuel Idoko, Promise Omoigui, Solomon Odelola

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optical coherence tomography, coherence tomography angiography, robust representations remains, representations remains challenging, remains challenging due

Comments: 8 pages, 5 figures. Accepted at ICPR 2026

Click to view abstract

Abstract:Optical coherence tomography angiography (OCTA) provides non-invasive visualization of retinal microvasculature, but learning robust representations remains challenging due to sparse vessel structures and strong topological constraints. Many existing self-supervised learning approaches, including masked autoencoders, are primarily designed for dense natural images and rely on uniform masking and pixel-level reconstruction, which may inadequately capture vascular geometry. We propose VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. The approach incorporates anatomically informed masking that emphasizes vessel-rich regions using vesselness and skeleton-based cues, encouraging the model to focus on vascular connectivity and branching patterns. In addition, the pretraining objective includes reconstructing multiple complementary targets, enabling the model to capture appearance, structural, and topological information. We evaluate the proposed pretraining strategy on the OCTA-500 benchmark for several vessel segmentation tasks under varying levels of supervision. The results indicate that vessel-aware masking and multi-target reconstruction provide consistent improvements over standard masked autoencoding baselines, particularly in limited-label settings, suggesting the potential of geometry-aware self-supervised learning for OCTA analysis.

95. 【2604.06576】LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

Link: https://arxiv.org/abs/2604.06576

Authors: Shuai Li, Huibin Bai, Yanbo Gao, Chong Lv, Hui Yuan, Chuankun Li, Wei Hua, Tian Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: attracted increasing interest, Monocular depth estimation, depth, MDE, past few years

Comments: Accepted by IEEE Transactions on Multimedia

Click to view abstract

Abstract:Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

96. 【2604.06494】DesigNet: Learning to Draw Vector Graphics as Designers Do

Link: https://arxiv.org/abs/2604.06494

Authors: Tomas Guija-Valiente, Iago Suárez

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: AI-driven content generation, made remarkable progress, Scalable Vector Graphics, AI-driven content, recent years

Comments:

Click to view abstract

Abstract:AI-driven content generation has made remarkable progress in recent years. However, neural networks and human designers operate in fundamentally different ways, making collaboration between them challenging. We address this gap for Scalable Vector Graphics (SVG) by equipping neural networks with tools commonly used by designers, such as axis alignment and explicit continuity control at command junctions. We introduce DesigNet, a hierarchical Transformer-VAE that operates directly on SVG sequences with a continuous command parameterization. Our main contributions are two differentiable modules: a continuity self-refinement module that predicts $C^0$, $G^1$, and $C^1$ continuity for each curve point and enforces it by modifying Bézier control points, and an alignment self-refinement module with snapping capabilities for horizontal or vertical lines. DesigNet produces editable outlines and achieves competitive results against state-of-the-art methods, with notably higher accuracy in continuity and alignment. These properties ensure the outputs are easier to refine and integrate into professional design workflows. Source Code: this https URL.
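
For readers unfamiliar with the continuity classes named in this abstract, the sketch below shows the standard geometric conditions at a junction between two cubic Bézier segments. It is only a fixed-rule illustration with hypothetical function names. DesigNet's module is described as differentiable and prediction-driven, and this is not a reproduction of it. Here `junction` is the shared endpoint (sharing it is already $C^0$), `handle_in` is the last control point of the incoming segment, and `handle_out` is the first control point of the outgoing segment.

```python
import math


def enforce_c1(junction, handle_in):
    """C^1: the outgoing handle mirrors the incoming handle about the junction,
    so both tangent direction and magnitude match."""
    return (2 * junction[0] - handle_in[0], 2 * junction[1] - handle_in[1])


def enforce_g1(junction, handle_in, handle_out):
    """G^1: align the outgoing handle with the incoming tangent direction
    while keeping the outgoing handle's original length."""
    dx, dy = junction[0] - handle_in[0], junction[1] - handle_in[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # Degenerate incoming handle: tangent undefined, leave unchanged.
        return handle_out
    length = math.hypot(handle_out[0] - junction[0], handle_out[1] - junction[1])
    return (junction[0] + length * dx / norm, junction[1] + length * dy / norm)


c1_handle = enforce_c1((1.0, 1.0), (0.0, 0.0))                # (2.0, 2.0)
g1_handle = enforce_g1((0.0, 0.0), (-1.0, 0.0), (0.0, 2.0))   # (2.0, 0.0)
```

The $G^1$ variant leaves the designer's handle length untouched and only rotates it, which is why it is the weaker (direction-only) condition.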

97. 【2604.06481】Hybrid ResNet-1D-BiGRU with Multi-Head Attention for Cyberattack Detection in Industrial IoT Environments

Link: https://arxiv.org/abs/2604.06481

Authors: Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: attention-based feature weighting, spatial-temporal feature extraction, effective spatial-temporal feature, hybrid deep learning, Multi-Head Attention

Comments:

Click to view abstract

Abstract:This study introduces a hybrid deep learning model for intrusion detection in Industrial IoT (IIoT) systems, combining ResNet-1D, BiGRU, and Multi-Head Attention (MHA) for effective spatial-temporal feature extraction and attention-based feature weighting. To address class imbalance, SMOTE was applied during training on the Edge-IIoTset dataset. The model achieved 98.71% accuracy, a loss of 0.0417%, and low inference latency (0.0001 sec/instance), demonstrating strong real-time capability. To assess generalizability, the model was also tested on the CICIoV2024 dataset, where it reached 99.99% accuracy and F1-score, with a loss of 0.0028, 0% FPR, and 0.00014 sec/instance inference time. Across all metrics and datasets, the proposed model outperformed existing methods, confirming its robustness and effectiveness for real-time IoT intrusion detection.
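
The class-imbalance step mentioned in this abstract relies on SMOTE, whose core idea is to create a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that single interpolation step in plain Python. The function name and parameters are hypothetical, and the paper's actual pipeline and settings are not given in the abstract.

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Generate one synthetic point from a list of minority-class vectors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    def dist(point):
        return sum((a - b) ** 2 for a, b in zip(base, point))

    # k nearest minority neighbours of the chosen base point.
    neighbours = sorted((p for p in minority if p is not base), key=dist)[:k]
    neighbour = rng.choice(neighbours)

    # Interpolate at a random position on the segment base -> neighbour.
    u = rng.random()
    return [a + u * (b - a) for a, b in zip(base, neighbour)]


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_sample(minority, rng=random.Random(42))
```

Because every synthetic point lies on a segment between two real minority samples, oversampling stays inside the region the minority class already occupies.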

98. 【2604.06469】Predicting Alzheimer's disease progression using rs-fMRI and a history-aware graph neural network

Link: https://arxiv.org/abs/2604.06469

Authors: Mahdi Moghaddami, Mohammad-Reza Siadat, Austin Toma, Connor Laming, Huirong Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: United States, Alzheimer disease, cognitive impairment, neurodegenerative disorder, disorder that affects

Comments: Proc. SPIE 13926, Medical Imaging 2026: Computer-Aided Diagnosis, 1392604

Click to view abstract

Abstract:Alzheimer's disease (AD) is a neurodegenerative disorder that affects more than seven million people in the United States alone. AD currently has no cure, but there are ways to potentially slow its progression if caught early enough. In this study, we propose a graph neural network (GNN)-based model for predicting whether a subject will transition to a more severe stage of cognitive impairment at their next clinical visit. We consider three stages of cognitive impairment in order of severity: cognitively normal (CN), mild cognitive impairment (MCI), and AD. We use functional connectivity graphs derived from resting-state functional magnetic resonance imaging (rs-fMRI) scans of 303 subjects, each with a different number of visits. Our GNN-based model incorporates a recurrent neural network (RNN) block, enabling it to process data from the subject's entire visit history. It can also work with irregular time gaps between visits by incorporating visit distance information into our input features. Our model demonstrates robust predictive performance, even with missing visits in the subjects' visit histories. It achieves an accuracy of 82.9%, with an especially impressive accuracy of 68.8% on CN to MCI conversions - a task that poses a substantial challenge in the field. Our results highlight the effectiveness of rs-fMRI in predicting the onset of MCI or AD and, in conjunction with other modalities, could offer a viable method for enabling timely interventions to slow the progression of cognitive impairment.

99. 【2604.06467】PhysHead: Simulation-Ready Gaussian Head Avatars

Link: https://arxiv.org/abs/2604.06467

Authors: Berna Kabadayi, Vanessa Sklyarova, Wojciech Zielonka, Justus Thies, Gerard Pons-Moll

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Realistic digital avatars, head, hair, avatar methods assume, Realistic digital

Comments: Project Page: see [this https URL](https://phys-head.github.io/); YouTube Video: see [this https URL](https://www.youtube.com/watch?v=k68fsSSwzc0); Accepted to CVPR 2026

Click to view abstract

Abstract:Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion besides expression and camera control.

100. 【2604.06440】Visual prompting reimagined: The power of the Activation Prompts

Link: https://arxiv.org/abs/2604.06440

Authors: Yihua Zhang, Hongkang Li, Yuguang Yao, Aochuan Chen, Shuai Zhang, Pin-Yu Chen, Meng Wang, Sijia Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: repurpose pretrained vision, downstream tasks, Visual prompting, repurpose pretrained, adaptation to downstream

Comments: AISTATS 2026

Click to view abstract

Abstract:Visual prompting (VP) has emerged as a popular method to repurpose pretrained vision models for adaptation to downstream tasks. Unlike conventional model fine-tuning techniques, VP introduces a universal perturbation directly into the input data to facilitate task-specific fine-tuning rather than modifying model parameters. However, there exists a noticeable performance gap between VP and conventional fine-tuning methods, highlighting an unexplored realm in theory and practice to understand and advance the input-level VP to reduce its current performance gap. Towards this end, we introduce a generalized concept, termed activation prompt (AP), which extends the scope of the input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. By using AP to revisit the problem of VP and employing it as an analytical tool, we demonstrate the intrinsic limitations of VP in both performance and efficiency, revealing why input-level prompting may lack effectiveness compared to AP, which exhibits a model-dependent layer preference. We show that AP is closely related to normalization tuning in convolutional neural networks and vision transformers, although each model type has distinct layer preferences for prompting. We also theoretically elucidate the rationale behind such a preference by analyzing global features across layers. Through extensive experiments across 29 datasets and various model architectures, we provide a comprehensive performance analysis of AP, comparing it with VP and parameter-efficient fine-tuning baselines. Our results demonstrate AP's superiority in both accuracy and efficiency, considering factors such as time, parameters, memory usage, and throughput.

101. 【2604.06435】Continual Visual Anomaly Detection on the Edge: Benchmark and Efficient Solutions

Link: https://arxiv.org/abs/2604.06435

Authors: Manuel Barusco, Francesco Borsatti, David Petrovic, Davide Dalle Pezze, Gian Antonio Susto

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Anomaly Detection, Visual Anomaly, applications including industrial, including industrial inspection, Anomaly Detection

Comments:

Click to view abstract

Abstract:Visual Anomaly Detection (VAD) is a critical task for many applications including industrial inspection and healthcare. While VAD has been extensively studied, two key challenges remain largely unaddressed in conjunction: edge deployment, where computational resources are severely constrained, and continual learning, where models must adapt to evolving data distributions without forgetting previously acquired knowledge. Studying these challenges in isolation is insufficient, as methods designed for one setting make assumptions that break down when the other constraint is simultaneously imposed. In this work, we propose the first comprehensive benchmark for VAD on the edge in the continual learning scenario, evaluating seven VAD models across three lightweight backbone architectures. Our benchmark provides guidance for the selection of the optimal backbone and VAD method under joint efficiency and adaptability constraints, characterizing the trade-offs between memory footprint, inference cost, and detection performance. Furthermore, we propose Tiny-Dinomaly, a lightweight adaptation of the Dinomaly model built on the DINO foundation model that achieves 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points. Finally, we introduce targeted modifications to PatchCore and PaDiM to improve their efficiency in the continual learning setting.

102. 【2604.06422】When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Link: https://arxiv.org/abs/2604.06422

Authors: Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky, William Rudman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding when Vision-Language, behave unexpectedly, Color, Graded Color Attribution, reliably predict

Comments:

103. 【2604.06401】ProofSketcher: Hybrid LLM + Lightweight Proof Checker for Reliable Math/Logic Reasoning

Link: https://arxiv.org/abs/2604.06401

Authors: Kranthi Kommuru, Kunal Khanvilkar, Gaurav Parekh

Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: invalid inference patterns, large language models, persuasive argument, logical fields, minor missteps

Comments:

Abstract:Large language models (LLMs) can produce persuasive arguments in mathematics and logic, but such arguments often contain subtle missteps, including omitted side conditions, invalid inference patterns, or appeals to a lemma that cannot be derived from the context under discussion. These errors are notoriously hard to spot from the text alone, because even a flawed argument can appear mostly correct. Conversely, interactive theorem provers such as Lean and Coq achieve rigorous reliability by accepting only statements that pass every syntactic and semantic check performed by a small trusted kernel. Although this approach provides strong guarantees, it comes at a heavy price: the proof must be completely formalized, and the user or an auxiliary search program must supply a large amount of low-level detail. This paper presents a hybrid pipeline in which an LLM generates a typed proof sketch in a compact DSL and a lightweight trusted kernel expands the sketch into explicit proof obligations.

104. 【2604.06390】MorphDistill: Distilling Unified Morphological Knowledge from Pathology Foundation Models for Colorectal Cancer Survival Prediction

Link: https://arxiv.org/abs/2604.06390

Authors: Hikmat Khan, Usama Sajjad, Metin N. Gurcan, Anil Parwani, Wendy L. Frankel, Wei Chen, Muhammad Khalid Khan Niazi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: cancer-related mortality worldwide, remains a leading, mortality worldwide, pathology foundation models, foundation models

Comments:

Abstract:Background: Colorectal cancer (CRC) remains a leading cause of cancer-related mortality worldwide. Accurate survival prediction is essential for treatment stratification, yet existing pathology foundation models often overlook organ-specific features critical for CRC prognostication. Methods: We propose MorphDistill, a two-stage framework that distills complementary knowledge from multiple pathology foundation models into a compact CRC-specific encoder. In Stage I, a student encoder is trained using dimension-agnostic multi-teacher relational distillation with supervised contrastive regularization on large-scale colorectal datasets. This preserves inter-sample relationships from ten foundation models without explicit feature alignment. In Stage II, the encoder extracts patch-level features from whole-slide images, which are aggregated via attention-based multiple instance learning to predict five-year survival. Results: On the Alliance/CALGB 89803 cohort (n=424, stage III CRC), MorphDistill achieves an AUC of 0.68 (SD 0.08), an approximately 8% relative improvement over the strongest baseline (AUC 0.63). It also attains a C-index of 0.661 and a hazard ratio of 2.52 (95% CI: 1.73-3.65), outperforming all baselines. On an external TCGA cohort (n=562), it achieves a C-index of 0.628, demonstrating strong generalization across datasets and robustness across clinical subgroups. Conclusion: MorphDistill enables task-specific representation learning by integrating knowledge from multiple foundation models into a unified encoder. This approach provides an efficient strategy for prognostic modeling in computational pathology, with potential for broader oncology applications. Further validation across additional cohorts and disease stages is warranted.

Submission history: [v1] Tue, 7 Apr 2026 19:21:18 UTC (48,820 KB), from Hikmat Khan

105. 【2604.06376】MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Link: https://arxiv.org/abs/2604.06376

Authors: Xiangyu Peng, Can Qin, An Yan, Xinyi Yang, Zeyuan Chen, Ran Xu, Chien-Sheng Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, demonstrated strong capabilities, Multimodal large language, requires deep searching, integrating visual evidence

Comments:

Abstract:Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28, and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.

106. 【2604.06352】DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Link: https://arxiv.org/abs/2604.06352

Authors: Gautham Vinod, Siddeshwar Raghavan, Bruce Coburn, Fengqing Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Keywords: Accurate dietary assessment, image-based methods rely, precision nutrition, provide only coarse, Accurate dietary

Comments:

Abstract:Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.

107. 【2604.06349】Bi-Level Optimization for Single Domain Generalization

Link: https://arxiv.org/abs/2604.06349

Authors: Marzi Heidari, Hanping Zhang, Hao Yan, Yuhong Guo

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: unseen target domains, robust machine learning, single labeled source, remains a fundamental, unseen target

Comments: CVPR Findings Track, 2026

Abstract:Generalizing from a single labeled source domain to unseen target domains, without access to any target data during training, remains a fundamental challenge in robust machine learning. We address this underexplored setting, known as Single Domain Generalization (SDG), by proposing BiSDG, a bi-level optimization framework that explicitly decouples task learning from domain modeling. BiSDG simulates distribution shifts through surrogate domains constructed via label-preserving transformations of the source data. To capture domain-specific context, we propose a domain prompt encoder that generates lightweight modulation signals to produce augmenting features via feature-wise linear modulation. The learning process is formulated as a bi-level optimization problem: the inner objective optimizes task performance under fixed prompts, while the outer objective maximizes generalization across the surrogate domains by updating the domain prompt encoder. We further develop a practical gradient approximation scheme that enables efficient bi-level training without second-order derivatives. Extensive experiments on various SDG benchmarks demonstrate that BiSDG consistently outperforms prior methods, setting new state-of-the-art performance in the SDG setting.
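
Feature-wise linear modulation, which the abstract names as the mechanism for producing augmenting features, scales and shifts each feature channel with a pair of conditioning values. The sketch below is a minimal plain-Python illustration of that operation only. The function name `film` and the example `gamma` and `beta` values are assumptions made for illustration; in BiSDG these signals would presumably come from the domain prompt encoder, whose architecture the abstract does not describe.

```python
from typing import List, Sequence


def film(features: Sequence[float],
         gamma: Sequence[float],
         beta: Sequence[float]) -> List[float]:
    """Feature-wise linear modulation: out[c] = gamma[c] * features[c] + beta[c]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta need one entry per channel")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical modulation signals for a 3-channel feature vector.
features = [1.0, -2.0, 0.5]
gamma = [2.0, 1.0, 0.0]  # per-channel scale
beta = [0.5, 0.0, 1.0]   # per-channel shift
augmented = film(features, gamma, beta)
print(augmented)  # [2.5, -2.0, 1.0]
```

Setting every `gamma` entry to 1 and every `beta` entry to 0 leaves the features unchanged, so the modulation can be read as a learned perturbation around the identity.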

108. 【2604.06347】Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents

Link: https://arxiv.org/abs/2604.06347

Authors: Peng Huang, Yiming Wang, Yineng Chen, Liangqiao Gui, Hui Guo, Bo Peng, Shu Hu, Xi Wu, Tsao Connie, Hongtu Zhu, Balakrishnan Prabhakaran, Xin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cardiovascular diseases, plays an important, screening and diagnosis, diagnosis of cardiovascular, Echocardiography plays

Comments: CVPRW 2026 (AIMS)

Abstract:Echocardiography plays an important role in the screening and diagnosis of cardiovascular diseases. However, automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. In recent years, visual language models (VLM) have opened a new avenue for building ultrasound understanding systems for clinical decision support. Nevertheless, most existing methods formulate this task as a direct mapping from video and question to answer, making them vulnerable to template shortcuts and spurious explanations. To address these issues, we propose EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. EchoTrust produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.

109. 【2604.06339】Evolution of Video Generative Foundations

Link: https://arxiv.org/abs/2604.06339

Authors: Teng Hu, Jiangning Zhang, Hongrui Huang, Ran Yi, Zihan Su, Jieyu Weng, Zhucun Xue, Lizhuang Ma, Ming-Hsuan Yang, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Intelligence Generated Content, Artificial Intelligence Generated, Generated Content, Artificial Intelligence, Intelligence Generated

Comments:

Abstract:The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI's Sora, Google's Veo3, and Bytedance's Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GAN) and diffusion models, or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field's evolution, especially regarding Auto-Regressive (AR) models and integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths/limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models, in this rapidly evolving field. For more details, please refer to the project at this https URL.

110. 【2604.06333】Drifting Fields are not Conservative

Link: https://arxiv.org/abs/2604.06333

Authors: Leonard Franz, Sebastian Hoffmann, Georg Martius

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate high-quality samples, transporting generated samples, single forward pass, vector valued drift, models generate high-quality

Comments: 19 pages, 7 figures

Abstract:Drifting models generate high-quality samples in a single forward pass by transporting generated samples toward the data distribution using a vector valued drift field. We investigate whether this procedure is equivalent to optimizing a scalar loss and find that, in general, it is not: drift fields are not conservative - they cannot be written as the gradient of any scalar potential. We identify the position-dependent normalization as the source of non-conservatism. The Gaussian kernel is the unique exception where the normalization is harmless and the drift field is exactly the gradient of a scalar function. Generalizing this, we propose an alternative normalization via a related kernel (the sharp kernel) which restores conservatism for any radial kernel, yielding well-defined loss functions for training drifting models. While we identify that the drifting field matching objective is strictly more general than loss minimization, as it can implement non-conservative transport fields that no scalar loss can reproduce, we observe that practical gains obtained utilizing this flexibility are minimal. We thus propose to train drifting models with the conceptually simpler formulations utilizing loss functions.
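
A vector field in the plane is the gradient of a scalar potential only if its curl vanishes, which gives a quick numerical way to probe the conservatism claim. The sketch below is a toy stand-in and not the paper's drift field: it assumes a normalized, kernel-weighted pull toward three hypothetical data points and estimates the curl by central differences. The names `mean_shift_field` and `curl_2d`, the data points, and the unit kernel bandwidths are all illustrative assumptions.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def mean_shift_field(x: Point, data: Sequence[Point],
                     log_kernel: Callable[[float], float]) -> Point:
    """Normalized kernel-weighted pull toward the data: sum_i w_i * (y_i - x)."""
    weights = [math.exp(log_kernel(math.dist(x, y))) for y in data]
    total = sum(weights)  # position-dependent normalization
    vx = sum(w * (y[0] - x[0]) for w, y in zip(weights, data)) / total
    vy = sum(w * (y[1] - x[1]) for w, y in zip(weights, data)) / total
    return (vx, vy)


def curl_2d(field: Callable[[Point], Point], x: Point, h: float = 1e-4) -> float:
    """Central-difference estimate of dVy/dx - dVx/dy at x."""
    dvy_dx = (field((x[0] + h, x[1]))[1] - field((x[0] - h, x[1]))[1]) / (2 * h)
    dvx_dy = (field((x[0], x[1] + h))[0] - field((x[0], x[1] - h))[0]) / (2 * h)
    return dvy_dx - dvx_dy


def gaussian_log_kernel(r: float) -> float:
    return -0.5 * r * r  # log of exp(-r^2 / 2)


def laplacian_log_kernel(r: float) -> float:
    return -r  # log of exp(-r)


data = [(1.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]  # hypothetical data points
query = (0.0, 0.0)
curl_gauss = curl_2d(lambda p: mean_shift_field(p, data, gaussian_log_kernel), query)
curl_laplace = curl_2d(lambda p: mean_shift_field(p, data, laplacian_log_kernel), query)
print(f"Gaussian kernel curl:  {curl_gauss:.2e}")
print(f"Laplacian kernel curl: {curl_laplace:.2e}")
```

Under these assumptions the Gaussian case should come out at zero up to finite-difference error, while the Laplacian case should not. That matches the abstract's statement that the Gaussian kernel is the exception in which the normalization is harmless.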

111. 【2604.06332】Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

Link: https://arxiv.org/abs/2604.06332

Authors: Parker Ewen, Dmitriy Rivkin, Mario Bijelic, Felix Heide

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: long-haul heavy trucks, satisfy braking distance, braking distance requirements, Autonomous highway driving, heavy trucks

Comments: Project website: https://light.princeton.edu/telescope

Abstract:Autonomous highway driving, especially for long-haul heavy trucks, requires detecting objects at long ranges beyond 500 meters to satisfy braking distance requirements at high speeds. At long distances, vehicles and other critical objects occupy only a few pixels in high-resolution images, causing state-of-the-art object detectors to fail. This challenge is compounded by the limited effective range of commercially available LiDAR sensors, which fall short of ultra-long range thresholds because of quadratic loss of resolution with distance, making image-based detection the most practically scalable solution given commercially available sensor constraints. We introduce Telescope, a two-stage detection model designed for ultra-long range autonomous driving. Alongside a powerful detection backbone, this model contains a novel re-sampling layer and image transformation to address the fundamental challenges of detecting small, distant objects. Telescope achieves 76% relative improvement in mAP in ultra-long range detection compared to state-of-the-art methods (improving from an absolute mAP of 0.185 to 0.326 at distances beyond 250 meters), requires minimal computational overhead, and maintains strong performance across all detection ranges.
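
The 76% figure is a relative gain, not an absolute one, and it can be checked directly from the two mAP values quoted in the abstract. The helper name `relative_improvement` below is an illustrative assumption.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, returned as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Absolute mAP values quoted in the abstract.
baseline_map = 0.185
telescope_map = 0.326
gain = relative_improvement(baseline_map, telescope_map)
print(f"absolute gain: {telescope_map - baseline_map:.3f} mAP")  # 0.141 mAP
print(f"relative gain: {gain:.1%}")                              # 76.2%
```

The absolute gain is 0.141 mAP, which is about 76% of the 0.185 baseline.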

112. 【2604.06285】Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

Link: https://arxiv.org/abs/2604.06285

Authors: Igor Maljkovic, Maria Rosaria Briglia, Iacopo Masi, Antonio Emanuele Cinà, Fabio Roli

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: shared embedding space, image synthesis, essential for tasks, retrieval by aligning, aligning textual

Comments: Paper accepted at ICLR 2026. Webpage available at: https://hype-vlm.github.io

Abstract:Vision-Language Models (VLMs) have become essential for tasks such as image synthesis, captioning, and retrieval by aligning textual and visual information in a shared embedding space. Yet, this flexibility also makes them vulnerable to malicious prompts designed to produce unsafe content, raising critical safety concerns. Existing defenses either rely on blacklist filters, which are easily circumvented, or on heavy classifier-based systems, both of which are costly and fragile under embedding-level attacks. We address these challenges with two complementary components: Hyperbolic Prompt Espial (HyPE) and Hyperbolic Prompt Sanitization (HyPS). HyPE is a lightweight anomaly detector that leverages the structured geometry of hyperbolic space to model benign prompts and detect harmful ones as outliers. HyPS builds on this detection by applying explainable attribution methods to identify and selectively modify harmful words, neutralizing unsafe intent while preserving the original semantics of user prompts. Through extensive experiments across multiple datasets and adversarial scenarios, we prove that our framework consistently outperforms prior defenses in both detection accuracy and robustness. Together, HyPE and HyPS offer an efficient, interpretable, and resilient approach to safeguarding VLMs against malicious prompt misuse.

113. 【2604.06254】SE-Enhanced ViT and BiLSTM-Based Intrusion Detection for Secure IIoT and IoMT Environments

Link: https://arxiv.org/abs/2604.06254

Authors: Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari, Seref Sagiroglu, Onur Ceran

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Internet of Things, Industrial and Medical, Medical Internet, devices in Industrial, Long Short-Term Memory

Comments:

Abstract:With the rapid growth of interconnected devices in Industrial and Medical Internet of Things (IIoT and IoMT) ecosystems, ensuring timely and accurate detection of cyber threats has become a critical challenge. This study presents an advanced intrusion detection framework based on a hybrid Squeeze-and-Excitation Attention Vision Transformer-Bidirectional Long Short-Term Memory (SE ViT-BiLSTM) architecture. In this design, the traditional multi-head attention mechanism of the Vision Transformer is replaced with Squeeze-and-Excitation attention and integrated with BiLSTM layers to enhance detection accuracy and computational efficiency. The proposed model was trained and evaluated on two real-world benchmark datasets, EdgeIIoT and CICIoMT2024, both before and after data balancing using the Synthetic Minority Over-sampling Technique (SMOTE) and RandomOverSampler. Experimental results demonstrate that the SE ViT-BiLSTM model outperforms existing approaches across multiple metrics. Before balancing, the model achieved accuracies of 99.11% (FPR: 0.0013%, latency: 0.00032 sec/inst) on EdgeIIoT and 96.10% (FPR: 0.0036%, latency: 0.00053 sec/inst) on CICIoMT2024. After balancing, performance further improved, reaching 99.33% accuracy with 0.00035 sec/inst latency on EdgeIIoT and 98.16% accuracy with 0.00014 sec/inst latency on CICIoMT2024.

114. 【2604.06250】DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

Link: https://arxiv.org/abs/2604.06250

Authors: Dikshant Kukreja, Kshitij Sah, Karan Goyal, Mukesh Mohania, Vikram Goyal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Model correctly identifies, Vision-Language Model correctly, asked to describe, group. When asked, correctly identifies

Comments:

Abstract:When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes -- Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description -- yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) Closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model and benchmark agnostic, applicable post-hoc to any VLM evaluation to diagnose integration failures.

115. 【2604.06246】No-reference based automatic parameter optimization for iterative reconstruction using a novel search space aware crow search algorithm

Link: https://arxiv.org/abs/2604.06246

Authors: Poorya MohammadiNasab, Ander Biguri, Philipp Steininger, Peter Keuschnigg, Lukas Lamminger, Agnieszka Lach, S M Ragib Shahriar Islam, Anna Breger, Clemens Karner, Carola-Bibiane Schönlieb, Wolfgang Birkfellner, Sepideh Hatamikia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted significant attention, reduce radiation exposure, reconstruction technique ability, Iterative reconstruction technique, significant attention

Comments:

Abstract:The ability of iterative reconstruction techniques to reduce radiation exposure by using fewer projections has attracted significant attention. However, these methods typically require precise tuning of several hyperparameters, which can have a major impact on reconstruction quality. Manually setting these parameters is time-consuming and increases the workload for human operators. In this paper, we introduce a novel fully automatic parameter optimization framework that can be applied to a wide range of Cone-beam computed tomography (CBCT) iterative reconstruction algorithms to determine optimal parameters without requiring a reference reconstruction. The proposed method incorporates a modified crow search algorithm (CSA) featuring a superior set-dependent local search mechanism, a search-space-aware global search strategy, and an objective-driven balance between local and global search. Additionally, to ensure an effective initial population, we propose a chaotic diagonal linear uniform initialization scheme that accelerates algorithm convergence. The performance of the proposed framework was evaluated on three imaging machines and four real datasets, as well as three different iterative reconstruction methods with the highest number of tunable parameters, representing the most challenging scenario. The results indicate that the proposed method could outperform manual settings and CSA, with a 4.19% improvement in average fitness and 4.89% and 3.82% improvements on CHILL@UK and RPI_AXIS, respectively, which are two benchmark no-reference learning-based quality metrics. In addition, the qualitative results clearly show the superiority of the proposed method in sharply preserving fine details. The overall performance of the proposed framework across different comparison scenarios demonstrates its effectiveness and robustness across all cases.

116. 【2604.06245】CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale

Link: https://arxiv.org/abs/2604.06245

Authors: Jichao Fang, Lei Zhang, Michael Phillips, Wei Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Impact craters, planetary surface analysis, Impact, surface analysis, self-supervised Vision Transformers

Comments: Accepted at the EarthVision 2026 Workshop at CVPR 2026

Abstract:Impact craters are a cornerstone of planetary surface analysis. However, while most deep learning pipelines treat craters solely as a detection problem, critical scientific workflows such as catalog deduplication, cross-observation matching, and morphological analog discovery are inherently retrieval tasks. To address this, we formulate crater analysis as an instance-level image retrieval problem and introduce CraterBench-R, a curated benchmark featuring about 25,000 crater identities with multi-scale gallery views and manually verified queries spanning diverse scales and contexts. Our baseline evaluations across various architectures reveal that self-supervised Vision Transformers (ViTs), particularly those with in-domain pretraining, dominate the task, outperforming generic models with significantly more parameters. Furthermore, we demonstrate that retaining multiple ViT patch tokens for late-interaction matching dramatically improves accuracy over standard single-vector pooling. However, storing all tokens per image is operationally inefficient at a planetary scale. To close this efficiency gap, we propose instance-token aggregation, a scalable, training-free method that selects K seed tokens, assigns the remaining tokens to these seeds via cosine similarity, and aggregates each cluster into a single representative token. This approach yields substantial gains: at K=16, aggregation improves mAP by 17.9 points over raw token selection, and at K=64, it matches the accuracy of using all 196 tokens with significantly less storage. Finally, we demonstrate that a practical two-stage pipeline, with single-vector shortlisting followed by instance-token reranking, recovers 89-94% of the full late-interaction accuracy while searching only a small candidate set. The benchmark is publicly available at this http URL.
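
The three steps of instance-token aggregation named in the abstract (select K seed tokens, assign the remaining tokens to seeds by cosine similarity, aggregate each cluster into one token) can be sketched in plain Python as follows. The abstract does not say how seeds are chosen or how a cluster is reduced, so this sketch assumes the seed indices are supplied by the caller and that each cluster is averaged with its seed included. The function names and toy 2-D tokens are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def aggregate_tokens(tokens: Sequence[Vector],
                     seed_indices: Sequence[int]) -> List[Vector]:
    """Reduce N tokens to K = len(seed_indices) representative tokens."""
    # Each cluster starts with its own seed token.
    clusters: List[List[Vector]] = [[list(tokens[i])] for i in seed_indices]
    seed_set = set(seed_indices)
    for idx, token in enumerate(tokens):
        if idx in seed_set:
            continue
        # Assign the token to the seed with the highest cosine similarity.
        best = max(range(len(seed_indices)),
                   key=lambda k: cosine(token, tokens[seed_indices[k]]))
        clusters[best].append(list(token))
    dim = len(tokens[0])
    # Assumed aggregation rule: arithmetic mean of each cluster.
    return [[sum(v[d] for v in cluster) / len(cluster) for d in range(dim)]
            for cluster in clusters]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8], [1.0, 0.2]]
seeds = [0, 2]  # hypothetical seed choice, K = 2
representatives = aggregate_tokens(tokens, seeds)
print(representatives)
```

Here five tokens collapse to two representatives, so per-image storage scales with K and not with the full token count. That is the trade-off the abstract reports at K=16 and K=64.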

117. 【2405.03420】Implantable Adaptive Cells: A Novel Enhancement for Pre-Trained U-Nets in Medical Image Segmentation

Link: https://arxiv.org/abs/2405.03420

Authors: Emil Benedykciuk, Marcin Denkowski, Grzegorz Wójcik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Neural Architecture Search, pre-trained neural networks, gradient-based Neural Architecture, Implantable Adaptive Cell, pre-trained neural

Comments:

Abstract:This paper introduces a novel approach to enhance the performance of pre-trained neural networks in medical image segmentation using gradient-based Neural Architecture Search (NAS) methods. We present the concept of the Implantable Adaptive Cell (IAC): small modules identified through a Partially-Connected DARTS-based approach and designed to be injected into the skip connections of an existing, already trained U-shaped model. Unlike traditional NAS methods, our approach refines existing architectures without full retraining. Experiments on four medical datasets with MRI and CT images show consistent accuracy improvements on various U-Net configurations, with segmentation accuracy gains of approximately 5 percentage points across all validation datasets and improvements of up to 11 percentage points in the best-performing cases. The findings of this study not only offer a cost-effective alternative to the complete overhaul of complex models for performance upgrades but also indicate the potential applicability of our method to other architectures and problem domains.

118. 【2604.07248】TurPy: a physics-based and differentiable optical turbulence simulator for algorithmic development and system optimization

Link: https://arxiv.org/abs/2604.07248

Authors: Joseph L. Greene, Alfred Moore, Iris Ochoa, Emily Kwan, Patrick Marano, Christopher R. Valenta

Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Developing optical systems, free-space applications requires, accurately capture turbulence-induced, capture turbulence-induced wavefront, turbulence-induced wavefront distortions

Comments: 19 pages, 7 figures, 1 table. Presented at 2026 SPIE DS Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications IV

Abstract:Developing optical systems for free-space applications requires simulation tools that accurately capture turbulence-induced wavefront distortions and support gradient-based optimization. Here we introduce TurPy, a GPU-accelerated, fully differentiable wave optics turbulence simulator to bridge high fidelity simulation with end-to-end optical system design. TurPy incorporates subharmonic phase screen generation, autoregressive temporal evolution, and an automated screen placement routine balancing Fourier aliasing constraints and weak-turbulence approximations into a unified, user-ready framework. Because TurPy's phase screen generation is parameterized through a media-specific power spectral density, the framework extends to atmospheric, oceanic, and biological propagation environments with minimal modification. We validate TurPy against established atmospheric turbulence theory by matching 2nd order Gaussian beam broadening and 4th order plane wave scintillation to closed-form models with 98% accuracy across weak to strong turbulence regimes, requiring only the medium's refractive index structure constant and power spectral density as inputs. To demonstrate TurPy as a gradient-based training platform, we optimize a diffractive deep neural network (D2NN) with a two-mask dual-domain architecture to recover a Gaussian beam from a weakly turbulent path, achieving a more than 20x reduction in scintillation relative to an uncompensated receiver in simulation. TurPy is released as an open-source package to support synthetic data generation, turbulence-informed algorithm development, and the end-to-end design of optical platforms operating in turbulent environments.

119. 【2604.07037】Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training

Link: https://arxiv.org/abs/2604.07037

Authors: Saúl Alonso-Monsalve, Fabio Cufino, Umut Kose, Anna Mascellani, André Rubbia

Categories: High Energy Physics - Experiment (hep-ex); Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce exceptionally dense, Accelerator-based neutrino physics, overlapping detector signatures, exceptionally dense, Accelerator-based neutrino

Comments: 18 pages, 6 figures

Abstract:Accelerator-based neutrino physics is entering an energy-frontier regime in which interactions reach the TeV scale and produce exceptionally dense, overlapping detector signatures. In this regime, event interpretation becomes impractical for conventional reconstruction approaches, particularly when labelled data are scarce and the analysis spans diverse downstream objectives. We present a sparse ViT framework for learning reusable representations from heterogeneous detector data. Self-supervised pre-training combines masked autoencoder reconstruction with relational voxel-level objectives for hierarchy, ghost and particle identification, and the resulting shared encoder is then jointly fine-tuned across classification and regression tasks. Evaluated on simulated events from the proposed FASERCal concept at the LHC, we find that pre-training consistently improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction over training from scratch, with the addition of relational objectives yielding further gains in the most topologically complex channels. Interpretability analyses further show that pre-training yields a more structured latent space, while detector-subsystem ablations recover physically plausible channel-dependent roles for the heterogeneous inputs. A data-efficiency study shows that, with roughly $10^3$ labelled events, the pre-trained encoder already matches the flavour-classification performance of a randomly initialised model trained on an order of magnitude more data. The learned representations also transfer effectively to publicly available benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines. These results support self-supervised pre-training on multimodal detector data as a scalable route towards reusable representations for neutrino and particle-detector analysis.

120. 【2604.06816】Enhanced Self-Supervised Multi-Image Super-Resolution for Camera Array Images

Link: https://arxiv.org/abs/2604.06816

Authors: Yating Chen, Feng Huang, Xianyu Wu, Jing Wu, Ying Shen

Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional multi-image super-resolution, Conventional multi-image, multi-image super-resolution, rely on sequential, single camera

Abstract: Conventional multi-image super-resolution (MISR) methods, such as burst and video SR, rely on sequential frames from a single camera. Consequently, they suffer from complex image degradation and severe occlusion, increasing the difficulty of accurate image restoration. In contrast, multi-aperture camera-array imaging captures spatially distributed views with sampling offsets forming a stable disk-like distribution, which enhances the non-redundancy of observed data. Existing MISR algorithms fail to fully exploit these unique properties. Supervised MISR methods tend to overfit the degradation patterns in training data, and current self-supervised learning (SSL) techniques struggle to recover fine-grained details. To address these issues, this paper thoroughly investigates the strengths, limitations and applicability boundaries of multi-image-to-single-image (Multi-to-Single) and multi-image-to-multi-image (Multi-to-Multi) SSL methods. We propose the Multi-to-Single-Guided Multi-to-Multi SSL framework, which combines the advantages of Multi-to-Single and Multi-to-Multi to generate visually appealing, high-fidelity images rich in texture details. The framework provides a new paradigm for integrating deep neural networks with classical physics-based variational methods. To enhance the ability of the MISR network to recover high-frequency details from aliased artifacts, this paper proposes a novel camera-array SR network, called dual Transformer, that is suitable for SSL. Experiments on synthetic and real-world datasets demonstrate the superiority of the proposed method.

121. 【2604.06671】4D Vessel Reconstruction for Benchtop Thrombectomy Analysis

Link: https://arxiv.org/abs/2604.06671

Authors: Ethan Nguyen, Javier Carmona, Arisa Matsuzaki, Naoki Kaneko, Katsushi Arisaka

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Mechanical thrombectomy, procedure-related injury, Introduction, Mechanical, stress proxy

Comments: 20 pages, 10 figures, 1 table, supplementary material (3 tables, 3 figures, and 11 videos). Project page: [this https URL](https://ethanuser.github.io/vessel4D/)

Abstract:Introduction: Mechanical thrombectomy can cause vessel deformation and procedure-related injury. Benchtop models are widely used for device testing, but time-resolved, full-field 3D vessel-motion measurements remain limited. Methods: We developed a nine-camera, low-cost multi-view workflow for benchtop thrombectomy in silicone middle cerebral artery phantoms (2160p, 20 fps). Multi-view videos were calibrated, segmented, and reconstructed with 4D Gaussian Splatting. Reconstructed point clouds were converted to fixed-connectivity edge graphs for region-of-interest (ROI) displacement tracking and a relative surface-based stress proxy. Stress-proxy values were derived from edge stretch using a Neo-Hookean mapping and reported as comparative surface metrics. A synthetic Blender pipeline with known deformation provided geometric and temporal validation. Results: In synthetic bulk translation, the stress proxy remained near zero for most edges (median $\approx$ 0 MPa; 90th percentile 0.028 MPa), with sparse outliers. In synthetic pulling (1-5 mm), reconstruction showed close geometric and temporal agreement with ground truth, with symmetric Chamfer distance of 1.714-1.815 mm and precision of 0.964-0.972 at $\tau = 1$ mm. In preliminary benchtop comparative trials (one trial per condition), cervical aspiration catheter placement showed higher max-median ROI displacement and stress-proxy values than internal carotid artery terminus placement. Conclusion: The proposed protocol provides standardized, time-resolved surface kinematics and comparative relative displacement and stress proxy measurements for thrombectomy benchtop studies. The framework supports condition-to-condition comparisons and methods validation, while remaining distinct from absolute wall-stress estimation. Implementation code and example data are available at this https URL
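
The geometric validation numbers in this abstract (symmetric Chamfer distance, precision at a threshold tau) can be illustrated with a small brute-force sketch. Conventions differ between papers (summing or averaging the two directions, squared or unsquared distances), so the choice below, the sum of the two mean nearest-neighbour distances, is an assumption rather than the authors' definition, and all function names are hypothetical.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from one point to its nearest neighbour in a cloud."""
    return min(math.dist(point, q) for q in cloud)


def symmetric_chamfer(cloud_a, cloud_b):
    """Sum of the mean nearest-neighbour distances in both directions."""
    a_to_b = sum(nearest_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a


def precision_at(reconstructed, ground_truth, tau):
    """Fraction of reconstructed points within tau of the ground truth."""
    hits = sum(1 for p in reconstructed if nearest_distance(p, ground_truth) <= tau)
    return hits / len(reconstructed)
```

With coordinates in millimetres, `precision_at(reconstruction, truth, 1.0)` corresponds to the "precision at tau = 1 mm" style of metric. The brute-force search is quadratic in the number of points, so real point clouds would need a spatial index.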

122. 【2604.06648】Euclid Quick Data Release (Q1). AgileLens: A scalable CNN-based pipeline for strong gravitational lens identification

Link: https://arxiv.org/abs/2604.06648

Authors: Euclid Collaboration: X. Xu, R. Chen, T. Li, A. R. Cooray, S. Schuldt, J. A. Acevedo Barroso, D. Stern, D. Scott, M. Meneghetti, G. Despali, J. Chopra, Y. Cao, M. Cheng, J. Buda, J. Zhang, J. Furumizo, R. Valencia, Z. Jiang, C. Tortora, N. E. P. Lines, T. E. Collett, S. Fotopoulou, A. Galan, A. Manjón-García, R. Gavazzi, L. Iwamoto, S. Kruk, M. Millon, P. Nugent, C. Saulder, D. Sluse, J. Wilde, M. Walmsley, F. Courbin, R. B. Metcalf, B. Altieri, A. Amara, S. Andreon, N. Auricchio, C. Baccigalupi, M. Baldi, A. Balestra, S. Bardelli, P. Battaglia, R. Bender, A. Biviano, E. Branchini, M. Brescia, S. Camera, V. Capobianco, C. Carbone, V. F. Cardone, J. Carretero, S. Casas, M. Castellano, G. Castignani, S. Cavuoti, A. Cimatti, C. Colodro-Conde, G. Congedo, C. J. Conselice, L. Conversi, Y. Copin, H. M. Courtois, M. Cropper, A. Da Silva, H. Degaudenzi, G. De Lucia, C. Dolding, H. Dole, F. Dubath, X. Dupac, S. Dusini, S. Escoffier, M. Farina, R. Farinelli, S. Farrens, S. Ferriol, F. Finelli, P. Fosalba, M. Frailis, E. Franceschi, M. Fumana, S. Galeotta, K. George, W. Gillard, B. Gillis, C. Giocoli, P. Gómez-Alvarez, J. Gracia-Carpio, A. Grazian, F. Grupp, S. V. H. Haugan, W. Holmes, F. Hormuth, A. Hornstrup, K. Jahnke, M. Jhabvala, B. Joachimi

Categories: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)

Keywords: galaxy lensing systems, strong galaxy, imaging data, NISP colour contrast, pipeline for efficient

Comments: 30 pages, 16 figures

Abstract:We present an end-to-end, iterative pipeline for efficient identification of strong galaxy--galaxy lensing systems, applied to the Euclid Q1 imaging data. Starting from VIS catalogues, we reject point sources, apply a magnitude cut (I$_E$ $\leq$ 24) on deflectors, and run a pixel-level artefact/noise filter to build 96 $\times$ 96 pix cutouts; VIS+NISP colour composites are constructed with a VIS-anchored luminance scheme that preserves VIS morphology and NISP colour contrast. A VIS-only seed classifier supplies clear positives and typical impostors, from which we curate a morphology-balanced negative set and augment scarce positives. Among the six CNNs studied initially, a modified VGG16 (GlobalAveragePooling + 256/128 dense layers with the last nine layers trainable) performs best; the training set grows from 27 seed lenses (augmented to 1809) plus 2000 negatives to a colour dataset of 30,686 images. After three rounds of iterative fine-tuning, human grading of the top 4000 candidates ranked by the final model yields 441 Grade A/B candidate lensing systems, including 311 overlapping with the existing Q1 strong-lens catalogue, and 130 additional A/B candidates (9 As and 121 Bs) not previously reported. Independently, the model recovers 740 out of 905 (81.8%) candidate Q1 lenses within its top 20,000 predictions, considering off-centred samples. Candidates span I$_E$ $\simeq$ 17--24 AB mag (median 21.3 AB mag) and are redder in Y$_E$--H$_E$ than the parent population, consistent with massive early-type deflectors. Each training iteration required a week for a small team, and the approach easily scales to future Euclid releases; future work will calibrate the selection function via lens injection, extend recall through uncertainty-aware active learning, explore multi-scale or attention-based neural networks with fast post-hoc vetters that incorporate lens models into the classification.
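
The candidate counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The numbers below are taken directly from the abstract; the variable names and the "yield" ratio are illustrative additions, not quantities the authors report.

```python
def recall(recovered, total):
    """Fraction of known candidates that the model recovered."""
    return recovered / total


# New Grade A and Grade B candidates not previously reported.
grade_a_new, grade_b_new = 9, 121
new_candidates = grade_a_new + grade_b_new

# Candidates overlapping with the existing Q1 strong-lens catalogue.
overlap_with_catalogue = 311
total_ab = overlap_with_catalogue + new_candidates

# 740 of 905 candidate Q1 lenses recovered within the top 20,000 predictions.
recall_top_20000 = recall(740, 905)

# Share of the 4000 human-graded candidates that received Grade A or B.
yield_top_4000 = total_ab / 4000

print(f"new A/B candidates: {new_candidates}")
print(f"total A/B candidates: {total_ab}")
print(f"recall in top 20,000: {100 * recall_top_20000:.1f}%")
print(f"A/B yield in top 4000: {100 * yield_top_4000:.1f}%")
```

The sums reproduce the 130 new and 441 total A/B candidates, and 740/905 rounds to the stated 81.8%.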

123. 【2604.06568】A Noise Constrained Diffusion (NC-Diffusion) Framework for High Fidelity Image Compression

Link: https://arxiv.org/abs/2604.06568

Authors: Zhenyu Du, Yanbo Gao, Shuai Li, Yiyang Li, Hui Yuan, Mao Ye

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracting increasing interests, Noise Constrained Diffusion, image compression, noise, increasing interests

Comments: Accepted by IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Abstract: With the great success of diffusion models in image generation, diffusion-based image compression is attracting increasing interest. However, because of the random noise introduced in diffusion learning, these methods usually produce reconstructions that deviate from the original images, leading to suboptimal compression results. To address this problem, we propose a Noise Constrained Diffusion (NC-Diffusion) framework for high fidelity image compression. Unlike existing diffusion-based compression methods that add random Gaussian noise and direct the noise into the image space, the proposed NC-Diffusion formulates the quantization noise originally added in the learned image compression as the noise in the forward process of diffusion. Then a noise constrained diffusion process is constructed from the ground-truth image to the initial compression result generated with quantization noise. The NC-Diffusion overcomes the problem of noise mismatch between compression and diffusion, significantly improving the inference efficiency. In addition, an adaptive frequency-domain filtering module is developed to enhance the skip connections in the U-Net based diffusion architecture, in order to enhance high-frequency details. Moreover, a zero-shot sample-guided enhancement method is designed to further improve the fidelity of the image. Experiments on multiple benchmark datasets demonstrate that our method can achieve the best performance compared with existing methods.

124. 【2604.06564】CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation

Link: https://arxiv.org/abs/2604.06564

Authors: Yiyang Li, Yanbo Gao, Shuai Li, Zhenyu Du, Jinglin Zhang, Hui Yuan, Mao Ye, Xingyu Gao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Implicit Neural Video, Neural Video Representation, neural network, Implicit Neural, Video Representation

Comments: Accepted by IEEE Transactions on Multimedia

Abstract: Implicit Neural Video Representation (INVR) has emerged as a novel approach for video representation and compression, using learnable grids and neural networks. Existing methods focus on developing new grid structures efficient for latent representation and neural network architectures with large representation capability, but lack a study of their roles in video representation. In this paper, the difference between INVR based on neural network and INVR based on grid is first investigated from the perspective of video information composition to specify their own advantages, i.e., neural network for general structure while grid for specific detail. Accordingly, an INVR based on mixed neural network and residual grid framework is proposed, where the neural network is used to represent the regular and structured information and the residual grid is used to represent the remaining irregular information in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module is specifically designed to explicitly represent the regular and structured information, hence the name CWRNN-INVR. For the irregular information, a mixed residual grid is learned where the irregular appearance and motion information are represented together. The mixed residual grid can be combined with the coupled WarpRNN in a way that allows for network reuse. Experiments show that our method achieves the best reconstruction results compared with the existing methods, with an average PSNR of 33.73 dB on the UVG dataset under the 3M model, and outperforms existing INVR methods in other downstream tasks. The code can be found at this https URL.
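
The headline metric in this abstract is an average PSNR in dB. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). It assumes 8-bit pixel values (peak 255) and a per-frame average; the abstract does not state how the authors aggregate over frames, and the function names are hypothetical.

```python
import math


def mse(reference, reconstruction):
    """Mean squared error between two equally sized flat pixel sequences."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((r - x) ** 2 for r, x in zip(reference, reconstruction))
    return total / len(reference)


def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(reference, reconstruction)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)


def average_psnr(frame_pairs, peak=255.0):
    """Mean of per-frame PSNR values (assumed aggregation)."""
    values = [psnr(ref, rec, peak) for ref, rec in frame_pairs]
    return sum(values) / len(values)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.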

125. 【2604.06518】Adaptive Differential Privacy for Federated Medical Image Segmentation Across Diverse Modalities

Link: https://arxiv.org/abs/2604.06518

Authors: Puja Saha, Eranga Ukwatta

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large volumes, data remain underutilized, centralizing distributed data, strict privacy regulations, Federated learning

Comments: 10 pages, 8 figures. Accepted in SPIE Medical Imaging 2026. Recipient of CAD Best Paper Award: 1st Place, and Robert F. Wagner All-Conference Best Paper Award: Finalist

Abstract:Large volumes of medical data remain underutilized because centralizing distributed data is often infeasible due to strict privacy regulations and institutional constraints. In addition, models trained in centralized settings frequently fail to generalize across clinical sites because of heterogeneity in imaging protocols and continuously evolving data distributions arising from differences in scanners, acquisition parameters, and patient populations. Federated learning offers a promising solution by enabling collaborative model training without sharing raw data. However, incorporating differential privacy into federated learning, while essential for privacy guarantees, often leads to degraded accuracy, unstable convergence, and reduced generalization. In this work, we propose an adaptive differentially private federated learning (ADP-FL) framework for medical image segmentation that dynamically adjusts privacy mechanisms to better balance the privacy-utility trade-off. The proposed approach stabilizes training, significantly improves Dice scores and segmentation boundary quality, and maintains rigorous privacy guarantees. We evaluated ADP-FL across diverse imaging modalities and segmentation tasks, including skin lesion segmentation in dermoscopic images, kidney tumor segmentation in 3D CT scans, and brain tumor segmentation in multi-parametric MRI. Compared with conventional federated learning and standard differentially private federated learning, ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability, with performance approaching that of non-private federated learning under the same privacy budgets. These results demonstrate the practical viability of ADP-FL for high-performance, privacy-preserving medical image segmentation in real-world federated settings.
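
For readers unfamiliar with how differential privacy usually enters federated learning, the sketch below shows the generic fixed-parameter recipe: clip each client update to a bounded L2 norm, add Gaussian noise scaled to that bound, then average. This is the conventional baseline, not the ADP-FL mechanism, whose adaptive rule the abstract does not describe. All names and parameters (`clip_norm`, `noise_multiplier`) are illustrative assumptions.

```python
import math
import random


def l2_norm(vector):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def clip_update(update, clip_norm):
    """Scale the update down so its L2 norm is at most clip_norm."""
    norm = l2_norm(update)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize_update(update, clip_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise with std = noise_multiplier * clip_norm."""
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]


def average_updates(updates):
    """Coordinate-wise mean of equally sized client updates."""
    count = len(updates)
    return [sum(column) / count for column in zip(*updates)]
```

An adaptive scheme would vary `clip_norm` or `noise_multiplier` during training instead of fixing them. Translating those settings into a formal privacy budget requires a privacy accountant, which this sketch omits.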

126. 【2604.06276】Structural Regularities of Cinema SDR-to-HDR Mapping in a Controlled Mastering Workflow: A Pixel-wise Case Study on ASC StEM2

Link: https://arxiv.org/abs/2604.06276

Authors: Xin Zhang, Xiaoyi Chen

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: rare common-source dataset, mapping using ASC, HDR release masters, SDR release masters, empirical case study

Comments: 15 pages, 6 figures. Empirical case study on cinema SDR-to-HDR mapping using ASC StEM2

Abstract:We present an empirical case study of cinema SDR-to-HDR mapping using ASC StEM2, a rare common-source dataset containing EXR scene-referred images and matched SDR/HDR cinema release masters from the same ACES-based mastering workflow. Based on pixel-wise statistics over all 18,580 frames of the test film, we construct a three-domain comparison involving EXR source data, SDR release masters, and HDR release masters to characterize their luminance and color structural relationships within this controlled workflow. In the luminance dimension, SDR and HDR masters exhibit a highly stable global monotonic correspondence, with geometric structure remaining largely consistent overall; sparse and structured deviations appear in self-luminous highlights and specific material regions. In the color dimension, the two masters remain largely consistent in hue, with saturation exhibiting a redistribution pattern of shadow suppression, midtone expansion, and highlight convergence. Using EXR as a scene-referred anchor, we further define a pixel-level decision map that operationally separates EXR-closer recovery regions from content-adaptive adjustment regions. Under this operational definition, 82.4% of sampled image regions are classified as EXR-closer recovery, while the remainder require localized adaptive adjustment. Rather than claiming a universal law for all cinema mastering pipelines, the study provides an interpretable quantitative baseline for structure-aware SDR-to-HDR analysis and for designing learning-based models under shared-source mastering conditions.

127. 【2604.06180】MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis

Link: https://arxiv.org/abs/2604.06180

Authors: Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Large Multimodal Models, Large Multimodal, gained increasing attention, increasing attention due, providing precise diagnoses

Abstract: Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to the capability of these models to provide precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable for the wide range of medical conditions in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area typically rely on static or predefined selection of specialists and cannot adapt to changing practical scenarios. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework that comprises a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming the state-of-the-art baselines. Our work lays a strong foundation for future research. Code and models are available at this https URL.

128. 【2509.10554】MAE-SAM2: Mask Autoencoder-Enhanced SAM2 for Clinical Retinal Vascular Leakage Segmentation

Link: https://arxiv.org/abs/2509.10554

Authors: Xin Xing, Irmak Karaca, Amir Akhavanrezayat, Samira Badrloo, Quan Dong Nguyen, Mahadevan Subramaniam

Categories: Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: fluorescein angiography images, retinal vascular leakage, vascular leakage segmentation, angiography images, retinal vascular

Abstract: We propose MAE-SAM2, a novel foundation model for retinal vascular leakage segmentation on fluorescein angiography images. The small size and dense distribution of the leakage areas, along with the limited availability of labeled clinical data, make this a challenging segmentation task. Our approach integrates a self-supervised learning (SSL) strategy, Masked Autoencoder (MAE), with SAM2. In our implementation, we explore different loss functions and arrive at a task-specific combined loss. Extensive experiments and ablation studies demonstrate that MAE-SAM2 outperforms several state-of-the-art models, achieving the highest Dice score and Intersection-over-Union (IoU). Compared to the original SAM2, our model achieves a $5\%$ performance improvement, highlighting the promise of foundation models with self-supervised pretraining in clinical imaging tasks.
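
The two metrics named in this abstract, Dice score and Intersection-over-Union, are both overlap ratios between a predicted and a reference mask. The sketch below computes them for flat binary masks. It is a generic illustration rather than the paper's evaluation code, and returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

The two scores are monotonically related through Dice = 2 IoU / (1 + IoU), so for a single mask pair they always rank predictions the same way, although averages over a dataset can differ.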