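This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

644 papers were updated today, including:

Natural language processing: 76 papers
Information retrieval: 17 papers
Computer vision: 135 papers

Natural Language Processing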

1. 【2603.30035】Reward-Based Online LLM Routing via NeuralUCB

Link: https://arxiv.org/abs/2603.30035

Authors: Ming-Hua Tsai, Phat Tran

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language model, cost-aware large language, language model, study investigates, large language

Abstract:This study investigates the use of NeuralUCB for cost-aware large language model (LLM) routing. Existing routing approaches can be broadly grouped into supervised routing methods and partial-feedback methods, each with different tradeoffs in efficiency and adaptivity. We implement a NeuralUCB-based routing policy and evaluate it on RouterBench under a simulated online setting. Experimental results show that the proposed method consistently outperforms random and min-cost baselines in utility reward. Compared with the max-quality reference, our method achieves substantially lower inference cost while maintaining competitive reward. These findings suggest that NeuralUCB is a promising approach for cost-aware LLM routing, while also highlighting remaining challenges in action discrimination and exploration.
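
To make the cost-aware routing idea concrete, here is a minimal sketch of an online router that trades quality against cost. It is not the paper's method: it uses plain UCB1 (no neural network and no query context) as a simplified stand-in for NeuralUCB, and the utility definition `quality - cost_weight * cost`, the function name `route_with_ucb`, and the model table are all assumptions made for illustration.

```python
import math
import random


def route_with_ucb(models, queries, cost_weight=1.0, explore=1.0, seed=0):
    """Simulate online routing with UCB1 (a simplified stand-in for NeuralUCB).

    models: dict mapping a model name to (mean_quality, cost); hypothetical values.
    The reward for each routed query is an assumed utility:
        quality - cost_weight * cost
    Returns how many queries were routed to each model.
    """
    rng = random.Random(seed)
    names = list(models)
    counts = {name: 0 for name in names}
    totals = {name: 0.0 for name in names}

    for t in range(1, queries + 1):
        untried = [name for name in names if counts[name] == 0]
        if untried:
            # Try every model once before trusting any estimate.
            choice = untried[0]
        else:
            # Pick the model with the highest optimistic utility estimate.
            choice = max(
                names,
                key=lambda n: totals[n] / counts[n]
                + explore * math.sqrt(2.0 * math.log(t) / counts[n]),
            )
        mean_quality, cost = models[choice]
        # Simulated feedback: the answer is judged correct with prob. mean_quality.
        quality = 1.0 if rng.random() < mean_quality else 0.0
        counts[choice] += 1
        totals[choice] += quality - cost_weight * cost

    return counts


if __name__ == "__main__":
    # Hypothetical models: (probability of a good answer, normalized cost).
    table = {"small": (0.5, 0.1), "large": (0.9, 0.2)}
    print(route_with_ucb(table, queries=2000))
```

Only the chosen model's reward is observed on each step, which is the partial-feedback setting the abstract mentions. A contextual method such as NeuralUCB would additionally condition the estimate on the query itself.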

2. 【2603.30032】Covertly improving intelligibility with data-driven adaptations of speech timing

Link: https://arxiv.org/abs/2603.30032

Authors: Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Human talkers, speech, language-comprehension challenges, non-native adults, talkers often address

Abstract:Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (ex. the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech rate structure not only facilitates L2 listeners' comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech which can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.

3. 【2603.30025】ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection

Link: https://arxiv.org/abs/2603.30025

Authors: Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga

Categories: Computation and Language (cs.CL)

Keywords: Verifiable claim detection, claim detection, claim, expresses a factual, factual statement

Abstract:Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings. Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.

4. 【2603.30002】Tracking Equivalent Mechanistic Interpretations Across Neural Networks

Link: https://arxiv.org/abs/2603.30002

Authors: Alan Sun, Mariya Toneva

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: interpreting neural networks, Mechanistic interpretability, neural networks, interpreting neural, Mechanistic

Comments: 32 pages, 5 figures, ICLR 2026

Abstract:Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and case study its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.

5. 【2603.29997】Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives

Link: https://arxiv.org/abs/2603.29997

Authors: Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: problem-solving and argumentation, driver of human, human generalization, generalization in problem-solving, Analogical reasoning

Abstract:Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs' performance is sensitive to prompt format and the degree of surface similarity between narratives. This gap motivates a key question: What is the impact of enhancing structural mapping with LLM-derived abstractions on their analogical reasoning ability in narratives? To that end, we propose a modular framework named YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units, abstract these units, and then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. We define and operationalize four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. Our experiments reveal that abstractions consistently improve model performance, resulting in competitive or better performance than end-to-end LLM baselines. Closer error analysis reveals the remaining challenges in abstraction at the right level, in incorporating implicit causality, and an emerging categorization of analogical patterns in narratives. YARN enables systematic variation of experimental settings to analyze component contributions, and to support future work, we make the code for YARN openly available.

6. 【2603.29979】Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior

Link: https://arxiv.org/abs/2603.29979

Authors: Junwei Yu, Mufeng Yang, Yepeng Ding, Hiroyuki Sato

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: direct answer generation, AI-powered search engines, traditional link-based retrieval, Generative Engine Optimization, Generative Engine

Comments: 12 pages, 5 figures. This paper proposes GEO-SFE, a structural feature engineering framework for generative engine optimization

Abstract:The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.

ACM classes: H.3.3; I.2.7


7. 【2603.29950】Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

Link: https://arxiv.org/abs/2603.29950

Authors: Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Socially Shared Regulation, Effective collaboration requires, Regulation of Learning, manage complex cognitive, Effective collaboration

Comments: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract: Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
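
The cosine similarity mentioned in this abstract is a standard measure and can be sketched in a few lines. The vectors below are made-up toy values, not real sentence embeddings, and the abstract does not say which embedding model or which utterance pairs the authors compared, so treat this only as an illustration of the formula.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy stand-ins for the embeddings of two consecutive utterances.
    previous_utterance = [0.2, 0.7, 0.1]
    current_utterance = [0.1, 0.8, 0.3]
    print(cosine_similarity(previous_utterance, current_utterance))
```

Values near 1 mean the two utterances point in nearly the same direction in embedding space; lower values correspond to the "semantic shifts" the abstract describes.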

8. 【2603.29937】Rewrite the News: Tracing Editorial Reuse Across News Agencies

Link: https://arxiv.org/abs/2603.29937

Authors: Soveatin Kuntur, Nina Smirnova, Anna Wroblewska, Philipp Mayr, Sebastijan Razboršek Maček

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: paper investigates sentence-level, investigates sentence-level text, Slovenian Press Agency, multilingual journalism, paper investigates

Comments: Accepted to SoCon-NLPSI 2026: Social Context (SoCon) and Integrating NLP and Psychology to Study Social Interactions (NLPSI) workshop, co-located with LREC 2026

Abstract:This paper investigates sentence-level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence-level cross-lingual reuse without requiring full translations, designed to support automated pre-selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English-language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7-November 2, 2023; February 1-28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non-literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: this https URL.

9. 【2603.29901】Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Link: https://arxiv.org/abs/2603.29901

Authors: Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: concise clinical impressions, Automated radiology report, strong text-only baselines, distill verbose findings, IMPRESSION transformation

Abstract: Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS → IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.

10. 【2603.29893】Perfecting Human-AI Interaction at Clinical Scale. Turning Production Signals into Safer, More Human Conversations

Link: https://arxiv.org/abs/2603.29893

Authors: Subhabrata Mukherjee, Markel Sanz Ausin, Kriti Aggarwal, Debajyoti Datta, Shanil Puri, Woojeong Jin, Tanmay Laud, Neha Manjunath, Jiayuan Ding, Bibek Paudel, Jan Schellenberger, Zepeng Frazier Huo, Walter Shen, Nima Shirazian, Nate Potter, Sathvik Perkari, Darya Filippova, Anton Morozov, Austin Mease, Vivek Muppalla, Ghada Shakir, Alex Miller, Juliana Ghukasyan, Mariska Raglow-Defranco, Maggie Taylor, Herprit Mahal, Jonathan Agnew

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: clean benchmark accuracy, language shifts mid-call, Healthcare conversational, intent is indirect, production-first regime

Abstract: Healthcare conversational AI agents shouldn't be optimized only for clean benchmark accuracy in a production-first regime; they must be optimized for the lived reality of patient conversations, where audio is imperfect, intent is indirect, language shifts mid-call, and compliance hinges on how guidance is delivered. We present a production-validated framework grounded in real-time signals from 115M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians; 500K+ test calls). These in-the-wild cues -- paralinguistics, turn-taking dynamics, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations -- reveal failure modes that curated data misses and provide actionable training and evaluation signals for safety and reliability. We further show why healthcare-grade safety cannot rely on a single LLM: long-horizon dialogue and limited attention demand redundancy via governed orchestration, independent checks, and verification. Many apparent "reasoning" errors originate upstream, motivating vertical integration across contextual ASR, clarification/repair, ambient speech handling, and latency-aware model/hardware choices. Treating interaction intelligence (tone, pacing, empathy, clarification, turn-taking) as first-class safety variables, we drive measurable gains in safety, documentation, task completion, and equity in building the safest generative AI solution for autonomous patient-facing care. Deployed across more than 10 million real patient calls, Polaris attains a clinical safety score of 99.9%, while significantly improving patient experience with an average patient rating of 8.95 and reducing ASR errors by 50% over enterprise ASR. These results establish real-world interaction intelligence as a critical -- and previously underexplored -- determinant of safety and reliability in patient-facing clinical AI systems.

11. 【2603.29892】FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish

Link: https://arxiv.org/abs/2603.29892

Authors: Daban Q. Jaff, Mohammad Mohammadamini

Categories: Computation and Language (cs.CL)

Keywords: FLEURS offers n-way, n-way parallel speech, automatic speech recognition, offers n-way parallel, Northern Kurdish

Abstract:FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes. The data were recorded by 31 native speakers. It extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on test). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on test; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, S2TT and S2ST tasks. The dataset is publicly released for research use under a CC BY 4.0 license.
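
For readers unfamiliar with the WER and CER figures quoted above, the sketch below shows how these error rates are conventionally computed from an edit distance. It assumes the scores are percentages and applies no text normalization; the paper's exact scoring script and normalization are not described in the abstract, and the example strings are invented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion from the reference
                    current[j - 1] + 1,  # insertion into the reference
                    previous[j - 1] + (ref_item != hyp_item),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length in words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length in characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the data were recorded by native speakers"
    hyp = "the data was recorded by native speaker"
    print(wer(ref, hyp), cer(ref, hyp))
```

Because CER counts character edits, a hypothesis that gets most of each word right can have a much lower CER than WER, which matches the gap between the two numbers reported in the abstract.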

12. 【2603.29875】UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

Link: https://arxiv.org/abs/2603.29875

Authors: Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz, Adam Kozakiewicz, Tomasz Ziętkiewicz

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: chunk-based retrieval pipelines, atomic objects, retrieval pipelines represent, Retrieval-augmented generation, pipelines represent

Abstract:One of the key problems in Retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks as atomic objects, mixing the information contained within such a chunk into a single vector. These vector representations are then fundamentally treated as isolated, independent and self-sufficient, with no attempt to represent possible relations between them. Such an approach has no dedicated mechanisms for handling multi-hop questions. Graph-based RAG systems aimed to ameliorate this problem by modeling information as knowledge-graphs, with entities represented by nodes being connected by robust relations, and forming hierarchical communities. This approach however suffers from its own issues with some of them being: orders of magnitude increased componential complexity in order to create graph-based indices, and reliance on heuristics for performing retrieval. We propose UnWeaver, a novel RAG framework simplifying the idea of GraphRAG. UnWeaver disentangles the contents of the documents into entities which can occur across multiple chunks using an LLM. In the retrieval process entities are used as an intermediate way of recovering original text chunks hence preserving fidelity to the source material. We argue that entity-based decomposition yields a more distilled representation of original information, and additionally serves to reduce noise in the indexing, and generation process.

13. 【2603.29861】Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports

Link: https://arxiv.org/abs/2603.29861

Authors: Benjamin Josef Schüßler, Jakob Prange

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: economy and society, consumers need reliable, ever-growing urgency, urgency of sustainability, massive stream

Comments: Accepted to the NLP4Ecology workshop at LREC 2026

Abstract: With the ever-growing urgency of sustainability in the economy and society, and the massive stream of information that comes with it, consumers need reliable access to that information. To address this need, companies began publishing so-called Environmental, Social, and Governance (ESG) reports, both voluntarily and as required by law. To serve the public, these reports must be addressed not only to financial experts but also to non-expert audiences. But are they written clearly enough? In this work, we extend an existing sentence-level dataset of German ESG reports with crowdsourced readability annotations. We find that, in general, native speakers perceive sentences in ESG reports as easy to read, but also that readability is subjective. We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings. Our analysis shows that, while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small finetuned transformer predicts human readability with the lowest error. Averaging predictions of multiple models can slightly improve the performance at the cost of slower inference.

14. 【2603.29846】SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

Link: https://arxiv.org/abs/2603.29846

Authors: Adar Avsian, Larry Heck

Categories: Computation and Language (cs.CL)

Keywords: increasingly deployed, deployed in multi-agent, Secret-aware Natural language, multi-agent settings, Natural language Evaluation

Abstract:Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.

15. 【2603.29828】Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis

Link: https://arxiv.org/abs/2603.29828

Authors: Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: discovery increasingly depends, Scientific discovery increasingly, existing API-based systems, high-throughput characterization, discovery increasingly

Comments: 17 pages

Abstract: Scientific discovery increasingly depends on high-throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API-based systems. We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts. Its skill-centric framework integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows, connecting physical sample handling with scientific interpretation. Owl-AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code is available at this https URL.

16. 【2603.29801】ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

Link: https://arxiv.org/abs/2603.29801

Authors: Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi, Mehwish Alam

Categories: Computation and Language (cs.CL)

Keywords: Extracting Named Entities, Named Entity Recognition, Italian Digital Editions, Extracting Named, Recognition and Linking

Abstract: This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798-1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916-1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic annotation extraction from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models demonstrate the dataset's challenge for NERL and the gap between zero-shot approaches and fine-tuned models. The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.

17. 【2603.29791】Reasoning-Driven Synthetic Data Generation and Evaluation

Link: https://arxiv.org/abs/2603.29791

Authors: Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: interest require specialized, require specialized multi-modal, specialized multi-modal models, scarce or inaccessible, applications of interest

Comments: Accepted to TMLR 2026, J2C Certification

Abstract:Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

18. 【2603.29765】Training-Free Dynamic Upcycling of Expert Language Models

链接：https://arxiv.org/abs/2603.29765

作者：Eros Fanì,Oğuzhan Ersoy

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：exhibiting strong problem-solving, Large Language Models, Large Language, strong problem-solving capabilities, achieved remarkable performance

备注： Accepted at the ICLR 2026 Workshop on Scaling Post-training for LLMs

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model's original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: this http URL.
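
The abstract states only that DUME relies on the closed-form solution of ridge regression; it does not say which features or targets the regression is fitted on. The sketch below is therefore a generic illustration of that closed form, not the authors' implementation. The function name `ridge_closed_form`, the synthetic data, and the regularization strength `lam` are all assumptions made for the example.

```python
import numpy as np

def ridge_closed_form(X, Y, lam=1e-2):
    """Solve min_W ||X W - Y||^2 + lam * ||W||^2 without iterative training."""
    d = X.shape[1]
    # Regularized normal equations: (X^T X + lam * I) W = X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Hypothetical data: 200 feature vectors of size 8, 3 target columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_W = rng.normal(size=(8, 3))
Y = X @ true_W  # noise-free targets, so a tiny lam should recover true_W

W = ridge_closed_form(X, Y, lam=1e-6)
print(np.abs(W - true_W).max())
```

Because the solution is a single linear solve, no gradient-based optimization is needed, which is the property the abstract highlights.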

19. 【2603.29676】A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

链接：https://arxiv.org/abs/2603.29676

作者：Lixin Xiu,Xufang Luo,Hideki Nakayama

类目：Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：Large vision-language models, achieve impressive performance, processes remain opaque, internal decision-making processes, decision-making processes remain

备注： Accepted at ICLR 2026. Project page: [this https URL](https://riishin.github.io/pid-lvlm-iclr26/)

点击查看摘要

Abstract:Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at this https URL .

20. 【2603.29665】Near-Miss: Latent Policy Failure Detection in Agentic Workflows

链接：https://arxiv.org/abs/2603.29665

作者：Ella Rabinovich,David Boaz,Naama Zwerdling,Ateret Anaby-Tavor

类目：Computation and Language (cs.CL)

关键词：governing conditional updates, policies governing conditional, business process automation, automation often require, require compliance

备注：

点击查看摘要

Abstract:Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as "near-misses" or "latent failures". In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the $\tau^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.

21. 【2603.29661】Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models

链接：https://arxiv.org/abs/2603.29661

作者：Brian Felipe Keith-Norambuena,Carolina Inés Rojas-Córdova,Claudio Juvenal Meneses-Villegas,Elizabeth Johanna Lam-Esquenazi,Angélica María Flores-Bustos,Ignacio Alejandro Molina-Villablanca,Joshua Emanuel Leyton-Vallejos

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词：Existing narrative extraction, Existing narrative, Narrative Maps supports, face a trade-off, Maps supports rich

备注： Text2Story Workshop 2026 at ECIR 2026

点击查看摘要

Abstract:Existing narrative extraction methods face a trade-off between coherence, interactivity, and multi-storyline support. Narrative Maps supports rich interaction and generates multiple storylines as a byproduct of its coverage constraints, though this comes at the cost of individual path coherence. Narrative Trails achieves high coherence through maximum capacity path optimization but provides no mechanism for user guidance or multiple perspectives. We introduce agenda-based narrative extraction, a method that bridges this gap by integrating large language models into the Narrative Trails pathfinding process to steer storyline construction toward user-specified perspectives. Our approach uses an LLM at each step to rank candidate documents based on their alignment with a given agenda while maintaining narrative coherence. Running the algorithm with different agendas yields different storylines through the same corpus. We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas. LLM-driven steering achieves 9.9% higher alignment than keyword matching on semantic agendas (p=0.017), with 13.3% improvement on "Regime Crackdown" specifically (p=0.037), while keyword matching remains competitive on agendas with literal keyword overlap. The coherence cost is minimal: LLM steering reduces coherence by only 2.2% compared to the agenda-agnostic baseline. Counter-agendas that contradict the source material score uniformly low (2.2-2.5) across all methods, confirming that steering cannot fabricate unsupported narratives.

22. 【2603.29651】Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation

链接：https://arxiv.org/abs/2603.29651

作者：Brian Felipe Keith-Norambuena,Fausto German,Eric Krokos,Sarah Joseph,Chris North

类目：Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词：Semantic interaction, incorporate their cognitive, cognitive processes, narrative map, narrative

备注： Text2Story Workshop 2026 at ECIR 2026

点击查看摘要

Abstract:Semantic interaction (SI) enables analysts to incorporate their cognitive processes into AI models through direct manipulation of visualizations. While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited. This paper presents a user study that evaluates SI for narrative map sensemaking, involving 33 participants under three conditions: a timeline baseline, a basic narrative map, and an interactive narrative map with SI capabilities. The results show that the map-based prototypes yielded more insights than the timeline baseline, with the SI-enabled condition reaching statistical significance and the basic map condition trending in the same direction. The SI-enabled condition showed the highest mean performance; differences between the map conditions were not statistically significant but showed large effect sizes (d 0.8), suggesting that the study was underpowered to detect them. Qualitative analysis identified two distinct SI approaches, corrective and additive, that enable analysts to impose quality judgments and organizational structure on extracted narratives. We also find that SI users achieved comparable exploration breadth with less parameter manipulation, suggesting that SI serves as an alternative pathway for model refinement. This work provides empirical evidence that map-based representations outperform timelines for narrative sensemaking, along with qualitative insights into how analysts use SI for narrative refinement.

23. 【2603.29608】Learning Diagnostic Reasoning for Decision Support in Toxicology

链接：https://arxiv.org/abs/2603.29608

作者：Nico Oberländer,David Bani-Harouni,Tobias Zellner,Nassir Navab,Florian Eyer,Matthias Keicher

类目：Computation and Language (cs.CL)

关键词：Acute poly-substance intoxication, incomplete ingestion details, intoxication requires rapid, Acute poly-substance, poly-substance intoxication requires

备注：

点击查看摘要

Abstract:Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g., paramedic scene descriptions and unreliable patient self-reports or known histories) with structured medical data such as vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
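
The abstract reports Micro-F1 for the clinical comparison and describes a reward that penalizes both missed and hallucinated substances, but it does not name the exact agreement metric used as the reward. As an illustration only, the sketch below computes micro-averaged F1 over predicted label sets, a metric with that same two-sided penalty. The function name `micro_f1` and the substance labels are hypothetical.

```python
def micro_f1(predicted, reference):
    """Micro-averaged F1 over per-case label sets."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # substances correctly identified
        fp += len(pred - ref)   # predicted but absent ("hallucinated")
        fn += len(ref - pred)   # present but missed
    denom = 2 * tp + fp + fn
    # Convention chosen here: if nothing was predicted and nothing was present, score 1.0.
    return 2 * tp / denom if denom else 1.0

# Hypothetical cases: two patients, illustrative substance classes.
predicted = [{"opioid", "alcohol"}, {"benzodiazepine"}]
reference = [{"opioid"}, {"benzodiazepine", "alcohol"}]
score = micro_f1(predicted, reference)
print(round(score, 3))
```

Here one false positive and one false negative each lower the score, so a policy rewarded this way gains nothing from over-predicting or under-predicting substances.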

24. 【2603.29559】When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

链接：https://arxiv.org/abs/2603.29559

作者：Robinson Ferrer,Damla Turgut,Zhongzhou Chen,Shashank Sonkar

类目：Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词：Large Language Models, Large Language, Language Models, show promise, promise for automated

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a "confidence floor" that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available at this https URL.
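
The calibration numbers above are expected calibration error (ECE) values. For readers unfamiliar with the metric, the sketch below shows the standard equal-width binned estimate. The abstract does not state the binning scheme the authors used, so the 10-bin default and the function name `expected_calibration_error` are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; a confidence of exactly 0.0 falls in the first bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Hypothetical grader outputs: 90% stated confidence, 3 of 4 grades correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value means stated confidence tracks observed accuracy more closely, which is what makes a confidence threshold usable for deciding which grades to send to human review.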

25. 【2603.29557】FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration

链接：https://arxiv.org/abs/2603.29557

作者：Qiyao Wang,Hongbo Wang,Longze Chen,Zhihao Yang,Guhong Chen,Hamid Alinejad-Rokny,Hui Li,Yuan Lin,Min Yang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：AI-driven autonomous research, Scientific idea generation, insufficiently divergent ideas, Carlo Tree Search, Scientific idea

备注： 30 pages, 11 figures, 15 tables

点击查看摘要

Abstract:Scientific idea generation (SIG) is critical to AI-driven autonomous research, yet existing approaches are often constrained by a static retrieval-then-generation paradigm, leading to homogeneous and insufficiently divergent ideas. In this work, we propose FlowPIE, a tightly coupled retrieval-generation framework that treats literature exploration and idea generation as a co-evolving process. FlowPIE expands literature trajectories via a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets, using the quality of current ideas assessed by an LLM-based generative reward model (GRM) as a supervised signal to guide adaptive retrieval and construct a diverse, high-quality initial population. Based on this population, FlowPIE models idea generation as a test-time idea evolution process, applying selection, crossover, and mutation with the isolation island paradigm and GRM-based fitness computation to incorporate cross-domain knowledge. It effectively mitigates the information cocoons arising from over-reliance on parametric knowledge and static literature. Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.

26. 【2603.29552】Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

链接：https://arxiv.org/abs/2603.29552

作者：Linda Zeng,Steven Y. Feng,Michael C. Frank

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Multilingualism is incredibly, children learn multiple, learn multiple languages, incredibly common, important theoretical

备注： Code and data at [this https URL](https://github.com/styfeng/bilingual-babyLM)

点击查看摘要

Abstract:Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.

27. 【2603.29541】Can LLM Agents Identify Spoken Dialects like a Linguist?

链接：https://arxiv.org/abs/2603.29541

作者：Tobias Bystrich,Lukas Hamm,Maria Hassan,Lea Fischbach,Lucie Flek,Akbar Karimi

类目：Computation and Language (cs.CL)

关键词：including Swiss German, Swiss German, labeled dialectal speech, including Swiss, audio dialect classification

备注： Accepted to DialRes Workshop @ LREC 2026

点击查看摘要

Abstract:Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human linguist one. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that, when linguistic information is provided, the LLM predictions improve. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.

28. 【2603.29522】Baby Scale: Investigating Models Trained on Individual Children's Language Input

链接：https://arxiv.org/abs/2603.29522

作者：Steven Y. Feng,Alvin W.M. Tan,Michael C. Frank

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Modern language models, Modern language, produce useful behavior, orders of magnitude, begin to produce

备注： Code and data at [this https URL](https://github.com/styfeng/babyscale-LM)

点击查看摘要

Abstract:Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.

29. 【2603.29518】Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics

链接：https://arxiv.org/abs/2603.29518

作者：Alain Vázquez,Maria Inés Torres

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：generate diverse language, diverse language forms, Natural Language Generation, Conversational systems, convert Meaning Representations

备注：

点击查看摘要

Abstract:Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via DAs and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.

30. 【2603.29517】LLM Probe: Evaluating LLMs for Low-Resource Languages

链接：https://arxiv.org/abs/2603.29517

作者：Hailay Kidu Teklehaymanot,Gebrearegawi Gebremariam,Wolfgang Nejdl

类目：Computation and Language (cs.CL)

关键词：limited annotated resources, morphologically rich languages, rapid advances, advances in large, morphologically rich

备注： 11 pages, 6 tables

点击查看摘要

Abstract:Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.

31. 【2603.29497】Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

链接：https://arxiv.org/abs/2603.29497

作者：Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi

类目：Computation and Language (cs.CL)

关键词：privacy-preserving natural language, Accurate privacy evaluation, textual data remains, Accurate privacy, natural language processing

备注： Accepted to the LREC CALD-pseudo 2026 Workshop

点击查看摘要

Abstract:Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.

32. 【2603.29493】MemFactory: Unified Inference Training Framework for Agent Memory

链接：https://arxiv.org/abs/2603.29493

作者：Ziliang Guo,Ziheng Li,Zhiyu Li

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Memory-augmented Large Language, Large Language Models, Large Language, applying Reinforcement Learning, Memory-augmented Large

备注： 10 pages, Code: [this https URL](https://github.com/Valsure/MemFactory)

点击查看摘要

Abstract:Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.

33. 【2603.29492】Calibrated Confidence Expression for Radiology Report Generation

链接：https://arxiv.org/abs/2603.29492

作者：David Bani-Harouni,Chantal Pellegrini,Julian Lüers,Su Hwan Kim,Markus Baalmann,Benedikt Wiestler,Rickmer Braren,Nassir Navab,Matthias Keicher

类目：Computation and Language (cs.CL)

关键词：Large Vision-Language Models, enabling selective radiologist, clinically interpretable indicators, selective radiologist verification, hallucinated findings influencing

备注：

点击查看摘要

Abstract:Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad's report-level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI assistance for report generation.
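
The claim that a logarithmic scoring rule rewards truthful confidence can be checked numerically. The sketch below is a minimal illustration of the binary log score, not ConRad's actual reward function, whose exact form the abstract does not give. The function names and the clipping constant `eps` are assumptions.

```python
import math

def log_score_reward(confidence, is_correct, eps=1e-6):
    """Binary logarithmic score: log p if correct, log(1 - p) otherwise."""
    p = min(max(confidence, eps), 1.0 - eps)  # clip to keep the logarithm finite
    return math.log(p) if is_correct else math.log(1.0 - p)

def expected_reward(stated, true_prob):
    """Expected reward when the claim is correct with probability true_prob."""
    return (true_prob * log_score_reward(stated, True)
            + (1.0 - true_prob) * log_score_reward(stated, False))

# Hypothetical case: the model's claims are right 70% of the time.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda s: expected_reward(s, 0.7))
print(best)
```

The expected reward peaks when the stated confidence equals the true probability of being correct, so both overconfidence and underconfidence cost reward. This is the sense in which the log score is a proper scoring rule.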

34. 【2603.29467】M-MiniGPT4: Multilingual VLLM Alignment via Translated Data

Link: https://arxiv.org/abs/2603.29467

Authors: Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Vision Large Language, Multilingual Vision Large, Vision Large, Large Language Model, Large Language

Comments: 6 pages, ACL 2026, Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)

Click to view abstract

Abstract:This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.

35. 【2603.29466】An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms

Link: https://arxiv.org/abs/2603.29466

Authors: Nils Grünefeld, Jes Frellsen, Christian Hardmeier

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: quantifying predictive uncertainty, typically unavailable, Existing methods, quantifying predictive, computationally intractable

Comments:

Click to view abstract

Abstract:Existing methods for quantifying predictive uncertainty in neural networks are either computationally intractable for large language models or require access to training data that is typically unavailable. We derive a lightweight alternative through two approximations: a first-order Taylor expansion that expresses uncertainty in terms of the gradient of the prediction and the parameter covariance, and an isotropy assumption on the parameter covariance. Together, these yield epistemic uncertainty as the squared gradient norm and aleatoric uncertainty as the Bernoulli variance of the point prediction, from a single forward-backward pass through an unmodified pretrained model. We justify the isotropy assumption by showing that covariance estimates built from non-training data introduce structured distortions that isotropic covariance avoids, and that theoretical results on the spectral properties of large networks support the approximation at scale. Validation against reference Markov Chain Monte Carlo estimates on synthetic problems shows strong correspondence that improves with model size. We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate achieves the highest mean AUROC on TruthfulQA, where questions involve genuine conflict between plausible answers, but falls to near chance on TriviaQA's factual recall, suggesting that parameter-level uncertainty captures a fundamentally different signal than self-assessment methods.
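
The two quantities described in the abstract above can be sketched on a toy logistic model, where the gradient is available in closed form. This is an illustration under stated assumptions, not the authors' code: the model, the function names, and the isotropic variance scale `sigma2` are hypothetical. With a first-order expansion and covariance `sigma2 * I`, the epistemic term reduces to `sigma2 * ||grad p||^2`, and the aleatoric term is the Bernoulli variance `p * (1 - p)`.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def uncertainties(w, x, sigma2: float = 1.0):
    """Return (p, epistemic, aleatoric) for a toy model p = sigmoid(w . x).

    epistemic = sigma2 * ||dp/dw||^2  (first-order Taylor + isotropic covariance)
    aleatoric = p * (1 - p)           (Bernoulli variance of the point prediction)
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = sigmoid(z)
    # For a logistic model, dp/dw = p * (1 - p) * x.
    grad = [p * (1.0 - p) * xi for xi in x]
    epistemic = sigma2 * sum(g * g for g in grad)
    aleatoric = p * (1.0 - p)
    return p, epistemic, aleatoric


p, epi, alea = uncertainties(w=[0.0, 0.0], x=[1.0, 2.0])
print(p, epi, alea)
```

For a large network the gradient would come from a backward pass rather than a closed form, but the two estimates are assembled the same way.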

36. 【2603.29454】Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods

Link: https://arxiv.org/abs/2603.29454

Authors: Baoyi Zeng, Andrea Nini

Categories: Computation and Language (cs.CL)

Keywords: Authorship verification, specific individual, task of determining, Authorship, forensic linguistics

Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Authorship verification (AV), the task of determining whether a questioned text was written by a specific individual, is a critical part of forensic linguistics. While manual authorial impersonation by perpetrators has long been a recognized threat in historical forensic cases, recent advances in large language models (LLMs) raise new challenges, as adversaries may exploit these tools to impersonate another's writing. This study investigates whether prompted LLMs can generate convincing authorial impersonations and whether such outputs can evade existing forensic AV systems. Using GPT-4o as the adversary model, we generated impersonation texts under four prompting conditions across three genres: emails, text messages, and social media posts. We then evaluated these outputs against both non-neural AV methods (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural approaches (AdHominem, LUAR, STAR) within a likelihood-ratio framework. Results show that LLM-generated texts failed to sufficiently replicate authorial individuality to bypass established AV systems. We also observed that some methods achieved even higher accuracy when rejecting impersonation texts compared to genuine negative samples. Overall, these findings indicate that, despite the accessibility of LLMs, current AV systems remain robust against entry-level impersonation attempts across multiple genres. Furthermore, we demonstrate that this counter-intuitive resilience stems, at least in part, from the higher lexical diversity and entropy inherent in LLM-generated texts.

37. 【2603.29429】CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Link: https://arxiv.org/abs/2603.29429

Authors: Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang, Adam C. Frank, Ruishan Liu

Categories: Computation and Language (cs.CL)

Keywords: LLM-based tools, increasingly mediated, mediated by conversational, potential risks, mental-health support dialogues

Comments:

Click to view abstract

Abstract:Mental-health support is increasingly mediated by conversational systems (e.g., LLM-based tools), but users often lack structured ways to audit the quality and potential risks of the support they receive. We introduce CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Rather than producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports with session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined custom metrics, operationalized with configurable LLM judges. CounselReflect is available as a web application, browser extension, and command-line interface (CLI), enabling use in real-time settings as well as at scale. Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing. A demo video and full source code are also provided.

38. 【2603.29406】PRISM: PRIor from corpus Statistics for topic Modeling

Link: https://arxiv.org/abs/2603.29406

Authors: Tal Ishon, Yoav Goldberg, Uri Shaham

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: foundational probabilistic framework, uncover latent semantic, latent semantic structure, probabilistic framework, seeks to uncover

Comments:

Click to view abstract

Abstract:Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: this https URL.

39. 【2603.29396】Is my model perplexed for the right reason? Contrasting LLMs' Benchmark Behavior with Token-Level Perplexity

Link: https://arxiv.org/abs/2603.29396

Authors: Zoë Prins, Samuele Punzo, Frank Wildenburg, Giovanni Cinà, Sandro Pezzelle

Categories: Computation and Language (cs.CL)

Keywords: offering limited insight, risking confirmation bias, Large language models, Standard evaluations, evaluations of Large

Comments:

Click to view abstract

Abstract:Standard evaluations of large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few 'pivotal' tokens, our method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Experiments on controlled linguistic benchmarks with several open-weight LLMs show that, while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing that models rely on heuristics other than the expected linguistic ones.
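
The idea of locating a perplexity shift at particular tokens can be sketched as follows. This is a rough illustration, not the authors' framework: the per-token log-probabilities are made-up numbers, the function names are hypothetical, and the sketch assumes the two sentences of a minimal pair are token-aligned and of equal length.

```python
import math


def perplexity(logprobs):
    """Sentence perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))


def pivotal_share(lp_a, lp_b, pivotal_idx):
    """Fraction of the total per-token surprisal change that falls on pivotal positions."""
    diffs = [(-b) - (-a) for a, b in zip(lp_a, lp_b)]  # change in surprisal per token
    total = sum(abs(d) for d in diffs)
    pivotal = sum(abs(diffs[i]) for i in pivotal_idx)
    return pivotal / total if total else 0.0


# Made-up log-probabilities for an acceptable and an unacceptable sentence
# that differ only at token position 2 (the hypothetical pivotal token).
lp_good = [-1.0, -2.0, -0.5, -1.5]
lp_bad = [-1.0, -2.0, -3.0, -2.0]

print(perplexity(lp_good), perplexity(lp_bad))
print(pivotal_share(lp_good, lp_bad, pivotal_idx=[2]))
```

A share well below 1 would indicate that part of the shift sits on non-pivotal tokens, which is the kind of pattern the abstract reports.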

40. 【2603.29373】Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

Link: https://arxiv.org/abs/2603.29373

Authors: Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao, Chaohao Du, Ruishan Liu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, health information support, challenging patient behaviors, Large

Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.

41. 【2603.29347】Developing a Guideline for the Labovian-Structural Analysis of Oral Narratives in Japanese

Link: https://arxiv.org/abs/2603.29347

Authors: Amane Watahiki, Tomoki Doi, Akari Kikuchi, Hiroshi Ohata, Yuki I. Nakata, Takuya Niikawa, Taiga Shinozaki, Hitomi Yanaka

Categories: Computation and Language (cs.CL)

Keywords: Labovian narrative analysis, Labovian, Labovian model, Narrative analysis, Japanese

Comments: Accepted at The Fifteenth biennial Language Resources and Evaluation Conference (LREC) 2026

Click to view abstract

Abstract:Narrative analysis is a cornerstone of qualitative research. One leading approach is the Labovian model, but its application is labor-intensive, requiring a holistic, recursive interpretive process that moves back and forth between individual parts of the transcript and the transcript as a whole. Existing Labovian datasets are available only in English, which differs markedly from Japanese in terms of grammar and discourse conventions. To address this gap, we introduce the first systematic guidelines for Labovian narrative analysis of Japanese narrative data. Our guidelines retain all six Labovian categories and extend the framework by providing explicit rules for clause segmentation tailored to Japanese constructions. In addition, our guidelines cover a broader range of clause types and narrative types. Using these guidelines, annotators achieved high agreement in clause segmentation (Fleiss' kappa = 0.80) and moderate agreement in two structural classification tasks (Krippendorff's alpha = 0.41 and 0.45, respectively), one of which is slightly higher than that found in prior work despite the use of finer-grained distinctions. This paper describes the Labovian model, the proposed guidelines, the annotation process, and their utility. It concludes by discussing the challenges encountered during the annotation process and the prospects for developing a larger dataset for structural narrative analysis in Japanese qualitative research.
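
For readers unfamiliar with the agreement statistic reported above, here is a minimal sketch of Fleiss' kappa using the standard formula. The rating counts are invented for illustration and are unrelated to the paper's annotation data; the function name is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of counts[i][j] = number of raters who
    assigned item i to category j. Every item must have the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1.0 - p_e)


# Invented example: 4 items, 3 raters, 2 categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(table))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.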

42. 【2603.29346】L-ReLF: A Framework for Lexical Dataset Creation

Link: https://arxiv.org/abs/2603.29346

Authors: Anass Sedrati, Mounir Afifi, Reda Benkhadra

Categories: Computation and Language (cs.CL)

Keywords: Low-Resource Lexical Framework, creating high-quality, Lexical Framework, paper introduces, Optical Character Recognition

Comments: Accepted to the 2026 International Conference on Natural Language Processing (ICNLP). 6 pages, 1 figure

Click to view abstract

Abstract:This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

43. 【2603.29345】Open Machine Translation for Esperanto

Link: https://arxiv.org/abs/2603.29345

Authors: Ona de Gibert, Lluís de Gibert

Categories: Computation and Language (cs.CL)

Keywords: productive word formation, widespread constructed language, word formation, widespread constructed, regular grammar

Comments: Accepted to SIGUL 2026

Click to view abstract

Abstract:Esperanto is a widespread constructed language, known for its regular grammar and productive word formation. Despite having substantial resources available thanks to its online community, it remains relatively underexplored in the context of modern machine translation (MT) approaches. In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes. We evaluate translation quality across six language directions involving English, Spanish, Catalan, and Esperanto using multiple automatic metrics as well as human evaluation. Our results show that the NLLB family achieves the best performance in all language pairs, followed closely by our trained compact models and a fine-tuned general-purpose LLM. Human evaluation confirms this trend, with NLLB translations preferred in approximately half of the comparisons, although noticeable errors remain. In line with Esperanto's tradition of openness and international collaboration, we release our code and best-performing models publicly.

44. 【2603.29336】CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

Link: https://arxiv.org/abs/2603.29336

Authors: Shohei Higashiyama, Masao Ideuchi, Masao Utiyama

Categories: Computation and Language (cs.CL)

Keywords: represent real-world entities, associating linguistic expressions, Japanese entity linking, evaluating Japanese entity, knowledge base

Comments:

Click to view abstract

Abstract:Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.

45. 【2603.29288】Sima AIunty: Caste Audit in LLM-Driven Matchmaking

Link: https://arxiv.org/abs/2603.29288

Authors: Atharva Naik, Shounok Kar, Varnika Sharma, Ashwin Rajadesingan, Koustuv Saha

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)

Keywords: South Asian contexts, personal decisions, decisions in relational, deeply entwined, potentially be shaped

Comments:

Click to view abstract

Abstract:Social and personal decisions in relational domains such as matchmaking are deeply entwined with cultural norms and historical hierarchies, and can potentially be shaped by algorithmic and AI-mediated assessments of compatibility, acceptance, and stability. In South Asian contexts, caste remains a central aspect of marital decision-making, yet little is known about how contemporary large language models (LLMs) reproduce or disrupt caste-based stratification in such settings. In this work, we conduct a controlled audit of caste bias in LLM-mediated matchmaking evaluations using real-world matrimonial profiles. We vary caste identity across Brahmin, Kshatriya, Vaishya, Shudra, and Dalit, and income across five buckets, and evaluate five LLM families (GPT, Gemini, Llama, Qwen, and BharatGPT). Models are prompted to assess profiles along dimensions of social acceptance, marital stability, and cultural compatibility. Our analysis reveals consistent hierarchical patterns across models: same-caste matches are rated most favorably, with average ratings up to 25% higher (on a 10-point scale) than inter-caste matches, which are further ordered according to traditional caste hierarchy. These findings highlight how existing caste hierarchies are reproduced in LLM decision-making and underscore the need for culturally grounded evaluation and intervention strategies in AI systems deployed in socially sensitive domains, where such systems risk reinforcing historical forms of exclusion.

46. 【2603.29259】Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

Link: https://arxiv.org/abs/2603.29259

Authors: Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Preference-based alignment objectives, RLHF-style pairwise learning, large language models, Preference-based alignment, widely adopted

Comments:

Click to view abstract

Abstract:Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems. Yet, existing work rarely examines how Direct Preference Optimization (DPO) behaves under implicit feedback, where unobserved items are not reliable negatives. We conduct systematic experiments on multimodal sequential recommendation to compare common negative-selection strategies and their interaction with DPO training. Our central finding is that a simple modification, replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool, consistently improves ranking performance. We attribute its effectiveness to two factors: (1) reducing erroneous suppressive gradients caused by false negatives, and (2) retaining informative hard signals while smoothing optimization via controlled stochasticity. With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO achieves up to 5.25% NDCG@5 on three Amazon benchmarks, with nearly unchanged inference cost.
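
The contrast between a deterministic hard negative and stochastic sampling from a top-K pool can be sketched as below. This is a hedged illustration rather than the RoDPO implementation: the item scores, the function names, and the choice of uniform sampling within the pool are assumptions, since the abstract does not specify the sampling distribution.

```python
import random


def hard_negative(scores, positives):
    """Deterministic baseline: always pick the highest-scored non-positive item."""
    candidates = [i for i in scores if i not in positives]
    return max(candidates, key=lambda i: scores[i])


def sample_topk_negative(scores, positives, k, rng):
    """Sample one negative uniformly from the current top-k non-positive items.

    Recomputing the pool from fresh model scores at each step makes it 'dynamic'.
    """
    candidates = [i for i in scores if i not in positives]
    pool = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    return rng.choice(pool)


# Hypothetical model scores for six items; item "a" is the observed positive.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4, "e": 0.2, "f": 0.1}
rng = random.Random(0)

print(hard_negative(scores, positives={"a"}))
print([sample_topk_negative(scores, {"a"}, k=3, rng=rng) for _ in range(5)])
```

Because the top-scored unobserved item may be a false negative, spreading the choice over several high-scoring candidates reduces how often it is penalized, while still drawing from informative hard examples.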

47. 【2603.29247】MemRerank: Preference Memory for Personalized Product Reranking

Link: https://arxiv.org/abs/2603.29247

Authors: Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yi Gong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: shopping agents increasingly, agents increasingly rely, naively appending raw, long purchase histories, appending raw history

Comments:

Click to view abstract

Abstract:LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

48. 【2603.29244】The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Link: https://arxiv.org/abs/2603.29244

Authors: Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: West Africa, East Africa, Central Africa, large-scale multimodal corpus, multimodal corpus spanning

Comments:

Click to view abstract

Abstract:We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
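
The absolute and relative WER reductions quoted above follow directly from the two reported error rates. A small check, using only the figures given in the abstract (the variable names are illustrative):

```python
prior_wer = 8.3   # prior academic SOTA on Swahili (Common Voice), in percent
new_wer = 3.24    # reported best ASR system, in percent

absolute_reduction = prior_wer - new_wer              # percentage points
relative_reduction = absolute_reduction / prior_wer   # fraction of the prior error

print(f"absolute: {absolute_reduction:.2f} percentage points")
print(f"relative: {relative_reduction:.1%}")
```

The absolute gap is 5.06 points, which the abstract rounds to 5.1, and the relative reduction rounds to 61%.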

49. 【2603.29232】Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

Link: https://arxiv.org/abs/2603.29232

Authors: Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: noisy documents remains, Large language models, documents remains brittle, Large language, reasoning over long

Comments: 26 pages, 17 figures, 10 tables. Accepted at ICLR 2026

Click to view abstract

Abstract:Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at this https URL.

50. 【2603.29221】SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali

Link: https://arxiv.org/abs/2603.29221

Authors: Ranidu Gurusinghe, Nevidu Jayatilleke

Categories: Computation and Language (cs.CL)

Keywords: Pali doctrinal texts, complete web-scraped Tripitaka, Tripitaka canonical texts, web-scraped Tripitaka canonical, texts comprising approximately

Comments: 17 pages, 5 figures, 5 tables, Accepted paper at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL) @ LREC 2026

Click to view abstract

Abstract:SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate the performance of language models using ten pretrained models, with perplexity scores ranging from 1.09 to 189.67 on our corpus. This analysis shows that proprietary models significantly outperform open-source alternatives by factors of three to six times. This corpus supports the pretraining of domain-adapted language models, facilitates historical language analysis, and aids in the development of information retrieval systems for Buddhist scholarship while preserving Sinhala cultural heritage.

51. 【2603.29219】SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation

Link: https://arxiv.org/abs/2603.29219

Authors: Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Arabic Sign Language, Syrian Arabic Sign, Sign language, primary approach, high-resource sign languages

Comments:

Click to view abstract

Abstract:Sign language is the primary means of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news is delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.

52. 【2603.29211】Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

Link: https://arxiv.org/abs/2603.29211

Authors: Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent years, continued to improve, multimodal large models, fine-grained visual perception, visual perception

Comments: 41 pages, 10 figures

Click to view abstract

Abstract:In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.

53. 【2603.29159】Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa

链接：https://arxiv.org/abs/2603.29159

作者：George Boateng,Samuel Boateng,Victor Kumbol

类目：Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词：large-scale online coding, Providing timely, large-scale online, online coding, accurate learning support

备注： 8 pages, Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract:Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.

54. 【2603.29140】Designing FSMs Specifications from Requirements with GPT 4.0

Link: https://arxiv.org/abs/2603.29140

Authors: Omer Nguena Timo, Paul-Alexis Rodriguez, Florent Avellaneda

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

Keywords: Finite state machines, executable formal specifications, Finite state, executable formal, formal specifications

Comments:

Abstract:Finite state machines (FSMs) are executable formal specifications of reactive systems. These machines are designed based on systems' requirements. The requirements are often recorded in textual documents written in natural languages. FSMs play a crucial role in different phases of model-driven system engineering (MDE). For example, they serve to automate testing activities. FSM quality is critical: the lower the quality of an FSM, the higher the number of faults surviving the testing phase and the higher the risk of failure of the systems in production, which could lead to catastrophic scenarios. Therefore, this paper leverages recent advances in LLMs to propose an LLM-based framework for designing FSMs from requirements. The framework also suggests an expert-centric approach based on FSM mutation and test generation for repairing the FSMs produced by LLMs. This paper also provides an experimental analysis and evaluation of LLMs' capabilities in performing the tasks presented in the framework and FSM repair via various methods. The paper presents experimental results with simulated data. These results and methods bring a new analysis and vision of LLMs that are useful for further development of machine learning technology and its applications to MDE.

55. 【2603.29123】Concept Training for Human-Aligned Language Models

Link: https://arxiv.org/abs/2603.29123

Authors: Christine Zhang, Dan Jurafsky, Chen Shani

Subjects: Computation and Language (cs.CL)

Keywords: single continuation token, next-token prediction, single continuation, trains language models, objective trains language

Comments:

Abstract:The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence "this website is safe to browse" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
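The contrast between single-token and concept-level targets can be illustrated with a toy loss. The sketch below is only an assumption about one plausible formulation (negative log of the total probability mass assigned to the concept set); the abstract does not state the paper's actual objective, and the function names and the tiny vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def ntp_loss(logits, target):
    """Standard next-token loss: only one token counts as correct."""
    return -math.log(softmax(logits)[target])

def concept_loss(logits, concept):
    """Assumed concept-level loss: any token in the set counts as correct."""
    probs = softmax(logits)
    return -math.log(sum(probs[tok] for tok in concept))

# Toy vocabulary with uniform logits (hypothetical values).
logits = {"browse": 1.0, "visit": 1.0, "surf": 1.0, "delete": 1.0}
single = ntp_loss(logits, "browse")
grouped = concept_loss(logits, {"browse", "visit", "surf"})
```

With uniform logits, the single-token loss is log 4 while the concept loss is -log 0.75, so a model that spreads probability over valid continuations is penalized less under the set-valued target.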

56. 【2603.29112】GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

Link: https://arxiv.org/abs/2603.29112

Authors: Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models', evaluating Large Language, Language Models', Large Language, evaluating Large

Comments: 9 figures, 20 tables; code at https://github.com/facebookresearch/GISTBench

Abstract:We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.

57. 【2603.29093】APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Link: https://arxiv.org/abs/2603.29093

Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: LLM-based autonomous agents, autonomous agents lack, agents lack persistent, structurally identical tasks, lack persistent procedural

Comments: 17 pages, 13 figures

Abstract:LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

58. 【2603.29078】PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression

Link: https://arxiv.org/abs/2603.29078

Authors: Caio Vicentino

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: neural network weights, post-training weight quantization, weight quantization method, achieve near-lossless compression, large language models

Comments: 10 pages, 5 tables, 2 algorithms. Code: https://github.com/caiovicentino/eoq-quantization Models: https://huggingface.co/caiovicentino1

Abstract:We present PolarQuant, a post-training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near-lossless compression. PolarQuant operates in three stages: (1) block-wise normalization to the unit hypersphere, (2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.
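The three stages named in the abstract (block normalization, Walsh-Hadamard rotation, Gaussian-matched centroids) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the released implementation: the function names are hypothetical, and the centroids here are simple quantile midpoints of a Gaussian with variance 1/dim, which is an assumption standing in for whatever centroid construction the paper actually uses.

```python
import math
from statistics import NormalDist

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def gaussian_centroids(levels, dim):
    """Assumed centroids: quantile midpoints of N(0, 1/dim)."""
    dist = NormalDist(0.0, 1.0 / math.sqrt(dim))
    return [dist.inv_cdf((i + 0.5) / levels) for i in range(levels)]

def quantize_block(block, bits=5):
    """Stage 1: normalize. Stage 2: rotate. Stage 3: snap to nearest centroid."""
    norm = math.sqrt(sum(v * v for v in block)) or 1.0
    rotated = fwht([v / norm for v in block])
    cents = gaussian_centroids(2 ** bits, len(block))
    codes = [min(range(len(cents)), key=lambda k: abs(cents[k] - r)) for r in rotated]
    return norm, codes

def dequantize_block(norm, codes, bits=5):
    """Look up centroids, undo the rotation, and restore the block norm."""
    cents = gaussian_centroids(2 ** bits, len(codes))
    # The orthonormal Walsh-Hadamard transform is its own inverse.
    return [v * norm for v in fwht([cents[c] for c in codes])]

block = [0.5, -1.0, 0.25, 2.0, -0.75, 0.1, 0.0, 1.5]
norm, codes = quantize_block(block)
restored = dequantize_block(norm, codes)
```

Because the rotation is orthonormal, the reconstruction error in weight space equals the quantization error in the rotated space, which is why matching the centroids to the rotated distribution matters.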

59. 【2603.29077】Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs

Link: https://arxiv.org/abs/2603.29077

Authors: Aizirek Turdubaeva, Uichin Lee

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, Large language, language models, understand and adapt, adapt to human

Comments:

Abstract:Large language models (LLMs) are increasingly used in cross-cultural systems to understand and adapt to human emotions, which are shaped by cultural norms of expression and interpretation. However, prior work on emotion attribution has focused mainly on interpretation, overlooking the cultural background of emotion generators. This assumption of universality neglects variation in how emotions are expressed and perceived across nations. To address this gap, we propose a Generator-Interpreter framework that captures dual perspectives of emotion attribution by considering both expression and interpretation. We systematically evaluate six LLMs on an emotion attribution task using data from 15 countries. Our analysis reveals that performance variations depend on the emotion type and cultural context. Generator-interpreter alignment effects are present; the generator's country of origin has a stronger impact on performance. We call for culturally sensitive emotion modeling in LLM-based systems to improve robustness and fairness in emotion understanding across diverse cultural contexts.

60. 【2603.29042】An Empirical Recipe for Universal Phone Recognition

Link: https://arxiv.org/abs/2603.29042

Authors: Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen

Subjects: Computation and Language (cs.CL)

Keywords: Phone recognition, speech processing tasks, low-resource speech processing, processing tasks, performance remains elusive

Comments: Submitted to Interspeech 2026. Code: https://github.com/changelinglab/PhoneticXeus

Abstract:Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS -- trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.

61. 【2603.29038】Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Link: https://arxiv.org/abs/2603.29038

Authors: Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: bypass safety measures, Anthropic Constitutional Classifiers, Fine-tuning APIs offered, Anthropic Constitutional, APIs offered

Comments:

Abstract:Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

62. 【2603.29026】On the limited utility of parallel data for learning shared multilingual representations

Link: https://arxiv.org/abs/2603.29026

Authors: Julius Leino, Jörg Tiedemann

Subjects: Computation and Language (cs.CL)

Keywords: Shared multilingual representations, Shared multilingual, parallel data, transfer across languages, tasks and knowledge

Comments:

Abstract:Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating the representation sharing in the early phases of pretraining, and to decreasing the amount of language-specific neurons in the model. Cross-lingual alignment seems to emerge on similar levels even without the explicit signal from parallel data.

63. 【2603.29025】The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Link: https://arxiv.org/abs/2603.29025

Authors: Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, salient surface cue, surface cue conflicts, language models systematically

Comments:

Abstract:Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) -- 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients -- demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.

64. 【2603.29023】Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

Link: https://arxiv.org/abs/2603.29023

Authors: Diego C. Lerma-Torres (Universidad de Guanajuato)

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, models lack persistent, Large language, language models lack, lack persistent

Comments: 14 pages, 1 figure. Accepted at the MemAgents Workshop, ICLR 2026

Abstract:Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy's belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck's cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.

65. 【2603.28929】Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection

Link: https://arxiv.org/abs/2603.28929

Authors: Abhilash Nandy

Subjects: Computation and Language (cs.CL)

Keywords: recover multiple intents, Multi-intent detection papers, Multi-intent detection, recover multiple, pairs

Comments: 6 pages, 3 tables

Abstract:Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.
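The idea of factorizing by clause can be illustrated with a toy decoder: split the utterance at connectors, label each clause with a single-intent classifier, and take the union. Everything below is a hypothetical sketch; the connector list, the keyword classifier, and the intent names are assumptions for illustration and are not taken from ClauseCompose.

```python
import re

# Hypothetical connector inventory; multi-word connectors are listed first.
CONNECTORS = r"\b(?:and then|and also|and|then|also)\b|[,;]"

def split_clauses(utterance):
    """Split an utterance into clauses at connectors and punctuation."""
    parts = re.split(CONNECTORS, utterance.lower())
    return [p.strip() for p in parts if p and p.strip()]

def detect_intents(utterance, classify_clause):
    """Label each clause independently and return the ordered union of intents."""
    intents = []
    for clause in split_clauses(utterance):
        label = classify_clause(clause)
        if label is not None and label not in intents:
            intents.append(label)
    return intents

def toy_classifier(clause):
    """Stand-in for a classifier trained only on single-intent clauses."""
    keywords = {"play": "PlayMusic", "weather": "GetWeather", "book": "BookRestaurant"}
    for word, intent in keywords.items():
        if word in clause:
            return intent
    return None

result = detect_intents(
    "Play some jazz and then check the weather, also book a table",
    toy_classifier,
)
```

Because each clause is classified on its own, an unseen combination of known intents needs no training example of that combination, which is the property the benchmark is designed to test.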

66. 【2603.28925】Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Link: https://arxiv.org/abs/2603.28925

Authors: Junsol Kim, Winnie Street, Roberta Rocca, Daine M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, suppress potentially harmful, potentially harmful forms, Large Language, fine-tuning in Large

Comments:

Abstract:Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

67. 【2603.28924】CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation

Link: https://arxiv.org/abs/2603.28924

Authors: Andrew Bouras (OMS-II Research Fellow)

Subjects: Computation and Language (cs.CL)

Keywords: evaluating hypothesis-generating models, traces connecting prior, lack explicit reasoning, connecting prior knowledge, explicit reasoning traces

Comments: 14 pages, 1 figure, 8 tables. Dataset and code available at https://github.com/andrewbouras/crosstrace

Abstract:Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.

68. 【2603.28913】From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories

Link: https://arxiv.org/abs/2603.28913

Authors: Daban Q. Jaff

Subjects: Computation and Language (cs.CL)

Keywords: Holocaust oral histories, complex discourse structure, long-form Holocaust oral, Holocaust oral, oral histories

Comments:

Abstract:Polarity detection becomes substantially more challenging under domain shift, particularly in heterogeneous, long-form narratives with complex discourse structure, such as Holocaust oral histories. This paper presents a corpus-scale diagnostic study of off-the-shelf sentiment classifiers on long-form Holocaust oral histories, using three pretrained transformer-based polarity classifiers on a corpus of 107,305 utterances and 579,013 sentences. After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability. We report pairwise percent agreement, Cohen kappa, Fleiss kappa, and row-normalized confusion matrices to localize systematic disagreement. As an auxiliary descriptive signal, a T5-based emotion classifier is applied to stratified samples from each agreement stratum to compare emotion distributions across strata. The combination of multi-model label triangulation and the ABC taxonomy provides a cautious, operational framework for characterizing where and how sentiment models diverge in sensitive historical narratives. Inter-model agreement is low to moderate overall and is driven primarily by boundary decisions around neutrality.
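For readers unfamiliar with the agreement statistics mentioned above, the following sketch computes pairwise percent agreement and Cohen's kappa for two label sequences. The labels are made-up illustrative data, not values from the paper, and the function names are hypothetical.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two classifiers assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal label rates.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Illustrative labels only.
model_1 = ["pos", "neg", "neu", "neu"]
model_2 = ["pos", "neg", "neu", "pos"]
agreement = percent_agreement(model_1, model_2)
kappa = cohen_kappa(model_1, model_2)
```

Here the two models agree on 3 of 4 items, but kappa is lower than 0.75 because some of that agreement is expected by chance given the label frequencies.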

69. 【2603.28858】OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Link: https://arxiv.org/abs/2603.28858

Authors: Haiyue Song, Masao Utiyama

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: expensive to tune, weeks of compute, training data remains, adapt LLMs, LLMs to target

Comments: Preprint, 20 pages, 10 tables, 12 figures

Abstract:Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
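The post-hoc composition described above can be sketched on toy parameter lists: compute one shift vector per dataset, then search for weights that maximize a score on the merged parameters. Note the swap: the abstract uses Bayesian optimization, while this sketch substitutes an exhaustive grid search to stay dependency-free. The function names, the two-parameter "models", and the score function are all hypothetical.

```python
from itertools import product

def distribution_vector(base, tuned):
    """Parameter shift induced by continual pre-training on one dataset."""
    return [t - b for b, t in zip(base, tuned)]

def merge(base, vectors, weights):
    """Add the weighted sum of shift vectors back onto the base parameters."""
    merged = list(base)
    for w, vec in zip(weights, vectors):
        for i, v in enumerate(vec):
            merged[i] += w * v
    return merged

def search_weights(base, vectors, score, grid=(0.0, 0.5, 1.0)):
    """Grid search over composition weights (stand-in for Bayesian optimization)."""
    best_w, best_s = None, float("-inf")
    for w in product(grid, repeat=len(vectors)):
        s = score(merge(base, vectors, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w

# Toy two-parameter "models" (hypothetical values).
base = [0.0, 0.0]
vec_a = distribution_vector(base, [1.0, 0.0])
vec_b = distribution_vector(base, [0.0, 2.0])
target = [1.0, 1.0]
score = lambda p: -sum((x - y) ** 2 for x, y in zip(p, target))
best = search_weights(base, [vec_a, vec_b], score)
```

Since the vectors are computed once, changing the score function and rerunning the search yields a different merged model without any further training, which mirrors the re-optimization property the abstract highlights.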

70. 【2603.28845】OneComp: One-Line Revolution for Generative AI Model Compression

Link: https://arxiv.org/abs/2603.28845

Authors: Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, Yudai Fujimoto, Hiroki Tokura, Yamato Arai, Yoshiyuki Ishii, Yusei Kawakami, Genki Shikada, Achille Jacquemond, Yoshihiko Fujisawa, Katsuki Fujisawa, Takumi Honda, Akira Sakai

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

Keywords: Deploying foundation models, Deploying foundation, memory footprint, increasingly constrained, constrained by memory

Comments: 31 pages, 6 figures

Abstract:Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.

71. 【2603.28795】StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving

Link: https://arxiv.org/abs/2603.28795

Authors: Azam Nouri

Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: address LLM serving, LLM serving workloads, common solution structure, address LLM, LLM serving

Comments: 9 pages, 1 figure

Abstract:We address LLM serving workloads where repeated requests share a common solution structure but differ in localized constraints, such as output schema, variable names, or numeric constants. Prior caching approaches typically reuse either full responses (semantic caching) or model-internal KV/prefix states, which are respectively brittle under partial changes or tightly coupled to specific backends. We present StepCache, a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, verifies steps using lightweight task-aware checks, and regenerates only failing regions via selective patching. StepCache additionally supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, as well as conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction via a bounded repair loop with a deterministic fallback that guarantees correctness when the backend model fails. In a CPU-only perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse.
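The verify-then-patch loop at the core of step-level reuse can be sketched as follows. This is a minimal illustration under stated assumptions: cache retrieval is omitted, and `verify` and `regenerate` are hypothetical callbacks standing in for the task-aware checks and the backend model call.

```python
def reuse_with_patching(cached_steps, verify, regenerate):
    """Reuse cached steps that pass verification; regenerate only the failures.

    Returns the stitched output and the indices of the patched steps.
    """
    output, patched = [], []
    for index, step in enumerate(cached_steps):
        if verify(index, step):
            output.append(step)           # fast path: reuse as-is
        else:
            output.append(regenerate(index, step))
            patched.append(index)         # selective patching
    return output, patched

# Cached solution for a similar linear equation (hypothetical data).
cached = [
    "Let the unknown be x.",
    "Subtract 3 from both sides.",
    "Divide by 2 to get x = 4.",
]

# The new request differs only in a numeric constant, so the answer is x = 5.
def verify(index, step):
    return "x =" not in step or "x = 5" in step

def regenerate(index, step):
    return "Divide by 2 to get x = 5."   # stand-in for a model call

steps, patched = reuse_with_patching(cached, verify, regenerate)
```

Only the final step is regenerated here, so two of three steps come from the cache. A request where every step passes would take the reuse-only path with no model call at all.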

ACM classes: I.2.7; H.3.4; C.2.4

72. 【2603.28773】UltRAG: a Universal Simple Scalable Recipe for Knowledge Graph RAG

Link: https://arxiv.org/abs/2603.28773

Authors: Dobrik Georgiev,Kheeran Naidu,Alberto Cattaneo,Federico Monti,Carlo Luschi,Daniel Justus

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: frequently generate confident, factually incorrect content, Large language models, frequently generate, Large language

Comments:

Abstract:Large language models (LLMs) frequently generate confident yet factually incorrect content when used for language generation (a phenomenon often known as hallucination). Retrieval augmented generation (RAG) tries to reduce factual errors by identifying information in a knowledge corpus and putting it in the context window of the model. While this approach is well-established for document-structured data, it is non-trivial to adapt it for Knowledge Graphs (KGs), especially for queries that require multi-node/multi-hop reasoning on graphs. We introduce ULTRAG, a general framework for retrieving information from Knowledge Graphs that shifts away from classical RAG. By endowing LLMs with off-the-shelf neural query executing modules, we highlight how readily available language models can achieve state-of-the-art results on Knowledge Graph Question Answering (KGQA) tasks without any retraining of the LLM or executor involved. In our experiments, ULTRAG achieves better performance when compared to state-of-the-art KG-RAG solutions, and it enables language models to interface with Wikidata-scale graphs (116M entities, 1.6B relations) at comparable or lower costs.

73. 【2603.28769】Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

Link: https://arxiv.org/abs/2603.28769

Authors: Subhadip Mitra

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Evaluating large language, Evaluating large, large language models, large language, remains a practical

Comments: 16 pages, 2 figures, 6 tables. Open source: [this https URL](https://github.com/bassrehab/spark-llm-eval). Cross-list requested: cs.CL, cs.LG

Abstract:Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
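
To make "every reported metric includes bootstrap confidence intervals" concrete, here is a minimal single-machine sketch of a percentile bootstrap over per-example scores. The abstract does not say which bootstrap variant the framework uses, so the percentile method, the `bootstrap_ci` name, and the 80/20 toy data are all assumptions; this is plain Python, not Spark code from the project.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    scores: List[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n examples with replacement and record the resampled mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return sum(scores) / n, lower, upper


# Toy data (assumed): 80 correct and 20 incorrect answers.
per_example_correct = [1.0] * 80 + [0.0] * 20
mean, lower, upper = bootstrap_ci(per_example_correct)
print(f"accuracy = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

In a distributed setting the resampling would be spread across executors; the sketch only shows the statistical step.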

74. 【2603.27006】The Last Fingerprint: How Markdown Training Shapes LLM Prose

Link: https://arxiv.org/abs/2603.27006

Authors: E. M. Freeburg

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large language models, widely discussed markers, Large language, varying rates, AI-generated text

Comments: 14 pages, 3 tables. Code and data: [this https URL](https://github.com/emfreeburg/the-last-fingerprint)

Abstract:Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose -- the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist -- except in Meta's Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.
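
The rate quoted in this abstract (em dashes per 1,000 words) is simple to compute. The sketch below is one plausible way to do it, assuming whitespace-separated tokens count as words; the paper's exact counting rules are not given in the abstract, and `em_dashes_per_1000_words` is a hypothetical helper name.

```python
def em_dashes_per_1000_words(text: str) -> float:
    """Count em dashes (U+2014) per 1,000 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return 1000.0 * text.count("\u2014") / len(words)


# Toy sentence (assumed) containing two em dashes, written as escapes.
sample = "The result\u2014surprisingly\u2014held across all twelve models tested."
rate = em_dashes_per_1000_words(sample)
print(round(rate, 1))
```

Because an unspaced em dash joins its neighbours into one whitespace token, the word count here is slightly lower than a linguistic tokenizer would give; a real analysis would need to fix that convention up front.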

75. 【2603.29617】Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems

Link: https://arxiv.org/abs/2603.29617

Authors: Pegah Ramezani,Thomas Kinfe,Andreas Maier,Achim Schilling,Patrick Krauss

Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: brain processes linguistic, brain processes, central challenge, challenge in cognitive, cognitive neuroscience

Comments:

Abstract:Understanding how the brain processes linguistic constructions is a central challenge in cognitive neuroscience and linguistics. Recent computational studies show that artificial neural language models spontaneously develop differentiated representations of Argument Structure Constructions (ASCs), generating predictions about when and how construction-level information emerges during processing. The present study tests these predictions in human neural activity using electroencephalography (EEG). Ten native English speakers listened to 200 synthetically generated sentences across four construction types (transitive, ditransitive, caused-motion, resultative) while neural responses were recorded. Analyses using time-frequency methods, feature extraction, and machine learning classification revealed construction-specific neural signatures emerging primarily at sentence-final positions, where argument structure becomes fully disambiguated, and most prominently in the alpha band. Pairwise classification showed reliable differentiation, especially between ditransitive and resultative constructions, while other pairs overlapped. Crucially, the temporal emergence and similarity structure of these effects mirror patterns in recurrent and transformer-based language models, where constructional representations arise during integrative processing stages. These findings support the view that linguistic constructions are neurally encoded as distinct form-meaning mappings, in line with Construction Grammar, and suggest convergence between biological and artificial systems on similar representational solutions. More broadly, this convergence is consistent with the idea that learning systems discover stable regions within an underlying representational landscape - recently termed a Platonic representational space - that constrains the emergence of efficient linguistic abstractions.

76. 【2603.29217】Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition

Link: https://arxiv.org/abs/2603.29217

Authors: Lukuang Dong,Ziwei Li,Saierdaer Yusuyin,Xianyu Zhao,Zhijian Ou

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Phoneme-based ASR factorizes, ASR factorizes recognition, Phoneme-based ASR, enabling cross-lingual acoustic, cross-lingual acoustic sharing

Comments: Update after INTERSPEECH2026 submission

Abstract:Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.

Information Retrieval

1. 【2603.29979】Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior

Link: https://arxiv.org/abs/2603.29979

Authors: Junwei Yu,Mufeng Yang,Yepeng Ding,Hiroyuki Sato

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: direct answer generation, AI-powered search engines, traditional link-based retrieval, Generative Engine Optimization, Generative Engine

Comments: 12 pages, 5 figures. This paper proposes GEO-SFE, a structural feature engineering framework for generative engine optimization

ACM classes: H.3.3; I.2.7

2. 【2603.29937】Rewrite the News: Tracing Editorial Reuse Across News Agencies

Link: https://arxiv.org/abs/2603.29937

Authors: Soveatin Kuntur,Nina Smirnova,Anna Wroblewska,Philipp Mayr,Sebastijan Razboršek Maček

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: paper investigates sentence-level, investigates sentence-level text, Slovenian Press Agency, multilingual journalism, paper investigates

Comments: The paper is accepted to SoCon-NLPSI 2026: Social Context (SoCon) and Integrating NLP and Psychology to Study Social Interactions (NLPSI) workshop co-located with LREC 2026

3. 【2603.29897】UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Link: https://arxiv.org/abs/2603.29897

Authors: Yupei Yang,Lin Yang,Wanxi Deng,Lin Qu,Shikui Tu,Lei Xu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: critical component, reranking remains challenging, multimodal reranking remains, information retrieval pipelines, Reranking

Comments:

Abstract:Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.

4. 【2603.29881】A Hybrid Machine Learning Approach for Graduate Admission Prediction and Combined University-Program Recommendation

Link: https://arxiv.org/abs/2603.29881

Authors: Melina Heidari Far,Elham Tabrizi

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: graduate admission prediction, World University Rankings, Graduate admissions, graduate admission, Wikidata SPARQL queries

Comments:

Abstract:Graduate admissions have become increasingly competitive. This study highlights the need for a hybrid machine learning framework for graduate admission prediction, focusing on high-quality similar applicants and a recommendation system. The dataset, collected and enriched by the authors, includes 13,000 self-reported GradCafe application records from 2021 to 2025, enriched with features from the OpenAlex API, QS World University Rankings by Subject, and Wikidata SPARQL queries. A hybrid model was developed by combining XGBoost with a residual refinement k-nearest neighbors module, achieving 87% accuracy on the test set. A recommendation module, then built on the model for rejected applicants, provided targeted university and program alternatives, resulting in actionable guidance and improving expected acceptance probability by 70%. The results indicate that university quality metrics strongly influence admission decisions in competitive applicant pools. The features used in the study include applicant quality metrics, university quality metrics, program-level metrics, and interaction features.

5. 【2603.29878】Performance Evaluation of LLMs in Automated RDF Knowledge Graph Generation

Link: https://arxiv.org/abs/2603.29878

Authors: Ioana Ramona Martin,Tudor Cioara,Ionut Anghel,Gabriel Arcas

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: Cloud systems generate, systems generate large, heterogeneous log data, Large Language Models, critical infrastructure

Comments: submitted to journal

Abstract:Cloud systems generate large, heterogeneous log data containing critical infrastructure, application, and security information. Transforming these logs into RDF triples enables their integration into knowledge graphs, improving interpretability, root-cause analysis, and cross-service reasoning beyond what raw logs allow. Large Language Models (LLMs) offer a promising approach to automate RDF knowledge graph generation; however, their effectiveness on complex cloud logs remains largely unexplored. In this paper, we evaluate multiple LLM architectures and prompting strategies for automated RDF extraction using a controlled framework with two pipelines for systematically processing semi-structured log data. The extraction pipeline integrates multiple LLMs to identify relevant entities and relationships, automatically generating subject-predicate-object triples. These outputs are evaluated using a dedicated validation pipeline with both syntactic and semantic metrics to assess accuracy, completeness, and quality. Due to the lack of public ground-truth datasets, we created a reference Log-to-KG dataset from OpenStack logs using manual annotation and ontology-driven methods, enabling objective baseline. Our analysis shows that Few-Shot learning is the most effective strategy, with Llama achieving a 99.35% F1 score and 100% valid RDF output while Qwen, NuExtract, and Gemma also perform well under Few-Shot prompting, with Chain-of-Thought approaches maintaining similar accuracy. One-Shot prompting offers a lighter but effective alternative, while Zero-Shot and advanced strategies such as Tree-of-Thought, Self-Critique, and Generate-Multiple perform substantially worse. These results highlight the importance of contextual examples and prompt design for accurate RDF extraction and reveal model-specific limitations across LLM architectures.

6. 【2603.29875】UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

Link: https://arxiv.org/abs/2603.29875

Authors: Ryszard Tuora,Mateusz Galiński,Michał Godziszewski,Michał Karpowicz,Mateusz Czyżnikiewicz,Adam Kozakiewicz,Tomasz Ziętkiewicz

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: chunk-based retrieval pipelines, atomic objects, retrieval pipelines represent, Retrieval-augmented generation, pipelines represent

Comments:

7. 【2603.29845】Cold-Starts in Generative Recommendation: A Reproducibility Study

Link: https://arxiv.org/abs/2603.29845

Authors: Zhen Zhang,Jujia Zhao,Xinyu Ma,Xin Xin,Maarten de Rijke,Zhaochun Ren

Categories: Information Retrieval (cs.IR)

Keywords: recommend newly introduced, missing interaction signals, newly introduced items, newly registered users, open-world platforms

Comments:

Abstract:Cold-start recommendation remains a central challenge in dynamic, open-world platforms, requiring models to recommend for newly registered users (user cold-start) and to recommend newly introduced items to existing users (item cold-start) under sparse or missing interaction signals. Recent generative recommenders built on pre-trained language models (PLMs) are often expected to mitigate cold-start by using item semantic information (e.g., titles and descriptions) and test-time conditioning on limited user context. However, cold-start is rarely treated as a primary evaluation setting in existing studies, and reported gains are difficult to interpret because key design choices, such as model scale, identifier design, and training strategy, are frequently changed together. In this work, we present a systematic reproducibility study of generative recommendation under a unified suite of cold-start protocols.

8. 【2603.29705】Drift-Aware Continual Tokenization for Generative Recommendation

Link: https://arxiv.org/abs/2603.29705

Authors: Yuebo Feng,Jiahao Liu,Mingzhe Han,Dongsheng Li,Hansu Gu,Peng Zhang,Tun Lu,Ning Gu

Categories: Information Retrieval (cs.IR)

Keywords: generative recommender model, autoregressive generative recommender, Generative recommendation commonly, performs prediction based, recommendation commonly adopts

Comments:

Abstract:Generative recommendation commonly adopts a two-stage pipeline in which a learnable tokenizer maps items to discrete token sequences (i.e. identifiers) and an autoregressive generative recommender model (GRM) performs prediction based on these identifiers. Recent tokenizers further incorporate collaborative signals so that items with similar user-behavior patterns receive similar codes, substantially improving recommendation quality. However, real-world environments evolve continuously: new items cause identifier collision and shifts, while new interactions induce collaborative drift in existing items (e.g., changing co-occurrence patterns and popularity). Fully retraining both tokenizer and GRM is often prohibitively expensive, yet naively fine-tuning the tokenizer can alter token sequences for the majority of existing items, undermining the GRM's learned token-embedding alignment. To balance plasticity and stability for collaborative tokenizers, we propose DACT, a Drift-Aware Continual Tokenization framework with two stages: (i) tokenizer fine-tuning, augmented with a jointly trained Collaborative Drift Identification Module (CDIM) that outputs item-level drift confidence and enables differentiated optimization for drifting and stationary items; and (ii) hierarchical code reassignment using a relaxed-to-strict strategy to update token sequences while limiting unnecessary changes. Experiments on three real-world datasets with two representative GRMs show that DACT consistently achieves better performance than baselines, demonstrating effective adaptation to collaborative evolution with reduced disruption to prior knowledge. Our implementation is publicly available at this https URL for reproducibility.

9. 【2603.29661】Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models

Link: https://arxiv.org/abs/2603.29661

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Existing narrative extraction, Existing narrative, Narrative Maps supports, face a trade-off, Maps supports rich

Comments: Text2Story Workshop 2026 at ECIR 2026

10. 【2603.29651】Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation

Link: https://arxiv.org/abs/2603.29651

Authors: Brian Felipe Keith-Norambuena,Fausto German,Eric Krokos,Sarah Joseph,Chris North

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Semantic interaction, incorporate their cognitive, cognitive processes, narrative map, narrative

Comments: Text2Story Workshop 2026 at ECIR 2026

11. 【2603.29631】Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

Link: https://arxiv.org/abs/2603.29631

Authors: Sherif Abdelwahab

Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)

Keywords: Always-on edge cameras, edge cameras generate, cameras generate continuous, generate continuous video, continuous video streams

Comments: 6 pages, 3 figures, 5 tables; supplementary video included as ancillary file

Abstract:Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
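
A single-pass epsilon-net style filter of the kind this abstract mentions can be sketched as follows: a frame is kept only if its embedding is at least epsilon away from every embedding kept so far. The cosine distance, the `epsilon=0.1` threshold, the function names, and the 2-D toy embeddings are assumptions for illustration; the abstract does not specify the paper's metric or threshold.

```python
import math
from typing import Iterable, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def novelty_filter(embeddings: Iterable[Sequence[float]], epsilon: float) -> List[int]:
    """Single pass: keep a frame only if it is at least `epsilon` away
    (cosine distance) from every frame kept so far. Returns kept indices."""
    kept_indices: List[int] = []
    kept_vectors: List[Sequence[float]] = []
    for index, vector in enumerate(embeddings):
        if all(cosine_distance(vector, k) >= epsilon for k in kept_vectors):
            kept_indices.append(index)
            kept_vectors.append(vector)
    return kept_indices


# Toy 2-D "frame embeddings" (assumed values).
frames = [
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of frame 0
    [0.0, 1.0],    # novel direction
    [0.02, 0.98],  # near-duplicate of frame 2
]
kept = novelty_filter(frames, epsilon=0.1)
print(kept)
```

The linear scan over kept vectors keeps the sketch short; an on-device implementation would presumably bound the index size or use an approximate nearest-neighbour lookup.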

12. 【2603.29519】On Strengths and Limitations of Single-Vector Embeddings

Link: https://arxiv.org/abs/2603.29519

Authors: Archish S,Mihir Agarwal,Ankit Garg,Neeraj Kayal,Kirankumar Shiragur

Categories: Information Retrieval (cs.IR)

Keywords: dataset called LIMIT, Recent work, called LIMIT, LIMIT and showed, models suffer substantial

Comments:

Abstract:Recent work (Weller et al., 2025) introduced a naturalistic dataset called LIMIT and showed empirically that a wide range of popular single-vector embedding models suffer substantial drops in retrieval quality, raising concerns about the reliability of single-vector embeddings for retrieval. Although (Weller et al., 2025) proposed limited dimensionality as the main factor contributing to this, we show that dimensionality alone cannot explain the observed failures. We observe from results in (Alon et al., 2016) that $2k+1$-dimensional vector embeddings suffice for top-$k$ retrieval. This result points to other drivers of poor performance. Controlling for tokenization artifacts and linguistic similarity between attributes yields only modest gains. In contrast, we find that domain shift and misalignment between embedding similarities and the task's underlying notion of relevance are major contributors; finetuning mitigates these effects and can improve recall substantially. Even with finetuning, however, single-vector models remain markedly weaker than multi-vector representations, pointing to fundamental limitations. Moreover, finetuning single-vector models on LIMIT-like datasets leads to catastrophic forgetting (performance on MSMARCO drops by more than 40%), whereas forgetting for multi-vector models is minimal. To better understand the gap between performance of single-vector and multi-vector models, we study the drowning in documents paradox (Reimers & Gurevych, 2021; Jacob et al., 2025): as the corpus grows, relevant documents are increasingly "drowned out" because embedding similarities behave, in part, like noisy statistical proxies for relevance. Through experiments and mathematical calculations on toy mathematical models, we illustrate why single-vector models are more susceptible to drowning effects compared to multi-vector models.

13. 【2603.29259】Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

Link: https://arxiv.org/abs/2603.29259

Authors: Hejin Huang,Jusheng Zhang,Kaitong Cai,Jian Wang,Rong Pan

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Preference-based alignment objectives, RLHF-style pairwise learning, large language models, Preference-based alignment, widely adopted

Comments:

14. 【2603.29093】APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Link: https://arxiv.org/abs/2603.29093

Authors: Pratyay Banerjee,Masud Moshtaghi,Ankit Chadha

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: LLM-based autonomous agents, autonomous agents lack, agents lack persistent, structurally identical tasks, lack persistent procedural

Comments: 17 pages, 13 figures

15. 【2603.28994】Zero-shot Cross-domain Knowledge Distillation: A Case study on YouTube Music

Link: https://arxiv.org/abs/2603.28994

Authors: Srivaths Ranganathan,Nikhil Khani,Shawn Andrews,Chieh Lo,Li Wei,Gergo Varady,Jochen Klingenhoefer,Tim Steele,Bernardo Cunha,Aniruddh Nath,Yanwei Song

Categories: Information Retrieval (cs.IR)

Keywords: sensitive models serving, latency sensitive models, Knowledge Distillation, quality of latency, latency sensitive

Comments:

Abstract:Knowledge Distillation (KD) has been widely used to improve the quality of latency sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can significantly differ. We present a case study of using zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a (100x) large-scale video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings evaluating different KD techniques in this setting across two ranking models on the music app. Our results demonstrate that zero-shot cross-domain KD is a practical and effective approach to improve the performance of ranking models on low traffic surfaces.

16. 【2603.28886】Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

Link: https://arxiv.org/abs/2603.28886

Authors: Andre Bacellar

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Graph-augmented retrieval combines, combines dense similarity, graph-based relevance signals, Personalized PageRank, retrieval combines dense

Comments: 10 pages, 5 figures

Abstract:Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration is a robust design choice, and the exact post-calibration operator appears to matter less on these benchmarks.
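
The calibration idea in this abstract (put vector and graph scores on a common unit-free scale before combining them) can be illustrated with a small sketch. It uses an empirical percentile rank as a stand-in for the paper's percentile-rank normalization and a plain weighted sum for fusion; the function names, the 0.5 weight, and the toy scores are assumptions, and PhaseGraph's exact transform may differ.

```python
from typing import Dict, List


def percentile_ranks(scores: List[float]) -> List[float]:
    """Map each score to the fraction of scores less than or equal to it."""
    n = len(scores)
    return [sum(1 for other in scores if other <= s) / n for s in scores]


def fuse(
    vector_scores: Dict[str, float],
    graph_scores: Dict[str, float],
    weight: float = 0.5,
) -> List[str]:
    """Rank candidates by a weighted sum of percentile-normalized scores."""
    docs = sorted(vector_scores)  # assumes both dicts share the same keys
    v = percentile_ranks([vector_scores[d] for d in docs])
    g = percentile_ranks([graph_scores[d] for d in docs])
    fused = {d: weight * v[i] + (1.0 - weight) * g[i] for i, d in enumerate(docs)}
    return sorted(docs, key=lambda d: fused[d], reverse=True)


# Toy scores (assumed): cosine similarities vs. much smaller PPR masses.
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.35}
graph_scores = {"a": 0.001, "b": 0.020, "c": 0.004}
ranking = fuse(vector_scores, graph_scores)
print(ranking)
```

Summing the raw scores here would let the cosine similarities dominate because the PPR values are orders of magnitude smaller; after rank normalization both signals contribute on the same scale.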

ACM classes: H.3.3

17. 【2603.28773】UltRAG: a Universal Simple Scalable Recipe for Knowledge Graph RAG

Link: https://arxiv.org/abs/2603.28773

Authors: Dobrik Georgiev,Kheeran Naidu,Alberto Cattaneo,Federico Monti,Carlo Luschi,Daniel Justus

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: frequently generate confident, factually incorrect content, Large language models, frequently generate, Large language

Comments:

Computer Vision

1. 【2603.30045】OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation

Link: https://arxiv.org/abs/2603.30045

Authors: Yuheng Liu,Xin Lin,Xinke Li,Baihan Yang,Chen Wang,Kalyan Sunkavalli,Yannick Hold-Geoffroy,Hao Tan,Kai Zhang,Xiaohui Xie,Zifan Shi,Yiwei Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: garnered growing research, growing research interest, Modeling scenes, recent years, video generation

Comments: Code is available at [this https URL](https://github.com/yuhengliu02/OmniRoam)

Abstract:Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at this https URL.

2. 【2603.30043】Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Link: https://arxiv.org/abs/2603.30043

Authors: Kaleb Newman, Tyler Zhu, Olga Russakovsky

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models exhibit emergent, exhibit emergent reasoning, exhibit emergent, diffusion models exhibit, Video diffusion models

Comments:

Abstract:Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Our second finding is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce Chaining with Early Planning, or ChEaP, which only spends compute on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.

3. 【2603.30038】Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

Link: https://arxiv.org/abs/2603.30038

Authors: Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rapidly reshaped software, reshaped software practice, produce correct code, rapidly reshaped, reshaped software

Comments: Accepted by CVPR 2026; Project page: [this https URL](https://geocodebench.github.io/)

Abstract:AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.

4. 【2603.30008】Conditional Polarization Guidance for Camouflaged Object Detection

Link: https://arxiv.org/abs/2603.30008

Authors: QIfan Zhang, Hao Wang, Xiangrong Qin, Ruijie Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Camouflaged object detection, object detection, Camouflaged object, aims to identify, identify targets

Comments: 11 pages, 10 figures, 4 tables

Abstract:Camouflaged object detection (COD) aims to identify targets that are highly blended with their backgrounds. Recent works have shown that the optical characteristics of polarization cues play a significant role in improving camouflaged object detection. However, most existing polarization-based approaches depend on complex visual encoders and fusion mechanisms, leading to increased model complexity and computational overhead, while failing to fully explore how polarization can explicitly guide hierarchical RGB representation learning. To address these limitations, we propose CPGNet, an asymmetric RGB-polarization framework that introduces a conditional polarization guidance mechanism to explicitly regulate RGB feature learning for camouflaged object detection. Specifically, we design a lightweight polarization interaction module that jointly models these complementary cues and generates reliable polarization guidance in a unified manner. Unlike conventional feature fusion strategies, the proposed conditional guidance mechanism dynamically modulates RGB features using polarization priors, enabling the network to focus on subtle discrepancies between camouflaged objects and their backgrounds. Furthermore, we introduce a polarization edge-guided frequency refinement strategy that enhances high-frequency components under polarization constraints, effectively breaking camouflage patterns. Finally, we develop an iterative feedback decoder to perform coarse-to-fine feature calibration and progressively refine camouflage prediction. Extensive experiments on polarization datasets across multiple tasks, along with evaluations on non-polarization datasets, demonstrate that CPGNet consistently outperforms state-of-the-art methods.

5. 【2603.29990】SurgNavAR: An Augmented Reality Surgical Navigation Framework for Optical See-Through Head Mounted Displays

Link: https://arxiv.org/abs/2603.29990

Authors: Abdullah Thabit, Mohamed Benmahdjoub, Rafiuddin Jinabade, Hizirwan S. Salim, Marie-Lise C. van Veelen, Mark G. van Vledder, Eppo B. Wolvius, Theo van Walsum

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preoperative imaging data, head mounted displays, Augmented reality, imaging data, preoperative imaging

Comments: This work has been submitted to the IEEE for possible publication

Abstract:Augmented reality (AR) devices with head mounted displays (HMDs) facilitate the direct superimposition of 3D preoperative imaging data onto the patient during surgery. To use an HMD-AR device as a stand-alone surgical navigation system, the device should be able to locate the patient and surgical instruments, align preoperative imaging data with the patient, and visualize navigation data in real time during surgery. Whereas some of the technologies required for this are known, integration in such devices is cumbersome and requires specific knowledge and expertise, hampering scientific progress in this field. This work therefore aims to present and evaluate an integrated HMD-based AR surgical navigation framework that is adaptable to diverse surgical applications. The framework tracks 2D patterns as reference markers attached to the patient and surgical instruments. It allows for the calibration of surgical tools using pivot and reference-based calibration techniques. It enables image-to-patient registration using point-based matching and manual positioning. The integrated functionalities of the framework are evaluated on two HMD devices, the HoloLens 2 and Magic Leap 2, with two surgical use cases being evaluated in a phantom setup: AR-guided needle insertion and rib fracture localization. The framework was able to achieve a mean tooltip calibration accuracy of 1 mm, a registration accuracy of 3 mm, and a targeting accuracy below 5 mm on the two surgical use cases. The framework presents an easy-to-use configurable tool for HMD-based AR surgical navigation, which can be extended and adapted to many surgical applications. The framework is publicly available at this https URL.

6. 【2603.29968】Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI

Link: https://arxiv.org/abs/2603.29968

Authors: Iain Swift, JingHua Ye

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Attenuated Inversion Recovery, frameworks remains unexplored, Fluid Attenuated Inversion, unified survival frameworks, survival frameworks remains

Comments: 6 pages, 1 figure, submitted to the IEEE CBMS 2026 conference, still waiting for notification

Abstract:Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled $\Delta$CS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400,1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.

7. 【2603.29967】Learning Structural-Functional Brain Representations through Multi-Scale Adaptive Graph Attention for Cognitive Insight

Link: https://arxiv.org/abs/2603.29967

Authors: Badhan Mazumder, Sir-Lord Wiafe, Aline Kotoski, Vince D. Calhoun, Dong Hye Ye

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: connectome capture complementary, capture complementary aspects, functional connectome capture, Multi-scale Adaptive Graph, introduced Multi-scale Adaptive

Comments: Preprint version of the paper accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026). This is the author's accepted manuscript. The final published version will appear in IEEE Xplore

Abstract:Understanding how brain structure and function interact is key to explaining intelligence, yet modeling them jointly is challenging because the structural and functional connectomes capture complementary aspects of organization. We introduced Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style graph neural network framework that adaptively learns structure-function interactions. MAGNet leverages source-based morphometry from structural MRI to extract inter-regional morphological features and fuses them with functional network connectivity from resting-state fMRI. A hybrid graph integrates direct and indirect pathways, while local-global attention refines connectivity importance and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end. On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing our understanding of cognitive function.

8. 【2603.29966】Scaling Video Pretraining for Surgical Foundation Models

Link: https://arxiv.org/abs/2603.29966

Authors: Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Surgical video understanding, limited data scale, models remain constrained, Surgical video, existing surgical foundation

Comments:

Abstract:Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.

9. 【2603.29962】SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Link: https://arxiv.org/abs/2603.29962

Authors: Shi Li(1), Vinkle Srivastav(1), Nicolas Chanel(1), Saurav Sharma(1), Nabani Banik(1), Lorenzo Arboit(1), Kun Yuan(1), Pietro Mascagni(1 and 2), Nicolas Padoy(1) ((1) University of Strasbourg, CNRS, INSERM, ICube, Strasbourg, France, (2) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy)

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring extensive expertise, evolving intraoperative scenes, navigate evolving intraoperative, Surgical video question, surgical VQA

Comments: 29 pages, 14 figures, 9 tables

Abstract:Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy -- Perception, Assessment, and Reasoning -- spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

10. 【2603.29960】NeuroBRIDGE: Behavior-Conditioned Koopman Dynamics with Riemannian Alignment for Early Substance Use Initiation Prediction from Longitudinal Functional Connectome

Link: https://arxiv.org/abs/2603.29960

Authors: Badhan Mazumder, Sir-Lord Wiafe, Vince D. Calhoun, Dong Hye Ye

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predictors treat connectivity, Early identification, brain networks change, RIemannian Koopman Dynamics, substance use initiation

Abstract:Early identification of adolescents at risk for substance use initiation (SUI) is vital yet difficult, as most predictors treat connectivity as static or cross-sectional and miss how brain networks change over time and with behavior. We proposed NeuroBRIDGE (Behavior conditioned RIemannian Koopman Dynamics on lonGitudinal connEctomes), a novel graph neural network-based framework that aligns longitudinal functional connectome in a Riemannian tangent space and couples dual-time attention with behavioral-conditioned Koopman dynamics to capture temporal change. Evaluated on ABCD, NeuroBRIDGE improved future SUI prediction over relevant baselines while offering interpretable insights into neural pathways, refining our understanding of neurodevelopmental risk and informing targeted prevention.

11. 【2603.29954】Detecting Unknown Objects via Energy-based Separation for Open World Object Detection

Link: https://arxiv.org/abs/2603.29954

Authors: Jun-Woo Heo, Keonhee Park, Gyeong-Moon Park

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open World Object, World Object Detection, Open World, World Object, unknown objects

Comments: 8 pages, Accepted at CVPR 2026

Abstract:In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.

12. 【2603.29943】EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Link: https://arxiv.org/abs/2603.29943

Authors: Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, Yutaka Matsuo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long videos remains, computer vision, underexplored challenge, challenge in computer, Counting

Comments: The first two authors contributed equally. The data and code are publicly available at: [this https URL](https://github.com/matsuolab/EC-Bench)

Abstract:Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.

13. 【2603.29941】Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Link: https://arxiv.org/abs/2603.29941

Authors: Vanessa Emanuela Guarino, Claudia Winklmayr, Jannik Franzen, Josef Lorenz Rumberger, Manuel Pfeuffer, Sonja Greven, Klaus Maier-Hein, Carsten T. Lüth, Christoph Karg, Dagmar Kainmueller

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: biomedical image analysis, Uncertainty Quantification, autonomous driving, crucial for ensuring, ensuring the reliability

Comments: 27 pages, 13 figures, 6 tables. Accepted at CVPR 2026 (The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026)

Abstract:Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.
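To make the contrast between Global Average and a spatially-aware alternative concrete, here is a minimal sketch. It does not reproduce the aggregators proposed in the paper: `max_patch_average` is a generic patch-based strategy chosen for illustration, and the function names, the `patch` size, and the two toy uncertainty maps are assumptions.

```python
def global_average(umap):
    """Image-level score as the mean of all pixel-wise uncertainties."""
    values = [v for row in umap for v in row]
    return sum(values) / len(values)


def max_patch_average(umap, patch=2):
    """Image-level score as the highest mean uncertainty over any patch x patch window.

    A generic patch-based aggregator used only for illustration.
    """
    height, width = len(umap), len(umap[0])
    best = 0.0
    for top in range(height - patch + 1):
        for left in range(width - patch + 1):
            total = sum(
                umap[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            best = max(best, total / (patch * patch))
    return best


# Two hypothetical 4x4 uncertainty maps with the same total uncertainty:
# one concentrated in a 2x2 corner, one spread evenly over the image.
concentrated = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
spread = [[0.125] * 4 for _ in range(4)]

print(global_average(concentrated), global_average(spread))        # 0.125 0.125
print(max_patch_average(concentrated), max_patch_average(spread))  # 0.5 0.125
```

Global Average gives both maps the same score, while the patch-based score separates the localized cluster of uncertain pixels from the diffuse one, which is the kind of spatial structure the abstract says the default aggregation ignores.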

14. 【2603.29931】Gloria: Consistent Character Video Generation via Content Anchors

Link: https://arxiv.org/abs/2603.29931

Authors: Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: identity remains challenging, consistent multi-view appearance, Digital characters, modern media, consistent multi-view

Comments: Accepted by CVPR2026 Main, project: [this https URL](https://yyvhang.github.io/Gloria_Page/)

Abstract:Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.

15. 【2603.29927】End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines

Link: https://arxiv.org/abs/2603.29927

Authors: Raül Pérez-Gonzalo, Andreas Espersen, Søren Forchhammer, Antonio Agudo

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Transferring large volumes, Transferring large, detecting severe defects, large volumes, volumes of high-resolution

Comments: Accepted to TNNLS 2026

Abstract:Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.

16. 【2603.29924】Abstraction in Style

Link: https://arxiv.org/abs/2603.29924

Authors: Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: involving deliberate reinterpretation, Artistic styles, involving deliberate, texture or color, abstraction

Comments: SIGGRAPH 2026 conditionally accepted paper

Abstract:Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target's structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

17. 【2603.29922】Training deep learning based dynamic MR image reconstruction using synthetic fractals

Link: https://arxiv.org/abs/2603.29922

Authors: Anirudh Raman, Olivier Jaubert, Mark Wrobel, Tina Yao, Ruaraidh Campbell, Rebecca Baker, Ruta Virsinskaite, Daniel Knight, Michael Quail, Jennifer Steeden, Vivek Muthurangu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: cardiac MRI data, cardiac MRI, MRI, train deep learning, synthetically generated fractal

Comments:

Abstract:Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement. Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.

18. 【2603.29917】Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Link: https://arxiv.org/abs/2603.29917

Authors: Hiba Adil Al-kharsan, Róbert Rajkó

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion-driven feature denoising, combines diffusion-driven feature, Nonnegative Matrix Factorization, combines diffusion-driven, Matrix Factorization

Comments:

Abstract:This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into a compact, interpretable representation using Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. To improve robustness, a stepwise diffusion operation is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and reconstruct clean representations from perturbed inputs. The denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
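
The abstract describes gradually adding Gaussian noise to a feature vector and does not give the schedule. The sketch below assumes a DDPM-style forward process with a linear variance schedule, which may differ from what the authors use. The function names, the schedule endpoints and the feature vector are all hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (illustrative values, an assumption)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_features(features, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in features]

rng = random.Random(0)  # seeded so the sketch is repeatable
alpha_bars = cumulative_alphas(linear_beta_schedule(num_steps=10))
hybrid_features = [0.8, 0.1, 0.0, 1.3]  # hypothetical NNMF + CNN feature vector
noisy = noise_features(hybrid_features, alpha_bars[-1], rng)
print(noisy)
```

A denoiser would then be trained to map `noisy` back to `hybrid_features`; that network is outside the scope of this sketch.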

19. 【2603.29901】Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Link: https://arxiv.org/abs/2603.29901

Authors: Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: concise clinical impressions, Automated radiology report, strong text-only baselines, distill verbose findings, IMPRESSION transformation

Comments:

20. 【2603.29860】GENIE: Gram-Eigenmode INR Editing with Closed-Form Geometry Updates

Link: https://arxiv.org/abs/2603.29860

Authors: Samundra Karki, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Implicit Neural Representations, provide compact models, Implicit Neural, Neural Representations, provide compact

Comments: 9 pages, 9 figures

Abstract:Implicit Neural Representations (INRs) provide compact models of geometry, but it is unclear when their learned shapes can be edited without retraining. We show that the Gram operator induced by the INR's penultimate features admits deformation eigenmodes that parameterize a family of realizable edits of the SDF zero level set. A key finding is that these modes are not intrinsic to the geometry alone: they are reliably recoverable only when the Gram operator is estimated from sufficiently rich sampling distributions. We derive a single closed-form update that performs geometric edits to the INR without optimization by leveraging the deformation modes. We characterize theoretically the precise set of deformations that are feasible under this one-shot update, and show that editing is well-posed exactly within the span of these deformation modes.

21. 【2603.29852】VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing

Link: https://arxiv.org/abs/2603.29852

Authors: Juan Rodriguez, Haotian Zhang, Abhay Puri, Tianyang Zhang, Rishav Pramanik, Meng Lin, Xiaoqing Xie, Marco Terral, Darsh Kaushik, Aly Shariff, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli

Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scalable Vector Graphics, Vector Graphics, Scalable Vector, suite for Scalable, comprehensive benchmark suite

Comments:

Abstract:We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on this http URL.

22. 【2603.29847】CADReasoner: Iterative Program Editing for CAD Reverse Engineering

Link: https://arxiv.org/abs/2603.29847

Authors: Soslan Kabisov, Vsevolod Kirichuk, Andrey Volkov, Gennadii Savrasov, Marina Barannikov, Anton Konushin, Andrey Kuznetsov, Dmitrii Zhemchuzhnikov

Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: substantial expert effort, powers modern engineering, producing high-quality parts, demands substantial expert, CAD reverse engineering

Comments:

Abstract:Computer-Aided Design (CAD) powers modern engineering, yet producing high-quality parts still demands substantial expert effort. Many AI systems tackle CAD reverse engineering, but most are single-pass and miss fine geometric details. In contrast, human engineers compare the input shape with the reconstruction and iteratively modify the design based on remaining discrepancies. Agent-based methods mimic this loop with frozen VLMs, but weak 3D grounding of current foundation models limits reliability and efficiency. We introduce CADReasoner, a model trained to iteratively refine its prediction using geometric discrepancy between the input and the predicted shape. The model outputs a runnable CadQuery Python program whose rendered mesh is fed back at the next step. CADReasoner fuses multi-view renders and point clouds as complementary modalities. To bridge the realism gap, we propose a scan-simulation protocol applied during both training and evaluation. Across DeepCAD, Fusion 360, and MCB benchmarks, CADReasoner attains state-of-the-art results on clean and scan-sim tracks.

23. 【2603.29844】DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Link: https://arxiv.org/abs/2603.29844

Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: pre-trained Vision-Language Models, Vision-Language Models, significantly accelerated, models, VLM

Comments: Project page: [this https URL](https://xpeng-robotics.github.io/dial)

Abstract:The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

24. 【2603.29842】Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

Link: https://arxiv.org/abs/2603.29842

Authors: Minyoung E. Kim, Dae Hee Yun, Aditi V. Patel, Madeline Hon, Webster Guan, Taegeon Lee, Brian Nguyen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Unprecedented visual details, light-sheet fluorescence microscopy, Unprecedented visual, fluorescence microscopy, subcellular-resolution whole-brain

Comments: 21 pages, 12 figures. Accepted at CVPR 2026

Abstract:Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information; however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.

25. 【2603.29832】AutoFormBench: Benchmark Dataset for Automating Form Understanding

Link: https://arxiv.org/abs/2603.29832

Authors: Gaurab Baral, Junxiu Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: persistent challenge due, layout variability encountered, enterprise invoices remains, Automated processing, healthcare records

Comments: 9 pages, 3 figures, 2 tables

Abstract:Automated processing of structured documents such as government forms, healthcare records, and enterprise invoices remains a persistent challenge due to the high degree of layout variability encountered in real-world settings. This paper introduces AutoFormBench, a benchmark dataset of 407 annotated real-world forms spanning government, healthcare, and enterprise domains, designed to train and evaluate form element detection models. We present a systematic comparison of classical OpenCV approaches and four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, and YOLOv26-l) for localizing and classifying fillable form elements, specifically checkboxes, input lines, and text boxes, across diverse PDF document types. YOLOv11 demonstrates consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels.
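
The abstract reports F1 score and Jaccard accuracy for detected form elements without spelling out the protocol. The sketch below shows one common way to compute both for axis-aligned boxes of a single class, assuming greedy one-to-one matching at an IoU threshold. The function names, the threshold and the example boxes are hypothetical, and the paper's actual evaluation may differ.

```python
def iou(box_a, box_b):
    """Jaccard index of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def f1_at_threshold(predictions, ground_truth, threshold=0.5):
    """Greedy one-to-one matching, then F1 = 2TP / (2TP + FP + FN)."""
    unmatched = list(ground_truth)
    true_pos = 0
    for pred in predictions:
        # Pick the still-unmatched ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    false_pos = len(predictions) - true_pos
    false_neg = len(unmatched)
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0

# Hypothetical checkbox boxes, NOT taken from AutoFormBench.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(1, 1, 10, 10), (50, 50, 60, 60)]
score = f1_at_threshold(predictions, ground_truth)
print(score)
```

Here one prediction matches with IoU 0.81, one prediction is spurious and one ground-truth box is missed, so the F1 score is 0.5. Sweeping `threshold` would be one way to report results at several tolerance levels.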

26. 【2603.29798】SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

Link: https://arxiv.org/abs/2603.29798

Authors: Léopold Maillard, Francis Engelmann, Tom Durand, Boxiao Pan, Yang You, Or Litany, Leonidas Guibas, Maks Ovsjanikov

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: support meaningful activities, depends on interactive, diverse users, support meaningful, functional affordances remains

Comments: Project page: [this https URL](https://sceneteract.github.io/)

Abstract:Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.

27. 【2603.29788】Multi-Feature Fusion Approach for Generative AI Images Detection

Link: https://arxiv.org/abs/2603.29788

Authors: Abderrezzaq Sendjasni, Mohamed-Chaker Larabi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Local Binary Patterns, Multi-scale Local Binary, Subtracted Contrast Normalized, unprecedented realism, natural photographs

Comments: This work has been submitted to IEEE Transactions for possible publication

Abstract:The rapid evolution of Generative AI (GenAI) models has led to synthetic images of unprecedented realism, challenging traditional methods for distinguishing them from natural photographs. While existing detectors often rely on single-feature spaces, such as statistical regularities, semantic embeddings, or texture patterns, these approaches tend to lack robustness when confronted with diverse and evolving generative models. In this work, we investigate and systematically evaluate a multi-feature fusion framework that combines complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features capturing low-level statistical deviations; (2) CLIP embeddings encoding high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP) characterizing mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we show that individual feature spaces exhibit significant performance variability across different generators. Crucially, the fusion of all three representations yields superior and more consistent performance, particularly in a challenging mixed-model scenario. Compared to state-of-the-art methods, the proposed framework yields consistently improved performance across all evaluated datasets. Overall, this work highlights the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.
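
One of the three feature spaces named in the abstract is Multi-scale Local Binary Patterns. As a simplified illustration, the sketch below computes the basic single-scale, 8-neighbour LBP code and its histogram on a grayscale image stored as a list of lists. It is not the multi-scale variant used in the paper, and the function names and the sample patch are hypothetical.

```python
def lbp_code(image, row, col):
    """8-neighbour LBP code for an interior pixel of a 2D list of intensities."""
    center = image[row][col]
    # Clockwise neighbour offsets starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # Set the bit when the neighbour is at least as bright as the centre.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(image, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Hypothetical 3x3 grayscale patch.
patch = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
center_code = lbp_code(patch, 1, 1)
texture_descriptor = lbp_histogram(patch)
print(center_code)
```

For this patch the right, bottom-right, bottom and bottom-left neighbours are at least as bright as the centre, which sets bits 3 to 6 and gives code 120. A multi-scale version would repeat the computation at several radii and concatenate the histograms.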

28. 【2603.29784】MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification

Link: https://arxiv.org/abs/2603.29784

Authors: Boshko Koloski, Marjan Stoimchev, Jurica Levatić, Dragi Kocev, Sašo Džeroski

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modeling structured label, structured label dependencies, Hierarchical multi-label classification, multi-label classification, essential for modeling

Comments: REO: Advances in Representation Learning for Earth Observation, accepted workshop paper at EurIPS

Abstract:Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).

29. 【2603.29777】From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

Link: https://arxiv.org/abs/2603.29777

Authors: Ganen Sethupathy, Lalit Dumka, Jan Schagen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: event venues require, venues require timely, potentially violent behaviour, city centres, transport hubs

Comments: Preprint version of a manuscript currently under review at IEEE Access

Abstract:Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.

30. 【2603.29773】Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

Link: https://arxiv.org/abs/2603.29773

Authors: Fengyang Xiao, Peng Hu, Lei Xu, XingE Guo, Guanyi Qin, Yuqi Shen, Chengyu Fang, Rihan Zhang, Chunming He, Sina Farsiu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: restore high-quality, degraded low-quality, inputs captured, uncontrolled conditions, aims to restore

Comments: Accepted by CVPR

Abstract:Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to explicitly guide the restoration process toward perceptually optimal outputs. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality, a design that provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.

31. 【2603.29759】TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios

Link: https://arxiv.org/abs/2603.29759

Authors: Qiucheng Yu, Ruijie Xu, Mingang Chen, Xuequan Lu, Jianfeng Dong, Chaochao Lu, Xin Tan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, advances in vision-language, accelerated their application, Recent

Comments:

Abstract:Recent advances in vision-language models (VLMs) have accelerated their application to indoor safety hazards assessment. However, existing benchmarks suffer from three fundamental limitations: (1) heavy reliance on synthetic datasets constructed via simulation software, creating a significant domain gap with real-world environments; (2) oversimplified safety tasks with artificial constraints on hazard and scene types, thereby limiting model generalization; and (3) absence of rigorous evaluation protocols to thoroughly assess model capabilities in complex home safety scenarios. To address these challenges, we introduce TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark comprising 81,809 carefully curated training samples drawn from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. This benchmark set also includes a highly challenging test set with 1707 samples, comprising not only a carefully selected subset from the training distribution but also newly added videos and panoramic images containing multiple safety hazards, used to evaluate the model's robustness in complex safety scenarios. Extensive experiments on 23 popular VLMs demonstrate that current VLMs lack robust capabilities for safety hazard assessment. Importantly, models trained on the TSHA training set not only achieve a significant performance improvement of up to +18.3 points on the TSHA test set but also exhibit enhanced generalizability across other benchmarks, underscoring the substantial contribution and importance of the TSHA benchmark.

32. 【2603.29742】SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Link: https://arxiv.org/abs/2603.29742

Authors: Rui Bao, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Yang Song, Jiaojiao Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: embed verifiable marks, Diffusion-based watermarking methods, methods embed verifiable, Diffusion-based watermarking

Comments:

Abstract:Diffusion-based watermarking methods embed verifiable marks by manipulating the initial noise or the reverse diffusion trajectory. However, these methods share a critical assumption: verification can succeed only if the diffusion trajectory can be faithfully reconstructed. This reliance on trajectory recovery constitutes a fundamental and exploitable vulnerability. We propose Stochastic Hidden-Trajectory Deflection (SHIFT), a training-free attack that exploits this common weakness across diverse watermarking paradigms. SHIFT leverages stochastic diffusion resampling to deflect the generative trajectory in latent space, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while preserving strong visual quality and semantic consistency. Extensive experiments on nine representative watermarking methods spanning noise-space, frequency-domain, and optimization-based paradigms show that SHIFT achieves 95%-100% attack success rates with nearly no loss in semantic quality, without requiring any watermark-specific knowledge or model retraining.

33. 【2603.29734】GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis

Link: https://arxiv.org/abs/2603.29734

Authors: Thomas Tanay, Mohammed Brahimi, Michal Nazarczuk, Qingwen Zhang, Sibi Catley-Chandar, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthesizing novel views, challenging problem, remains a challenging, dynamic scenes remains, dynamic

Comments: CVPR Findings 2026

Abstract:Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion, and achieve fine-grained, six-degrees-of-freedom camera controls. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions.

34. 【2603.29733】Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection

Link: https://arxiv.org/abs/2603.29733

Authors: Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthetic data, data, real labeled data, improve HOI detection, synthetic

Comments:

Abstract:In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions from egocentric images. Through extensive experimentation and comparative analysis on VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data and only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study the effect of aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interaction. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: this https URL.

35. 【2603.29732】Compressive sensing inspired self-supervised single-pixel imaging

Link: https://arxiv.org/abs/2603.29732

Authors: Jijun Lu, Yifan Chen, Libang Chen, Yiqiang Zhou, Ye Zheng, Mingliang Chen, Zhe Sun, Xuelong Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strongly perturbed environments, promising imaging modality, perturbed environments, Existing SPI methods, modality with distinctive

Comments: 10 pages, 9 figures, 2 algorithms, 2 tables, journal paper

Abstract:Single-pixel imaging (SPI) is a promising imaging modality with distinctive advantages in strongly perturbed environments. Existing SPI methods lack physical sparsity constraints and overlook the integration of local and global features, leading to severe noise vulnerability, structural distortions and blurred details. To address these limitations, we propose SISTA-Net, a compressive sensing-inspired self-supervised method for single-pixel imaging. SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) into an interpretable network consisting of a data fidelity module and a proximal mapping module. The fidelity module adopts a hybrid CNN-Visual State Space Model (VSSM) architecture to integrate local and global feature modeling, enhancing reconstruction integrity and fidelity. We leverage deep nonlinear networks as adaptive sparse transforms combined with a learnable soft-thresholding operator to impose explicit physical sparsity in the latent domain, enabling noise suppression and robustness to interference even at extremely low sampling rates. Extensive experiments on multiple simulation scenarios demonstrate that SISTA-Net outperforms state-of-the-art methods by 2.6 dB in PSNR. Real-world far-field underwater tests yield a 3.4 dB average PSNR improvement, validating its robust anti-interference capability.
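
SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm, so it may help to see the classical iteration it starts from. The sketch below is plain ISTA for an L1-regularised least-squares problem, not the learned network: each step takes a gradient step on the data-fidelity term and then applies soft-thresholding as the proximal mapping. The function names, the step-size bound and the toy measurement matrix are assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def ista(A, y, lam, num_iters=200):
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 with classical ISTA."""
    n_rows, n_cols = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the gradient,
    # so 1 / lipschitz is a safe (if conservative) step size.
    lipschitz = sum(a * a for row in A for a in row)
    step = 1.0 / lipschitz
    x = [0.0] * n_cols
    for _ in range(num_iters):
        # Data-fidelity step: gradient of 0.5 * ||A x - y||^2 is A^T (A x - y).
        residual = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i] for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows)) for j in range(n_cols)]
        # Proximal step: shrink each coefficient toward zero.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n_cols)]
    return x

# Toy problem with an identity measurement matrix (hypothetical values).
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.05, -2.0]
x_hat = ista(A, y, lam=0.1)
print(x_hat)
```

With an identity matrix the minimiser is simply the soft-thresholded measurement, so the small middle entry is driven exactly to zero while the large entries shrink by `lam`. According to the abstract, SISTA-Net replaces the fixed transform and threshold with learned modules, which this sketch does not attempt to reproduce.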

36. 【2603.29697】FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing

Link: https://arxiv.org/abs/2603.29697

Authors: Fengjian Xue, Xuecheng Wu, Heli Sun, Yunyun Shi, Shi Chen, Liangyu Fu, Jinheng Xie, Dingkang Yang, Hao Wang, Junxiao Xue, Liang He

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strictly preserve human, precisely manipulating expression, preserve human identity, requires fine-grained control, control to strictly

Comments:

Abstract:Facial expression image editing requires fine-grained control to strictly preserve human identity and background while precisely manipulating expression. However, existing editing benchmarks primarily focus on general scenarios, lacking high-quality facial images and corresponding editing instructions. Furthermore, current evaluation metrics exhibit systemic biases in this task, often favoring lazy editing or overfit editing. To bridge these gaps, we propose FED-Bench, a comprehensive benchmark featuring rigorous testing and an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, effectively mitigating the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to simultaneously achieve high fidelity and accurate expression manipulation, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalable characteristic of introduced benchmark engine, we provide a 20k+ in-the-wild facial training set and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains. Our benchmark and related code will be made publicly open soon.

37. 【2603.29694】Exploring the Impact of Skin Color on Skin Lesion Segmentation

Link: https://arxiv.org/abs/2603.29694

Authors: Kuniko Paxton,Medina Kapo,Amila Akagić,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: making early detection, early detection critical, morbidity and mortality, making early, detection critical

Comments:

Click to view abstract

Abstract:Skin cancer, particularly melanoma, remains a major cause of morbidity and mortality, making early detection critical. AI-driven dermatology systems often rely on skin lesion segmentation as a preprocessing step to delineate the lesion from surrounding skin and support downstream analysis. While fairness concerns regarding skin tone have been widely studied for lesion classification, the influence of skin tone on the segmentation stage remains under-quantified and is frequently assessed using coarse, discrete skin tone categories. In this work, we evaluate three strong segmentation architectures (UNet, DeepLabV3 with a ResNet50 backbone, and DINOv2) on two public dermoscopic datasets (HAM10000 and ISIC2017) and introduce a continuous pigment or contrast analysis that treats pixel-wise ITA values as distributions. Using Wasserstein distances between within-image distributions for skin-only, lesion-only, and whole-image regions, we quantify lesion skin contrast and relate it to segmentation performance across multiple metrics. Within the range represented in these datasets, global skin tone metrics (Fitzpatrick grouping or mean ITA) show weak association with segmentation quality. In contrast, low lesion-skin contrast is consistently associated with larger segmentation errors in models, indicating that boundary ambiguity and low contrast are key drivers of failure. These findings suggest that fairness improvements in dermoscopic segmentation should prioritize robust handling of low-contrast lesions, and the distribution-based pigment measures provide a more informative audit signal than discrete skin-tone categories.
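The contrast measure described above can be illustrated with a small sketch. It uses the standard definition of the Individual Typology Angle, ITA = arctan((L* - 50) / b*) in degrees, and the 1-Wasserstein distance between two one-dimensional samples. The equal-size-sample shortcut and the function names are assumptions made for illustration; the paper's exact pipeline is not given in the abstract.

```python
import math


def ita_degrees(L, b):
    """Individual Typology Angle from CIELAB L* and b*, in degrees.

    atan2 matches arctan((L - 50) / b) for b > 0, the usual case for skin.
    """
    return math.degrees(math.atan2(L - 50.0, b))


def wasserstein_1d(sample_a, sample_b):
    """1-Wasserstein distance between two equal-size 1D samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values.
    """
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)
```

Applied to per-pixel ITA values from the lesion region and the surrounding skin, a small distance would indicate a low-contrast lesion.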

38. 【2603.29692】SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

Link: https://arxiv.org/abs/2603.29692

Authors: Ning Wang,Tieyue Wu,Naeha Sharif,Farid Boussaid,Guangming Zhu,Lin Mei,Mohammed Bennamoun,zhang liang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: skeleton-based action recognition, action recognition aims, recognize unseen actions, recognition aims, aims to recognize

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.

39. 【2603.29676】A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Link: https://arxiv.org/abs/2603.29676

Authors: Lixin Xiu,Xufang Luo,Hideki Nakayama

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, achieve impressive performance, processes remain opaque, internal decision-making processes, decision-making processes remain

Comments: Accepted at ICLR 2026. Project page: [this https URL](https://riishin.github.io/pid-lvlm-iclr26/)

Click to view abstract

40. 【2603.29670】Clinical DVH metrics as a loss function for 3D dose prediction in head and neck radiotherapy

Link: https://arxiv.org/abs/2603.29670

Authors: Ruochen Gao,Marius Staring,Frank Dankers

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated radiotherapy workflows, radiotherapy workflows, DVH metrics, automated radiotherapy, DVH

Comments: 19 pages

Click to view abstract

Abstract:Purpose: Deep-learning-based three-dimensional (3D) dose prediction is widely used in automated radiotherapy workflows. However, most existing models are trained with voxel-wise regression losses, which are poorly aligned with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics. This study aims to develop a clinically guided loss formulation that directly optimizes clinically used DVH metrics while remaining computationally efficient for head and neck (H&N) dose prediction. Methods: We propose a clinical DVH metric loss (CDM loss) that incorporates differentiable D-metrics and surrogate V-metrics, together with a lossless bit-mask region-of-interest (ROI) encoding to improve training efficiency. The method was evaluated on 174 H&N patients using a temporal split (137 training, 37 testing). Results: Compared with MAE- and DVH-curve based losses, CDM loss substantially improved target coverage and satisfied all clinical constraints. Using a standard 3D U-Net, the PTV Score was reduced from 1.544 (MAE) to 0.491 (MAE + CDM), while OAR sparing remained comparable. Bit-mask encoding reduced training time by 83% and lowered GPU memory usage. Conclusion: Directly optimizing clinically used DVH metrics enables 3D dose predictions that are better aligned with clinical treatment planning criteria than conventional voxel-wise or DVH-curve-based supervision. The proposed CDM loss, combined with efficient ROI bit-mask encoding, provides a practical and scalable framework for H&N dose prediction.
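As background for the D- and V-metrics mentioned above, here is a minimal sketch of their plain, non-differentiable definitions over the voxel doses of one ROI. It assumes equal voxel volumes and uses illustrative function names; the paper's CDM loss relies on differentiable and surrogate versions that the abstract does not spell out.

```python
import math


def d_metric(doses, percent):
    """D_x%: minimum dose received by the hottest `percent`% of voxels."""
    if not doses or not 0 < percent <= 100:
        raise ValueError("need voxel doses and a percentage in (0, 100]")
    ranked = sorted(doses, reverse=True)
    # Number of voxels that make up the hottest `percent`% of the ROI.
    k = max(1, math.ceil(len(ranked) * percent / 100.0))
    return ranked[k - 1]


def v_metric(doses, threshold):
    """V_x: fraction of voxels receiving at least `threshold` dose."""
    if not doses:
        raise ValueError("need voxel doses")
    return sum(1 for d in doses if d >= threshold) / len(doses)
```

The sorting in `d_metric` and the hard comparison in `v_metric` are what make these definitions awkward to use directly as training losses.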

Submission history: Ruochen Gao, v1, Tue, 31 Mar 2026 12:27:41 UTC (2,209 KB)

41. 【2603.29666】CoRe-DA: Contrastive Regression for Unsupervised Domain Adaptation in Surgical Skill Assessment

Link: https://arxiv.org/abs/2603.29666

Authors: Dimitrios Anastasiou,Razvan Caramalau,Jialang Xu,Runlong He,Freweini Tesfai,Matthew Boal,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-based surgical skill, surgical skill assessment, Vision-based surgical, operative performance, evaluation of operative

Comments:

Click to view abstract

Abstract:Vision-based surgical skill assessment (SSA) enables objective and scalable evaluation of operative performance. Progress in this field is constrained by the high cost and time demands for manual annotation of quantitative skill scores, as well as the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, appreciable volumes of unlabeled video data are now available, motivating the development of unsupervised domain adaptation (UDA) methods for SSA. We introduce the first benchmark for UDA in SSA regression, spanning four datasets across dry-lab and clinical settings as well as open and robotic surgery. We evaluate eight representative models under challenging domain shifts and propose CoRe-DA, a novel contrastive regression-based adaptation framework. Our method learns domain-invariant representations through relative-score supervision and target-domain self-training. Comprehensive experiments across two UDA settings show that CoRe-DA is superior to state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, without using any labeled target data for training. Overall, CoRe-DA enables scalable SSA with reliable cross-domain generalization, where existing methods underperform. Our code and datasets will be released at this https URL.
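The Spearman correlation coefficients reported above measure how well the ranking of predicted skill scores agrees with the ranking of reference scores. A minimal sketch of that metric in plain Python follows, assuming tied values receive their average rank; the function names are illustrative and not taken from the CoRe-DA code.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Because only ranks matter, the coefficient is unchanged by any monotonic rescaling of the predicted scores.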

42. 【2603.29664】CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Link: https://arxiv.org/abs/2603.29664

Authors: Shifang Zhao,Yihan Hu,Ying Shan,Yunchao Wei,Xiaodong Cun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: current social media, digital human-made art, audio alignment forms, social media, Multimodal Language Models

Comments: Project Code: [this https URL](https://github.com/GVCLab/CutClaw)

Click to view abstract

Abstract:Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: this https URL.

43. 【2603.29655】Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors

Link: https://arxiv.org/abs/2603.29655

Authors: Pengfei Zhou,Xiangyue Zhang,Xukun Shen,Yong Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: treat motion frames, Masked generative models, motion, masked motion, generative models

Comments:

Click to view abstract

Abstract:Masked generative models have become a strong paradigm for text-to-motion synthesis, but they still treat motion frames too uniformly during masking, attention, and decoding. This is a poor match for motion, where local dynamic complexity varies sharply over time. We show that current masked motion generators degrade disproportionately on dynamically complex motions, and that frame-wise generation error is strongly correlated with motion dynamics. Motivated by this mismatch, we introduce the Motion Spectral Descriptor (MSD), a simple and parameter-free measure of local dynamic complexity computed from the short-time spectrum of motion velocity. Unlike learned difficulty predictors, MSD is deterministic, interpretable, and derived directly from the motion signal itself. We use MSD to make masked motion generation complexity-aware. In particular, MSD guides content-focused masking during training, provides a spectral similarity prior for self-attention, and can additionally modulate token-level sampling during iterative decoding. Built on top of masked motion generators, our method, DynMask, improves motion generation most clearly on dynamically complex motions while also yielding stronger overall FID on HumanML3D and KIT-ML. These results suggest that respecting local motion complexity is a useful design principle for masked motion generation. Project page: this https URL

44. 【2603.29634】MacTok: Robust Continuous Tokenization for Image Generation

Link: https://arxiv.org/abs/2603.29634

Authors: Hengyu Zeng,Xin Gao,Guanghao Li,Yuxiang Yan,Jiaoyang Ruan,Junpeng Ma,Haoyu Albert Wang,Jian Pu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Continuous image tokenizers, image tokenizers enable, tokenizers enable efficient, Continuous image

Comments:

Click to view abstract

Abstract:Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256×256 and a state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64×. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.

45. 【2603.29633】Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification

Link: https://arxiv.org/abs/2603.29633

Authors: Mingkun Tan,Xilu Wang,Michael Kloster,Tim W. Nattkemper

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Label-scarce visual classification, Label-scarce visual, sites exhibit partially, exhibit partially overlapping, overlapping class sets

Comments: 22 pages, 9 figures

Click to view abstract

Abstract:Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate pure partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable setting, we propose PreDi, a partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions, namely class Prevalence and class-set size Disparity, enabling separate analysis of their effects. Guided by the resulting insights, we further propose PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios. Extensive experiments show that SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings. The pronounced heterogeneity in unlabeled data volume is associated with improved representation pre-training, whereas under label-space heterogeneity, prevalence dominates performance and disparity has a smaller effect. PreP-WFL effectively mitigates this degradation, with gains increasing as prevalence decreases. These findings provide a mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems.

46. 【2603.29631】Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

Link: https://arxiv.org/abs/2603.29631

Authors: Sherif Abdelwahab

Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)

Keywords: Always-on edge cameras, edge cameras generate, cameras generate continuous, generate continuous video, continuous video streams

Comments: 6 pages, 3 figures, 5 tables; supplementary video included as ancillary file

Click to view abstract

47. 【2603.29630】BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

Link: https://arxiv.org/abs/2603.29630

Authors: Johann-Ludwig Herzog,Mathis Jürgen Adler,Leonard Hackel,Yan Shu,Angelos Zavras,Ioannis Papoutsis,Paolo Rota,Begüm Demir

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: data remains limited, remains limited due, shown strong performance, computer vision

Comments: For details, see [this https URL](https://txt.bigearth.net)

Click to view abstract

Abstract:Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually-verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.

48. 【2603.29620】Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Link: https://arxiv.org/abs/2603.29620

Authors: Shuang Chen,Quanxin Shou,Hangting Chen,Yucheng Zhou,Kaituo Feng,Wenbo Hu,Yi-Fan Zhang,Yunlong Lin,Wenxuan Huang,Mingyang Song,Dasen Dai,Bolin Jiang,Manyuan Zhang,Shi-Xue Zhang,Zhengkai Jiang,Lucas Wang,Zhao Zhong,Yu Cheng,Nanyun Peng

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: world-grounded image synthesis, image synthesis, provide a natural, natural and promising, promising architecture

Comments: Project Page: [this https URL](https://github.com/shawn0728/Unify-Agent)

Click to view abstract

Abstract:Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.

49. 【2603.29616】Video-Oasis: Rethinking Evaluation of Video Understanding

Link: https://arxiv.org/abs/2603.29616

Authors: Geuntaek Lim,Minho Shim,Sungjune Park,Jaeyun Lee,Inwoong Lee,Taeoh Kim,Dongyoon Wee,Yukyung Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: performance gains stem, video understanding, video understanding makes, knowledge priors, linguistic reasoning

Comments:

Click to view abstract

Abstract:The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at this https URL.

50. 【2603.29602】IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection

Link: https://arxiv.org/abs/2603.29602

Authors: Fei Shen,Chengyu Xie,Lihong Wang,Zhanyi Zhang,Xin Jiang,Xiaoyu Du,Jinhui Tang

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: isolated single-step execution, multi-turn image editing, image editing paradigms, confined to isolated, isolated single-step

Comments:

Click to view abstract

Abstract:Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Due to a lack of context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. For that, we propose IMAGAgent, a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to precisely decompose complex natural language instructions into a series of executable sub-tasks, governed by target singularity, semantic atomicity, and visual perceptibility. Then, the tool-chain orchestration module dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and collaborative operation among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism where a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, simultaneously triggering fine-grained self-correction and recording feedback outcomes to optimize future decisions. Extensive experiments on our constructed MTEditBench and the MagicBrush dataset demonstrate that IMAGAgent achieves performance significantly superior to existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at this https URL.

51. 【2603.29592】Bioinspired123D: Generative 3D Modeling System for Bioinspired Structures

Link: https://arxiv.org/abs/2603.29592

Authors: Rachel K. Luu,Markus J. Buehler

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high computational cost, made rapid progress, progress in text, video synthesis, computational cost

Comments:

Click to view abstract

Abstract:Generative AI has made rapid progress in text, image, and video synthesis, yet text-to-3D modeling for scientific design remains particularly challenging due to limited controllability and high computational cost. Most existing 3D generative methods rely on meshes, voxels, or point clouds which can be costly to train and difficult to control. We introduce Bioinspired123D, a lightweight and modular code-as-geometry pipeline that generates fabricable 3D structures directly through parametric programs rather than dense visual representations. At the core of Bioinspired123D is Bioinspired3D, a compact language model finetuned to translate natural language design cues into Blender Python scripts encoding smooth, biologically inspired geometries. We curate a domain-specific dataset of over 4,000 bioinspired and geometric design scripts spanning helical, cellular, and tubular motifs with parametric variability. The dataset is expanded and validated through an automated LLM-driven, Blender-based quality control pipeline. Bioinspired3D is then embedded in a graph-based agentic framework that integrates multimodal retrieval-augmented generation and a vision-language model critic to iteratively evaluate, critique, and repair generated scripts. We evaluate performance on a new benchmark for 3D geometry script generation and show that Bioinspired123D demonstrates a near fourfold improvement over its non-finetuned base model, while also outperforming substantially larger state-of-the-art language models despite using far fewer parameters and compute. By prioritizing code-as-geometry representations, Bioinspired123D enables compute-efficient, controllable, and interpretable text-to-3D generation, lowering barriers to AI driven scientific discovery in materials and structural design.

52. 【2603.29591】FlowID: Enhancing Forensic Identification with Latent Flow-Matching Models

Link: https://arxiv.org/abs/2603.29591

Authors: Jules Ripoll,David Bertoin,Alasdair Newson,Charles Dossal,Jose Pablo Baraybar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: climate disasters, people die, violent circumstances, facial reconstruction

Comments:

Click to view abstract

Abstract:Every day, many people die under violent circumstances, whether from crimes, war, migration, or climate disasters. Medico-legal and law enforcement institutions document many portraits of the deceased for evidence, but cannot immediately carry out identification on them. While traditional image editing tools can process these photos for public release, the workflow is lengthy and produces suboptimal results. In this work, we leverage advances in image generation models, which can now produce photorealistic human portraits, to introduce FlowID, an identity-preserving facial reconstruction method. Our approach combines single-image fine-tuning, which adapts the generative model to out-of-distribution injured faces, with attention-based masking that localizes edits to damaged regions while preserving identity-critical features. Together, these components enable the removal of artifacts from violent death while retaining sufficient identity information to support identification. To evaluate our method, we introduce InjuredFaces, a novel benchmark for identity-preserving facial reconstruction under severe facial damage. Beyond serving as an evaluation tool for this work, InjuredFaces provides a standardized resource for the community to study and compare methods addressing facial reconstruction in extreme conditions. Experimental results show that FlowID outperforms state-of-the-art open-source methods while maintaining low memory requirements, making it suitable for local deployment without compromising data privacy.

53. 【2603.29578】Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition

Link: https://arxiv.org/abs/2603.29578

Authors: Rongkang Dong,Cuixin Yang,Cong Zhang,Yushen Zuo,Kin-Man Lam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial Expression Recognition, facial affective behaviors, Facial Expression, interpret human emotions, facial affective

Comments:

Click to view abstract

Abstract:Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, retraining EmoDC using standard strategies fails to penalize incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between noise-prediction errors for correct and incorrect categories, thereby enhancing the model's discriminative capability. Nevertheless, using a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample. Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the Base model with standard denoising diffusion training on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in terms of robustness against noise and blur.
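The idea of enforcing a minimum margin between noise-prediction errors can be illustrated with a generic hinge-style loss. This is a sketch under assumptions, not the authors' formulation: the abstract does not give the loss, so the hinge form, the averaging over incorrect classes, and the name `margin_discrepancy_loss` are all hypothetical.

```python
def margin_discrepancy_loss(err_correct, errs_incorrect, margin):
    """Hinge-style penalty on denoising errors (illustrative only).

    err_correct: error when conditioning on the true class description.
    errs_incorrect: errors when conditioning on each mismatched class.
    The loss is zero once every incorrect-class error exceeds the
    correct-class error by at least `margin`.
    """
    if not errs_incorrect:
        raise ValueError("need at least one incorrect-class error")
    penalties = [max(0.0, margin + err_correct - e) for e in errs_incorrect]
    return sum(penalties) / len(penalties)
```

An adaptive variant, as the abstract describes for AMDiT, would choose `margin` per sample instead of using one fixed value.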

54. 【2603.29572】Turbo4DGen: Ultra-Fast Acceleration for 4D Generation

Link: https://arxiv.org/abs/2603.29572

Authors: Yuanbin Man, Ying Huang, Zhile Ren, Miao Yin

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: realistic dynamic scenes, model realistic dynamic, advancing world models, integrates spatial, dynamic scenes

Abstract:4D generation, or dynamic 3D content generation, integrates spatial, temporal, and view dimensions to model realistic dynamic scenes, playing a foundational role in advancing world models and physical AI. However, maintaining long-chain consistency across both frames and viewpoints through the unique spatio-camera-motion (SCM) attention mechanism introduces substantial computational and memory overhead, often leading to out-of-memory (OOM) failures and prohibitive generation times. To address these challenges, we propose Turbo4DGen, an ultra-fast acceleration framework for diffusion-based multi-view 4D content generation. Turbo4DGen introduces a spatiotemporal cache mechanism that persistently reuses intermediate attention across denoising steps, combined with dynamically semantic-aware attention pruning and an adaptive SCM chain bypass scheduler, to drastically reduce redundant SCM attention computation. Our experimental results show that Turbo4DGen achieves an average 9.7$\times$ speedup without quality degradation on the ObjaverseDy and Consistent4D datasets. To the best of our knowledge, Turbo4DGen is the first dedicated acceleration framework for 4D generation.

55. 【2603.29570】Generating Key Postures of Bharatanatyam Adavus with Pose Estimation

Link: https://arxiv.org/abs/2603.29570

Authors: Jagadish Kashinath Kamble, Jayanta Mukhopadhyay, Debaditya Roy, Partha Pratim Das

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Preserving intangible cultural, symbolic rules presents, rules presents unique, presents unique challenges, Preserving intangible

Comments: Published in ICVGIP, 2025

Abstract:Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose-aware generative framework integrated with a pose estimation module, guided by keypoint-based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground-truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision. Code is available at this https URL.

56. 【2603.29535】Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

Link: https://arxiv.org/abs/2603.29535

Authors: Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generative Artificial Intelligence, Generative Artificial, Artificial Intelligence, prompt-guided image transformation, deploying Large Vision

Comments: Accepted at the Mobile AI Workshop, CVPR 2026

Abstract:Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile a separate model binary for each LoRA, each bundled with its own copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to a 6x reduction in memory footprint and a 4x reduction in latency, while maintaining high visual quality across multiple GenAI tasks.
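
To make the "LoRA weights as runtime inputs" idea concrete, here is a minimal sketch assuming the usual LoRA formulation y = W x + scale * B (A x). The function names and the plain-list matrix representation are hypothetical and chosen for illustration; this is not the authors' runtime stack, and quantization is left out entirely.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_linear(base_weight, x, lora_a, lora_b, scale=1.0):
    """Compute W x + scale * B (A x), with the adapter supplied at call time.

    base_weight stays fixed across tasks; only lora_a and lora_b change,
    so switching tasks needs no rebuild of the base layer.
    """
    base = matvec(base_weight, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]


# One shared base weight, two hypothetical rank-1 adapters.
W = [[1, 0], [0, 1]]
task_one = ([[1, 1]], [[1], [0]])   # (A, B) for the first task
task_two = ([[1, 1]], [[0], [1]])   # (A, B) for the second task
out_one = lora_linear(W, [1, 2], *task_one)
out_two = lora_linear(W, [1, 2], *task_two)
```

Because the adapter matrices are ordinary arguments, the same `W` serves both tasks, which is the storage saving the abstract describes.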

57. 【2603.29507】Transmittance-Guided Structure-Texture Decomposition for Nighttime Image Dehazing

Link: https://arxiv.org/abs/2603.29507

Authors: Francesco Moretti, Giulia Bianchi, Andrea Gallo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including low visibility, hazy conditions suffer, severe quality degradation, Nighttime images captured, artificial light sources

Abstract:Nighttime images captured under hazy conditions suffer from severe quality degradation, including low visibility, color distortion, and reduced contrast, caused by the combined effects of atmospheric scattering, absorption by suspended particles, and non-uniform illumination from artificial light sources. While existing nighttime dehazing methods have achieved partial success, they typically address only a subset of these issues, such as glow suppression or brightness enhancement, without jointly tackling the full spectrum of degradation factors. In this paper, we propose a two-stage nighttime image dehazing framework that integrates transmittance correction with structure-texture layered optimization. In the first stage, we introduce a novel transmittance correction method that establishes boundary-constrained initial transmittance maps and subsequently applies region-adaptive compensation and normalization based on whether image regions correspond to light source areas. A quadratic Gaussian filtering scheme operating in the YUV color space is employed to estimate the spatially varying atmospheric light map. The corrected transmittance map and atmospheric light map are then used in conjunction with an improved nighttime imaging model to produce the initial dehazed image. In the second stage, we propose a STAR-YUV decomposition model that separates the dehazed image into structure and texture layers within the YUV color space. Gamma correction and MSRCR-based color restoration are applied to the structure layer for illumination compensation and color bias correction, while Laplacian-of-Gaussian filtering is applied to the texture layer for detail enhancement. A novel two-phase fusion strategy, comprising nonlinear Retinex-based fusion of the enhanced layers followed by linear blending with the initial dehazing result, yields the final output.

58. 【2603.29495】All-in-One Augmented Reality Guided Head and Neck Tumor Resection

Link: https://arxiv.org/abs/2603.29495

Authors: Yue Yang, Matthieu Chabanas, Carrie Reale, Annie Benson, Jason Slagle, Matthew Weinger, Michael Topf, Jie Ying Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)

Keywords: squamous cell carcinoma, neck squamous cell, typically communicated verbally, cell carcinoma, verbally from pathology

Abstract:Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.

59. 【2603.29494】VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

Link: https://arxiv.org/abs/2603.29494

Authors: Anmin Liu, Ruixuan Yang, Huiqiang Jiang, Bin Lin, Minmin Sun, Yong Li, Chen Zhang, Tao Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant computational challenge, Transformer-based video models, challenge for Transformer-based, Long-context video understanding, Long-context video

Comments: Accepted at CVPR 2026

Abstract:Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65$\times$ speedup over full attention and a 1.83$\times$ speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at this https URL.

60. 【2603.29460】Square Superpixel Generation and Representation Learning via Granular Ball Computing

Link: https://arxiv.org/abs/2603.29460

Authors: Shuyin Xia, Meng Yang, Dawei Dai, Fan Chen, Shilin Zhao, Junwei Han, Xinbo Gao, Guoyin Wang, Wen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserves object boundaries, reduce computational cost, compact region-based representation, local structures, provide a compact

Abstract:Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.
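
The block-and-purity idea in this abstract can be sketched as a quadtree-style split, assuming a square grayscale image with a power-of-two side. The purity definition (fraction of pixels within `tol` of the block mean), the thresholds, and the function names are assumptions for illustration; the abstract does not specify how the purity score is computed or how blocks are selected.

```python
def purity(image, top, left, size, tol=10):
    """Fraction of pixels in a square block within `tol` of the block mean."""
    pixels = [image[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)]
    mean = sum(pixels) / len(pixels)
    return sum(abs(p - mean) <= tol for p in pixels) / len(pixels)


def square_superpixels(image, top=0, left=0, size=None,
                       min_purity=0.9, min_size=1):
    """Return (top, left, size) blocks, splitting impure blocks into quadrants."""
    if size is None:
        size = len(image)  # assumes a square image with power-of-two side
    # Keep the block if it is pure enough or cannot be split further.
    if size <= min_size or purity(image, top, left, size) >= min_purity:
        return [(top, left, size)]
    half = size // 2
    blocks = []
    for dr in (0, half):
        for dc in (0, half):
            blocks += square_superpixels(image, top + dr, left + dc, half,
                                         min_purity, min_size)
    return blocks
```

Each returned block is axis-aligned and square, so it can be fed to regular operators or used as a graph node or token without handling irregular shapes.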

61. 【2603.29455】FedDBP: Enhancing Federated Prototype Learning with Dual-Branch Features and Personalized Global Fusion

Link: https://arxiv.org/abs/2603.29455

Authors: Ningzhi Gao, Siquan Huang, Leyu Shi, Ying Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: heterogeneous federated learning, FPL methods fail, Federated prototype learning, heterogeneous federated

Abstract:Federated prototype learning (FPL), as a solution to heterogeneous federated learning (HFL), effectively alleviates the challenges of data and model heterogeneity. However, existing FPL methods fail to balance the fidelity and discriminability of the feature, and are limited by a single global prototype. In this paper, we propose FedDBP, a novel FPL method to address the above issues. On the client-side, we design a Dual-Branch feature projector that employs L2 alignment and contrastive learning simultaneously, thereby ensuring both the fidelity and discriminability of local features. On the server-side, we introduce a Personalized global prototype fusion approach that leverages Fisher information to identify the important channels of local prototypes. Extensive experiments demonstrate the superiority of FedDBP over ten existing advanced methods.

62. 【2603.29450】Few-shot Writer Adaptation via Multimodal In-Context Learning

Link: https://arxiv.org/abs/2603.29450

Authors: Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Handwritten Text Recognition, Handwritten Text, Text Recognition, exhibiting highly specific, highly specific styles

Abstract:While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.

63. 【2603.29449】NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification

Link: https://arxiv.org/abs/2603.29449

Authors: Youngung Han, Minkyung Cha, Kyeonghun Kim, Induk Um, Myeongbin Sho, Joo Young Bae, Jaewon Jung, Jung Hyeok Park, Seojun Lee, Nam-Joon Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Ken Ying-Kai Liao, Hyuk-Jae Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Minimizing invasive diagnostic, invasive diagnostic procedures, Minimizing invasive, invasive diagnostic, diagnostic procedures

Comments: 15 pages, 5 figures. Accepted for oral presentation at W3PHIAI Workshop, AAAI 2026

Abstract:Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. And yet, noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, remains challenging due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.

64. 【2603.29441】EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images

Link: https://arxiv.org/abs/2603.29441

Authors: Yijie Zheng, Weijie Wu, Bingyue Wu, Long Zhao, Guoqing Li, Mikolaj Czerkawski, Konstantin Klemmer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: freely accessible tools, significant barrier remains, Earth observation community, high-impact foundation models, Earth embedding datasets

Comments: ICLR 2026 Workshop ML4RS Tutorial Track (oral)

Abstract:While the Earth observation community has witnessed a surge in high-impact foundation models and global Earth embedding datasets, a significant barrier remains in translating these academic assets into freely accessible tools. This tutorial introduces EarthEmbeddingExplorer, an interactive web application designed to bridge this gap, transforming static research artifacts into dynamic, practical workflows for discovery. We will provide a comprehensive hands-on guide to the system, detailing its cloud-native software architecture, demonstrating cross-modal queries (natural language, visual, and geolocation), and showcasing how to derive scientific insights from retrieval results. By democratizing access to precomputed Earth embeddings, this tutorial empowers researchers to seamlessly transition from state-of-the-art models and data archives to real-world application and analysis. The web application is available at this https URL.

65. 【2603.29437】SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Link: https://arxiv.org/abs/2603.29437

Authors: Wenli Li, Kai Zhao, Haoran Jiang, Enquan Yang, Yi Su, Dan Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language models, question answering, widely adopted, token, large language model

Abstract:Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
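
The two-stage selection described in this abstract can be sketched as a top-k pick by saliency followed by a greedy, distance-aware pick. This is a sketch under stated assumptions: the scoring rule (saliency multiplied by distance to the nearest kept token), the function name, and the parameters are hypothetical, since the abstract does not say how semantic relevance and 3D distance are combined.

```python
import math


def prune_tokens(saliency, positions, n_salient, n_diverse):
    """Return indices of kept tokens.

    saliency: per-token importance scores (for example, attention-derived).
    positions: per-token 3D coordinates as (x, y, z) tuples.
    """
    if n_salient < 1:
        raise ValueError("n_salient must be at least 1")
    # Stage 1: keep the most salient tokens.
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept = order[:n_salient]
    remaining = set(order[n_salient:])
    # Stage 2: greedily add tokens that are relevant and far from the kept set.
    for _ in range(min(n_diverse, len(remaining))):
        best = max(
            remaining,
            key=lambda i: saliency[i] * min(
                math.dist(positions[i], positions[j]) for j in kept),
        )
        kept.append(best)
        remaining.remove(best)
    return kept
```

The second stage favors tokens that cover regions of the scene the salient tokens miss, which is how the sketch trades object-level evidence against spatial coverage.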

66. 【2603.29428】Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions

Link: https://arxiv.org/abs/2603.29428

Authors: Xuesong Wang, Harry Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: classic optical illusions, exhibit a systematic, counterfactually modified, confronted with classic, classic optical

Comments: CVPR 2026 DataCV Workshop, code: https://github.com/Davidxswang/cvpr_2026_datacv_submission

Abstract:Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as "real" regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.

67. 【2603.29423】A2BFR: Attribute-Aware Blind Face Restoration

Link: https://arxiv.org/abs/2603.29423

Authors: Chenxin Zhu, Yushun Fang, Lu Liu, Shibo Yin, Xiaohong Liu, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherently ill-posed nature, ill-posed nature leads, Blind face restoration, recover high-quality facial, high-quality facial images

Abstract:Blind face restoration (BFR) aims to recover high-quality facial images from degraded inputs, yet its inherently ill-posed nature leads to ambiguous and uncontrollable solutions. Recent diffusion-based BFR methods improve perceptual quality but remain uncontrollable, whereas text-guided face editing enables attribute manipulation without reliable restoration. To address these issues, we propose A$^2$BFR, an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation. Built upon a Diffusion Transformer backbone with unified image-text cross-modal attention, A$^2$BFR jointly conditions the denoising trajectory on both degraded inputs and textual prompts. To inject semantic priors, we introduce attribute-aware learning, which supervises denoising latents using facial attribute embeddings extracted by an attribute-aware encoder. To further enhance prompt controllability, we introduce semantic dual-training, which leverages the pairwise attribute variations in our newly curated AttrFace-90K dataset to enforce attribute discrimination while preserving fidelity. Extensive experiments demonstrate that A$^2$BFR achieves state-of-the-art performance in both restoration fidelity and instruction adherence, outperforming diffusion-based BFR baselines by -0.0467 LPIPS and +52.58% attribute accuracy, while enabling fine-grained, prompt-controllable restoration even under severe degradations.

68. 【2603.29422】Multimodal Models Meet Presentation Attack Detection on ID Documents

Link: https://arxiv.org/abs/2603.29422

Authors: Marina Villanueva, Juan M. Espin, Juan E. Tapia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: biometric security, Presentation Attack Detection, represents a significant, significant advancement, advancement in biometric

Abstract:The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect presentation attacks on ID Documents.

69. 【2603.29419】RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment

Link: https://arxiv.org/abs/2603.29419

Authors: Qiyuan Zhuang, He-Yang Xu, Yijun Wang, Xin-Yang Zhao, Yang-Yang Li, Xiu-Shen Wei

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding object affordances, unstructured environments, Understanding object, essential for enabling, enabling robots

Comments: Accepted to ICRA 2026

Abstract:Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: this https URL.

70. 【2603.29418】Adversarial Prompt Injection Attack on Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.29418

Authors: Meiwen Ding, Song Xia, Chenqi Kong, Xudong Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large language models, instruction-following behavior leaves, prompt injection attacks, multimodal large language, prompt injection

Abstract:Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.

71. 【2603.29414】Native-Domain Cross-Attention for Camera-LiDAR Extrinsic Calibration Under Large Initial Perturbations

Link: https://arxiv.org/abs/2603.29414

Authors: Ni Ou, Zhuo Chen, Xinru Zhang, Junzheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: camera-LiDAR fusion relies, potentially large misalignments, establishing reliable cross-modal, Accurate camera-LiDAR fusion, relies on precise

Comments: 8 pages, 3 figures

Abstract:Accurate camera-LiDAR fusion relies on precise extrinsic calibration, which fundamentally depends on establishing reliable cross-modal correspondences under potentially large misalignments. Existing learning-based methods typically project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when the extrinsic initialization is far from the ground truth. To address this issue, we propose an extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. The proposed attention mechanism explicitly injects extrinsic parameter hypotheses into the correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both accuracy and robustness. Under large extrinsic perturbations, our approach achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing the second-best baseline. We have open sourced our code on this https URL to benefit the community.

72. 【2603.29410】AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Link: https://arxiv.org/abs/2603.29410

Authors: Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: exhibit strong zero-shot, strong zero-shot generalization, exhibit strong, zero-shot adversarial robustness, generalization but remain

Comments: Accepted by CVPR 2026; Code is available at https://github.com/YuboCui/AGFT

Abstract:Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
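
The abstract describes matching the robust model's output to a temperature-scaled version of the pre-trained model's predictions. A minimal sketch of that idea follows. The use of KL divergence, the function names, and the default temperature are assumptions made for illustration; the abstract does not state the exact loss, so this should not be read as the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def soft_alignment_loss(pretrained_logits, robust_logits, temperature=2.0):
    """Hypothetical loss: pull the robust model's image-text distribution
    toward the softened distribution of the frozen pre-trained model."""
    target = softmax(pretrained_logits, temperature)  # soft labels, not hard labels
    predicted = softmax(robust_logits)
    return kl_divergence(target, predicted)
```

A higher temperature flattens the target distribution, so the relative similarities between an image and all candidate texts are kept rather than collapsed onto a single hard label.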

73. 【2603.29405】Hallucination-aware intermediate representation edit in large vision-language models

Link: https://arxiv.org/abs/2603.29405

Authors: Wei Suo, Hanzu Zhang, Lijun Zhang, Ji Ma, Peng Wang, Yanning Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision-Language Models, complex scene understanding, Large Vision-Language, demonstrated exceptional performance, scene understanding

Comments:

Abstract:Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at this https URL

74. 【2603.29394】AA-Splat: Anti-Aliased Feed-forward Gaussian Splatting

Link: https://arxiv.org/abs/2603.29394

Authors: Taewoo Suh, Sungpyo Kim, Jongmin Park, Munchurl Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, solution for sparse-view, view synthesis, Splatting, Gaussian

Comments: Please visit our project page at [this https URL](https://kaist-viclab.github.io/aasplat-site/)

Abstract:Feed-forward 3D Gaussian Splatting (FF-3DGS) has emerged as a fast and robust solution for sparse-view 3D reconstruction and novel view synthesis (NVS). However, existing FF-3DGS methods are built on incorrect screen-space dilation filters, causing severe rendering artifacts when rendering at out-of-distribution sampling rates. We are the first to propose an FF-3DGS model, called AA-Splat, that enables robust anti-aliased rendering at any resolution. AA-Splat utilizes an opacity-balanced band-limiting (OBBL) design, which combines two components: a 3D band-limiting post-filter that integrates multi-view maximal frequency bounds into the feed-forward reconstruction pipeline, effectively band-limiting the resulting 3D scene representations and eliminating degenerate Gaussians; and an Opacity Balancing (OB) component that seamlessly integrates all pixel-aligned Gaussian primitives into the rendering process, compensating for the increased overlap between expanded Gaussian primitives. AA-Splat demonstrates drastic improvements, with average PSNR gains of 5.4$\sim$7.5dB in NVS performance over a state-of-the-art (SOTA) baseline, DepthSplat, at all resolutions between $4\times$ and $1/4\times$. Code will be made available.

75. 【2603.29387】Extend3D: Town-Scale 3D Generation

Link: https://arxiv.org/abs/2603.29387

Authors: Seungwoo Yoon, Jinmo Kim, Jaesik Park

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: training-free pipeline, generative model, object-centric models, latent space, extended latent space

Comments: CVPR 2026, Project Page: [this http URL](http://seungwoo-yoon.github.io/extend3d-page)

Abstract:In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.

76. 【2603.29386】PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization

Link: https://arxiv.org/abs/2603.29386

Authors: Jianpeng Wang, Haoyu Wang, Baoying Chen, Jishen Zeng, Yiming Qin, Yiqi Yang, Zhongjie Ba

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: malicious content fabrication, fabrication and misinformation, rapid democratization, recently exacerbated, exacerbated the risks

Comments:

Abstract:The rapid democratization of prompt-based AI image editing has recently exacerbated the risks associated with malicious content fabrication and misinformation. However, forgery localization methods targeting these emerging editing techniques remain significantly under-explored. To bridge this gap, we first introduce a fully automated mask annotating framework that leverages keypoint alignment and semantic space similarity to generate precise ground-truth masks for edited regions. Based on this framework, we construct PromptForge-350k, a large-scale forgery localization dataset covering four state-of-the-art prompt-based AI image editing models, thereby mitigating the data scarcity in this domain. Furthermore, we propose ICL-Net, an effective forgery localization network featuring a triple-stream backbone and intra-image contrastive learning. This design enables the model to capture highly robust and generalizable forensic features. Extensive experiments demonstrate that our method achieves an IoU of 62.5% on PromptForge-350k, outperforming SOTA methods by 5.1%. Additionally, it exhibits strong robustness against common degradations with an IoU drop of less than 1%, and shows promising generalization capabilities on unseen editing models, achieving an average IoU of 41.5%.
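
The results above are reported as IoU between predicted and ground-truth forgery masks. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition. The function name and the nested-list mask format are assumptions chosen for illustration, not taken from the paper.

```python
def mask_iou(predicted, target):
    """Intersection over Union of two binary masks.

    Both masks are equally sized 2D lists of 0/1 values, where 1 marks a
    pixel flagged as edited.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return intersection / union if union else 1.0
```

For example, a prediction that covers one of the two truly edited pixels and adds one false positive has an intersection of 1 and a union of 3, giving an IoU of 1/3.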

77. 【2603.29376】Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement

Link: https://arxiv.org/abs/2603.29376

Authors: Fabian Kabus, Julia Hindel, Jelena Bratulić, Meropi Karakioulaki, Ayush Gupta, Cristina Has, Thomas Brox, Abhinav Valada, Harald Binder

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recessive dystrophic epidermolysis, dystrophic epidermolysis bullosa, rare genetic skin, genetic skin disorder, clinicians greatly benefit

Comments:

Abstract:Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
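
The evaluation idea in this abstract, scoring an embedding space by how often it agrees with expert triplet judgments, can be sketched in a few lines. The Euclidean distance, the dictionary layout, and the function names below are assumptions for illustration; the paper may use a different distance or tie-handling rule.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_agreement(embeddings, triplets):
    """Fraction of expert triplets that the embedding space respects.

    embeddings: dict mapping a case id to its embedding vector.
    triplets: list of (anchor, closer, farther) ids, meaning the expert
              judged `closer` to be more similar to `anchor` than `farther`.
    """
    if not triplets:
        raise ValueError("at least one triplet is required")
    hits = 0
    for anchor, closer, farther in triplets:
        d_close = euclidean(embeddings[anchor], embeddings[closer])
        d_far = euclidean(embeddings[anchor], embeddings[farther])
        if d_close < d_far:  # ties count as disagreement in this sketch
            hits += 1
    return hits / len(triplets)
```

Under this definition an embedding that orders cases at random would be expected to score about 50%, which gives a reference point for the 73.5% agreement reported above.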

78. 【2603.29368】StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

Link: https://arxiv.org/abs/2603.29368

Authors: Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical research frontier, stereo vision, stereo, vision, research frontier

Comments:

Abstract:Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from a more significant degradation of geometric details during feature extraction. Such characteristics conflict with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network ranked first among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.

79. 【2603.29362】Uncertainty-Aware Trajectory Prediction: A Unified Framework Harnessing Positional and Semantic Uncertainties

Link: https://arxiv.org/abs/2603.29362

Authors: Jintao Sun, Hu Zhang, Gangyi Ding, Zhedong Zheng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: historical movement data, Trajectory prediction seeks, dynamic entities, vehicles and pedestrians, Trajectory prediction

Comments: 13 pages, 7 figures, 4 tables

Abstract:Trajectory prediction seeks to forecast the future motion of dynamic entities, such as vehicles and pedestrians, given a temporal horizon of historical movement data and environmental context. A central challenge in this domain is the inherent uncertainty in real-time maps, arising from two primary sources: (1) positional inaccuracies due to sensor limitations or environmental occlusions, and (2) semantic errors stemming from misinterpretations of scene context. To address these challenges, we propose a novel unified framework that jointly models positional and semantic uncertainties and explicitly integrates them into the trajectory prediction pipeline. Our approach employs a dual-head architecture to independently estimate semantic and positional predictions in a dual-pass manner, deriving prediction variances as uncertainty indicators in an end-to-end fashion. These uncertainties are subsequently fused with the semantic and positional predictions to enhance the robustness of trajectory forecasts. We evaluate our uncertainty-aware framework on the nuScenes real-world driving dataset, conducting extensive experiments across four map estimation methods and two trajectory prediction baselines. Results verify that our method (1) effectively quantifies map uncertainties through both positional and semantic dimensions, and (2) consistently improves the performance of existing trajectory prediction models across multiple metrics, including minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR). Code will be available at this https URL.
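
The three metrics named at the end of the abstract have widely used definitions that are easy to state in code. The sketch below follows those common definitions. The function names, the 2D point format, and the 2.0 m miss threshold are assumptions made for illustration and may differ from the paper's evaluation protocol.

```python
import math

def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_ade(candidates, truth):
    """Lowest average displacement error over all predicted trajectories."""
    return min(
        sum(displacement(p, q) for p, q in zip(trajectory, truth)) / len(truth)
        for trajectory in candidates
    )

def min_fde(candidates, truth):
    """Lowest final-point displacement error over all predicted trajectories."""
    return min(displacement(trajectory[-1], truth[-1]) for trajectory in candidates)

def miss_rate(samples, threshold=2.0):
    """Share of samples whose best final point is farther than `threshold`.

    samples: list of (candidates, truth) pairs. The threshold is an assumed
    value in metres.
    """
    misses = sum(1 for candidates, truth in samples if min_fde(candidates, truth) > threshold)
    return misses / len(samples)
```

All three metrics take the best of several predicted modes, so they reward a model for including one good hypothesis rather than for the accuracy of every hypothesis.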

80. 【2603.29356】CIPHER: Counterfeit Image Pattern High-level Examination via Representation

Link: https://arxiv.org/abs/2603.29356

Authors: Kyeonghun Kim, Youngung Han, Seoyoung Ju, Yeonju Jean, YooHyun Kim, Minseo Choi, SuYeon Lim, Kyungtae Park, Seungwoo Baek, Sieun Hyeon, Nam-Joon Kim, Hyuk-Jae Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generative adversarial networks, Pattern High-level Examination, Counterfeit Image Pattern, Image Pattern High-level, adversarial networks

Comments: 6 pages, 2 figures. Accepted at IEEE-Asia 2025

Abstract:The rapid progress of generative adversarial networks (GANs) and diffusion models has enabled the creation of synthetic faces that are increasingly difficult to distinguish from real images. This progress, however, has also amplified the risks of misinformation, fraud, and identity abuse, underscoring the urgent need for detectors that remain robust across diverse generative models. In this work, we introduce Counterfeit Image Pattern High-level Examination via Representation (CIPHER), a deepfake detection framework that systematically reuses and fine-tunes discriminators originally trained for image generation. By extracting scale-adaptive features from ProGAN discriminators and temporal-consistency features from diffusion models, CIPHER captures generation-agnostic artifacts that conventional detectors often overlook. Through extensive experiments across nine state-of-the-art generative models, CIPHER demonstrates superior cross-model detection performance, achieving up to 74.33% F1-score and outperforming existing ViT-based detectors by over 30% in F1-score on average. Notably, our approach maintains robust performance on challenging datasets where baseline methods fail, with up to 88% F1-score on CIFAKE compared to near-zero performance from conventional detectors. These results validate the effectiveness of discriminator reuse and cross-model fine-tuning, establishing CIPHER as a promising approach toward building more generalizable and robust deepfake detection systems in an era of rapidly evolving generative technologies.

81. 【2603.29343】FOSCU: Feasibility of Synthetic MRI Generation via Duo-Diffusion Models for Enhancement of 3D U-Nets in Hepatic Segmentation

Link: https://arxiv.org/abs/2603.29343

Authors: Youngung Han, Kyeonghun Kim, Seoyoung Ju, Yeonju Jean, Minkyung Cha, Seohyoung Park, Hyeonseok Jung, Nam-Joon Kim, Woo Kyoung Jeong, Ken Ying-Kai Liao, Hyuk-Jae Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including restricted access, Communication Systems, Picture Archiving, Archiving and Communication, segmentation faces fundamental

Comments: 10 pages, 5 figures. Accepted at IEEE APCCAS 2025

Abstract:Medical image segmentation faces fundamental challenges, including restricted access to clinical datasets through Picture Archiving and Communication Systems (PACS), costly annotation, and data shortage. These systemic barriers significantly impede the development of robust segmentation algorithms. To address these challenges, we propose FOSCU, which integrates Duo-Diffusion, a 3D latent diffusion model with ControlNet that simultaneously generates high-resolution, anatomically realistic synthetic MRI volumes and corresponding segmentation labels, and an enhanced 3D U-Net training pipeline. Duo-Diffusion employs segmentation-conditioned diffusion to ensure spatial consistency and precise anatomical detail in the generated data. Experimental evaluation on 720 abdominal MRI scans shows that models trained with combined real and synthetic data yield a mean Dice score gain of 0.67% over those using only real data, and achieve a 36.4% reduction in Fréchet Inception Distance (FID), reflecting enhanced image fidelity.

82. 【2603.29328】Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning

Link: https://arxiv.org/abs/2603.29328

Authors: Kavindu Herath, Joshua Zhao, Saurabh Bagchi

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Keywords: synthetic corner patches, arise in practice, corner patches, Sign Recognition Benchmark, German Traffic Sign

Comments:

Abstract:Backdoor attacks on federated learning (FL) are most often evaluated with synthetic corner patches or out-of-distribution (OOD) patterns that are unlikely to arise in practice. In this paper, we revisit the backdoor threat to standard FL (a single global model) under a more realistic setting where triggers must be semantically meaningful, in-distribution, and visually plausible. We propose SABLE, a Semantics-Aware Backdoor for LEarning in federated settings, which constructs natural, content-consistent triggers (e.g., semantic attribute changes such as sunglasses) and optimizes an aggregation-aware malicious objective with feature separation and parameter regularization to keep attacker updates close to benign ones. We instantiate SABLE on CelebA hair-color classification and the German Traffic Sign Recognition Benchmark (GTSRB), poisoning only a small, interpretable subset of each malicious client's local data while otherwise following the standard FL protocol. Across heterogeneous client partitions and multiple aggregation rules (FedAvg, Trimmed Mean, MultiKrum, and FLAME), our semantics-driven triggers achieve high targeted attack success rates while preserving benign test accuracy. These results show that semantics-aligned backdoors remain a potent and practical threat in federated learning, and that robustness claims based solely on synthetic patch triggers can be overly optimistic.

83. 【2603.29313】HSFM: Hard-Set-Guided Feature-Space Meta-Learning for Robust Classification under Spurious Correlations

Link: https://arxiv.org/abs/2603.29313

Authors: Aryan Yazdan Parast, Khawar Islam, Soyoun Won, Basim Azam, Naveed Akhtar

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep neural networks, Empirical Risk Minimization, Deep neural, make predictions, neural networks

Comments:

Abstract:Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating at the backbone output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.

84. 【2603.29301】Self-Consistency for LLM-Based Motion Trajectory Generation and Verification

Link: https://arxiv.org/abs/2603.29301

Authors: Jiaju Ma, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language reasoning, language reasoning tasks, improving LLM performance, unsupervised manner, effective technique

Comments: Accepted to CVPR 2026

Abstract:Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., "Move the circle in a spiral path"), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at this https URL.
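
The consistency test described here, whether one trajectory can be mapped onto another by an allowed transformation, can be illustrated for the similarity group (translation, rotation, uniform scale) in 2D. The sketch below fits the best such transform in closed form by treating points as complex numbers. It assumes both trajectories have the same number of points in corresponding order, and the function names and tolerance are hypothetical. It is not the authors' algorithm, which also handles other transformation groups and clustering.

```python
def similarity_residual(source, target):
    """RMS error left after the best 2D similarity transform of source onto target.

    source, target: equal-length lists of (x, y) points in corresponding order.
    """
    n = len(source)
    z = [complex(x, y) for x, y in source]
    w = [complex(x, y) for x, y in target]
    # Centering both point sets removes the translation component.
    z_mean = sum(z) / n
    w_mean = sum(w) / n
    z = [p - z_mean for p in z]
    w = [p - w_mean for p in w]
    energy = sum(abs(p) ** 2 for p in z)
    if energy == 0:
        raise ValueError("source trajectory collapses to a single point")
    # Least-squares complex factor: its modulus is the scale, its angle the rotation.
    a = sum(p.conjugate() * q for p, q in zip(z, w)) / energy
    return (sum(abs(a * p - q) ** 2 for p, q in zip(z, w)) / n) ** 0.5

def is_consistent(source, target, tolerance=1e-6):
    """True if target is (nearly) a similarity-transformed copy of source."""
    return similarity_residual(source, target) <= tolerance
```

A practical tolerance would need to scale with the size of the trajectories and the noise in sampled LLM outputs; the fixed value here only keeps the sketch short.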

85. 【2603.29296】MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting

Link: https://arxiv.org/abs/2603.29296

Authors: Haoran Zhou, Gim Hee Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: physical world, monocular videos, videos is essential, essential for understanding, understanding the physical

Comments: Accepted to CVPR 2026

Abstract:Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: this https URL.

86. 【2603.29295】GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Link: https://arxiv.org/abs/2603.29295

Authors: Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current deepfake attribution, generative methods due, detection works tend, exhibit poor generalization, Current deepfake

Comments:

Abstract:Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to their limited exploration of visual modalities alone. They also tend to assess attribution or detection performance on unseen advanced generators only coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we build a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and that the preservation of the target gaze in facial images generated by GANs and diffusion models varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.

87. 【2603.29291】MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network

Link: https://arxiv.org/abs/2603.29291

Authors: Guozhi Qiu, Zhiwei Chen, Zixu Li, Qinlei Huang, Zhiheng Fu, Xuemeng Song, Yupeng Hu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Composed Image Retrieval, target image satisfying, Composed Image, Image Retrieval, reference image

Comments: Accepted by ICASSP 2026

Abstract:Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of "modifying the reference image according to the text instructions". However, existing CIR methods face two limitations: (1) frequency bias leading to "Rare Sample Neglect", and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Codes are available at this https URL.

88. 【2603.29281】PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Link: https://arxiv.org/abs/2603.29281

Authors: Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: critical gap exists, specialized perceptual demands, PRISM, critical gap, gap exists

Comments:

Abstract:A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at this https URL

89. 【2603.29272】MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

Link: https://arxiv.org/abs/2603.29272

Authors: Soomin Park, Eunseong Lee, Kwang Bin Lee, Sung-Hee Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: physics-based humanoid control, humanoid control, physics-based humanoid, flexible motion adaptation, flexible motion

Comments: CVPR 2026

Abstract:We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.

90. 【2603.29271】ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

Link: https://arxiv.org/abs/2603.29271

Authors: Wenyang Chen,Zhanxuan Hu,Yaping Zhang,Hailong Ning,Yonghang Tai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Training-free open-vocabulary remote, empowered by vision-language, vision-language models, Training-free open-vocabulary, category-agnostic semantic understanding

Comments:

Click to view abstract

Abstract:Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: this https URL

91. 【2603.29270】Unbiased Model Prediction Without Using Protected Attribute Information

Link: https://arxiv.org/abs/2603.29270

Authors: Puspita Majumdar,Surbhi Mittal,Mayank Vatsa,Richa Singh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: provide disparate performance, deep learning community, bias mitigation, learning community, continue to provide

Comments:

Click to view abstract

Abstract:The problem of bias persists in the deep learning community as models continue to provide disparate performance across different demographic subgroups. Therefore, several algorithms have been proposed to improve the fairness of deep models. However, a majority of these algorithms utilize the protected attribute information for bias mitigation, which severely limits their application in real-world scenarios. To address this concern, we propose a novel algorithm for bias mitigation, termed Non-Protected Attribute-based Debiasing (NPAD), that does not require the protected attribute information. The proposed NPAD algorithm utilizes the auxiliary information provided by the non-protected attributes to optimize the model for bias mitigation. Further, two different loss functions, Debiasing via Attribute Cluster Loss (DACL) and Filter Redundancy Loss (FRL), are proposed to optimize the model for fairness goals. Multiple experiments are performed on the LFWA and CelebA datasets for facial attribute prediction, and a significant reduction in bias across different gender and age subgroups is observed.

92. 【2603.29258】Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

Link: https://arxiv.org/abs/2603.29258

Authors: Jingqi Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated strong capabilities, CLIP, demonstrated strong, strong capabilities, wide range

Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP's original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
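
The presence-based objective described above pulls an image embedding toward its original caption and pushes it away from the negated caption. A minimal sketch of that idea is an InfoNCE-style loss with the negated caption as the single negative. The function names, the two-way softmax, and the temperature value 0.07 are assumptions for illustration; the paper's exact loss is not given in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def presence_contrastive_loss(image, caption, negated_caption, temperature=0.07):
    """-log softmax of the original caption against the negated caption (assumed form)."""
    pos = cosine(image, caption) / temperature
    neg = cosine(image, negated_caption) / temperature
    # Log-sum-exp computed stably by subtracting the larger logit.
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos
```

The loss is small when the image embedding is closer to the original caption than to its negated version, and grows as the negated caption becomes the closer of the two.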

93. 【2603.29252】Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Link: https://arxiv.org/abs/2603.29252

Authors: Tao Chen,Kun Zhang,Qiong Wu,Xiao Chen,Chao Chang,Xiaoshuai Sun,Yiyi Zhou,Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large language, Large language Models, Multimodal Large, language Models, Large language

Comments: CVPR 2026

Click to view abstract

Abstract:Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human video-watching behavior, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of infinite length, unlike previous methods that process all video information at once and are subject to an input upper limit. Concretely, FlexMem first treats the visual KV caches as the memory sources and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that, on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.

94. 【2603.29245】Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method

Link: https://arxiv.org/abs/2603.29245

Authors: Yanjiao Song,Bowen Cai,Timo Balz,Zhenfeng Shao,Neema Simon Sumari,James Magidi,Walter Musakwa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: urban morphology characterization, large inter-city variations, remains challenging due, ambiguous height cues, Monocular building height

Comments:

Click to view abstract

Abstract:Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations. Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.

95. 【2603.29239】Diffusion Mental Averages

Link: https://arxiv.org/abs/2603.29239

Authors: Phonphrm Thawatdamrongkit,Sukit Seripanitkarn,Supasorn Suwajanakorn

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Mental Averages, introduce Diffusion Mental, mental average, Diffusion Mental, Mental Averages

Comments: CVPR 2026. Project page: [this https URL](https://diffusion-mental-averages.github.io/)

Click to view abstract

Abstract:Can a diffusion model produce its own "mental average" of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model's semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.

96. 【2603.29236】M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding

Link: https://arxiv.org/abs/2603.29236

Authors: U.V.B.L. Udugama,George Vosselman,Francesco Nex

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: stream remains challenging, single image stream, image stream remains, achieving reliable real-time, ease of deployment

Comments: 6 pages, 5 figures, 5 tables. Preprint under review

Click to view abstract

Abstract:Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.

97. 【2603.29228】CCDNet: Learning to Detect Camouflage against Distractors in Infrared Small Target Detection

Link: https://arxiv.org/abs/2603.29228

Authors: Zikai Liao,Zhaozheng Yin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tasks have critical, maritime search, critical applications, applications in areas, areas like wilderness

Comments:

Click to view abstract

Abstract:Infrared small target detection (IRSTD) tasks have critical applications in areas like wilderness rescue and maritime search. However, detecting infrared targets is challenging due to their low contrast and tendency to blend into complex backgrounds, effectively camouflaging themselves. Additionally, other objects with similar features (distractors) can cause false alarms, further degrading detection performance. To address these issues, we propose a novel Camouflage-aware Counter-Distraction Network (CCDNet) in this paper. We design a backbone with Weighted Multi-branch Perceptrons (WMPs), which aggregates self-conditioned multi-level features to accurately represent the target and background. Based on these rich features, we then propose a novel Aggregation-and-Refinement Fusion Neck (ARFN) to refine structures/semantics from shallow/deep feature maps, and bidirectionally reconstruct the relations between the targets and the backgrounds, highlighting the targets while suppressing the complex backgrounds to improve detection accuracy. Furthermore, we present a new Contrastive-aided Distractor Discriminator (CaDD), enforcing adaptive similarity computation both locally and globally between the real targets and the backgrounds to more precisely discriminate distractors, so as to reduce the false alarm rate. Extensive experiments on infrared image datasets confirm that CCDNet outperforms other state-of-the-art methods.

98. 【2603.29219】SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation

Link: https://arxiv.org/abs/2603.29219

Authors: Mohammad Amer Khalil,Raghad Nahas,Ahmad Nassar,Khloud Al Jallad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Arabic Sign Language, Syrian Arabic Sign, Sign language, primary approach, high-resource sign languages

Comments:

Click to view abstract

99. 【2603.29211】Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

Link: https://arxiv.org/abs/2603.29211

Authors: Zhiqian Zhang,Xu Zhao,Xiaoqing Xu,Guangdong Liang,Weijia Wang,Xiaolei Lv,Bo Li,Jun Gao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent years, continued to improve, multimodal large models, fine-grained visual perception, visual perception

Comments: 41 pages, 10 figures

Click to view abstract

100. 【2603.29209】LightHarmony3D: Harmonizing Illumination and Shadows for Object Insertion in 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.29209

Authors: Tianyu Huang,Zhenyang Ren,Zhenchen Wan,Jiyang Zheng,Wenjie Wang,Runnan Chen,Mingming Gong,Tongliang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, geometry and appearance, enables high-fidelity reconstruction, mesh insertion, Gaussian

Comments:

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) enables high-fidelity reconstruction of scene geometry and appearance. Building on this capability, inserting external mesh objects into reconstructed 3DGS scenes enables interactive editing and content augmentation for immersive applications such as AR/VR, virtual staging, and digital content creation. However, achieving physically consistent lighting and shadows for mesh insertion remains challenging, as it requires accurate scene illumination estimation and multi-view consistent rendering. To address this challenge, we present LightHarmony3D, a novel framework for illumination-consistent mesh insertion in 3DGS scenes. Central to our approach is our proposed generative module that predicts a full 360° HDR environment map at the insertion location via a single forward pass. By leveraging generative priors instead of iterative optimization, our method efficiently captures dominant scene illumination and enables physically grounded shading and shadows for inserted meshes while maintaining multi-view coherence. Furthermore, we introduce the first dedicated benchmark for mesh insertion in 3DGS, providing a standardized evaluation framework for assessing lighting consistency and photorealism. Extensive experiments across multiple real-world reconstruction datasets demonstrate that LightHarmony3D achieves state-of-the-art realism and multi-view consistency.

101. 【2603.29194】Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention

Link: https://arxiv.org/abs/2603.29194

Authors: Sunil Tiwari,Payal Fofadiya

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Long-horizon dialogue systems, dialogue systems suffer, Long-horizon dialogue, unstable memory retention, extended sessions

Comments:

Click to view abstract

Abstract:Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, and 56.90% six-period retention while reducing false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long-term retention and reasoning stability under constrained context budgets.

102. 【2603.29193】Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions

Link: https://arxiv.org/abs/2603.29193

Authors: Payal Fofadiya,Sunil Tiwari

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, increasing context length, experience performance degradation

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression framework that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to retain essential conversational information while controlling context growth. The approach is evaluated on LOCOMO, LOCCO, and LongBench benchmarks to assess answer quality, retrieval accuracy, coherence preservation, and efficiency. Experimental results demonstrate that the proposed method achieves consistent improvements in conversational stability and retrieval performance while reducing token usage and inference latency compared with existing memory and compression-based approaches. These findings indicate that adaptive context compression provides an effective balance between long-term memory preservation and computational efficiency in persistent LLM interactions.

103. 【2603.29191】3D Architect: An Automated Approach to Three-Dimensional Modeling

Link: https://arxiv.org/abs/2603.29191

Authors: Sunil Tiwari,Payal Fofadiya,Vicky Vishwakarma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Harris Detector, orthographic views, control points, points, object

Comments:

Click to view abstract

Abstract:The aim of our paper is to render an object in three dimensions using a set of its orthographic views. A corner detector (the Harris detector) is applied to the input views to obtain control points. These control points are projected perpendicular to their respective views in order to construct an envelope. A set of points describing the object in three dimensions is obtained from the intersection of these mutually perpendicular envelopes. This set of points is used to regenerate the surfaces of the object using computational geometry. Finally, the object is rendered in three dimensions using OpenGL.
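
The envelope intersection step above can be sketched as voxel carving from three orthographic silhouettes. Note that this swaps in a simpler technique: it uses filled 0/1 silhouettes on a small grid instead of Harris control points, and the function name `carve` and the array layouts are assumptions made for illustration, not the paper's method.

```python
def carve(front, top, side):
    """Keep voxel (x, y, z) only if all three orthographic silhouettes cover it.

    Assumed layouts on an n x n grid: front[z][x], top[y][x], side[z][y].
    """
    n = len(front)
    return {
        (x, y, z)
        for x in range(n)
        for y in range(n)
        for z in range(n)
        if front[z][x] and top[y][x] and side[z][y]
    }


# Toy 2 x 2 x 2 example: a bar along the y axis at x = 0, z = 0.
front = [[1, 0], [0, 0]]   # seen along y: only (x=0, z=0) is filled
top = [[1, 0], [1, 0]]     # seen along z: x=0 is filled for both y values
side = [[1, 1], [0, 0]]    # seen along x: z=0 is filled for both y values
voxels = carve(front, top, side)
```

Each silhouette extruded along its viewing axis forms one envelope, and the set comprehension keeps exactly the voxels that lie inside all three.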

104. 【2603.29186】SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

Link: https://arxiv.org/abs/2603.29186

Authors: Ryosuke Matsuda,Keito Kudo,Haruto Yoshida,Nobuyuki Shimizu,Jun Suzuki

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: synthetic long-video meta-evaluation, paper proposes, proposes the synthetic, synthetic long-video, SLVMEval benchmark focuses

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
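
The pairwise meta-evaluation described above reduces to a simple measurement: for each (high-quality, low-quality) pair, check whether the evaluation system scores the high-quality video higher. The sketch below assumes a hypothetical `evaluator` callable that returns one scalar per video; the dict-based "videos" and the `sharpness` field are toy stand-ins, not part of the benchmark.

```python
def pairwise_meta_accuracy(pairs, evaluator):
    """Fraction of (high_quality, low_quality) pairs ranked correctly.

    Ties count as errors, since the evaluator failed to prefer the better video.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for good, bad in pairs if evaluator(good) > evaluator(bad))
    return correct / len(pairs)


# Hypothetical stand-in: each "video" is a dict and the toy evaluator reads one field.
def toy_evaluator(video):
    return video["sharpness"]


pairs = [
    ({"sharpness": 0.9}, {"sharpness": 0.4}),
    ({"sharpness": 0.7}, {"sharpness": 0.8}),  # evaluator prefers the degraded clip
    ({"sharpness": 0.6}, {"sharpness": 0.1}),
    ({"sharpness": 0.5}, {"sharpness": 0.2}),
]
accuracy = pairwise_meta_accuracy(pairs, toy_evaluator)
```

Running the same function with human judgments in place of `evaluator` gives the human reference accuracy against which automatic systems are compared.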

105. 【2603.29185】Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting

Link: https://arxiv.org/abs/2603.29185

Authors: Huaqi Tao,Bingxi Liu,Guangcheng Chen,Fulin Tang,Li He,Hong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision, Visual relocalization, estimating a camera, Feature Gaussian Splatting, fundamental task

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera's pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.

106. 【2603.29171】Segmentation of Gray Matters and White Matters from Brain MRI data

Link: https://arxiv.org/abs/2603.29171

Authors: Chang Sun,Rui Shi,Tsukasa Koike,Tetsuro Sekine,Akio Morita,Tetsuya Sakai

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: diagnosing neurological disorders, monitoring disease progression, studying brain anatomy, magnetic resonance imaging, FSL FAST

Comments:

Click to view abstract

Abstract:Accurate segmentation of brain tissues such as gray matter and white matter from magnetic resonance imaging is essential for studying brain anatomy, diagnosing neurological disorders, and monitoring disease progression. Traditional methods, such as FSL FAST, produce tissue probability maps but often require task-specific adjustments and face challenges with diverse imaging conditions. Recent foundation models, such as MedSAM, offer a prompt-based approach that leverages large-scale pretraining. In this paper, we propose a modified MedSAM model designed for multi-class brain tissue segmentation. Our preprocessing pipeline includes skull stripping with FSL BET, tissue probability mapping with FSL FAST, and conversion of these maps into 2D axial, sagittal, and coronal slices with multi-class labels (background, gray matter, and white matter). We extend MedSAM's mask decoder to three classes, freezing the pre-trained image encoder and fine-tuning the prompt encoder and decoder. Experiments on the IXI dataset achieve Dice scores up to 0.8751. This work demonstrates that foundation models like MedSAM can be adapted for multi-class medical image segmentation with minimal architectural modifications. Our findings suggest that such models can be extended to more diverse medical imaging scenarios in future work.

107. 【2603.29167】CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study

Link: https://arxiv.org/abs/2603.29167

Authors: Bo Ma,Jinsong Wu,Weiqi Yan,Hongjiang Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: provide complementary views, computer-aided diagnosis models, single imaging modality, Chest X-ray, computed tomography

Comments:

Click to view abstract

Abstract:Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher-student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.
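
The gap between accuracy and macro-F1 on a four-image subset is easier to see with a worked example. The sketch below computes accuracy, macro-F1, and balanced accuracy in plain Python for one hypothetical outcome: three positive cases, one negative case, and a classifier that predicts positive every time. This confusion pattern is an assumption chosen for illustration; it yields 0.750 accuracy, about 0.429 macro-F1, and 0.500 balanced accuracy, but the abstract does not state which predictions produced the reported figures.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, defined as 0.0 when the class has no TP, FP, or FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in the labels."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(1 for i in indices if y_pred[i] == cls) / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical four-image validation subset: 3 disease cases, 1 non-disease case.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]  # a classifier that always predicts "disease"
```

On such a small, imbalanced subset a single misclassified minority case moves macro-F1 sharply, which is why imbalance-sensitive metrics and resampling matter for this kind of pilot.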

108. 【2603.29165】LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

Link: https://arxiv.org/abs/2603.29165

Authors: Haihong Hao,Lei Chen,Mingfei Han,Changlin Li,Dong An,Yuqiang Yang,Zhihui Li,Xiaojun Chang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: largely ignoring, visual dynamics induced, models primarily reason, Existing, VLN

Comments: Project page: [this https URL](https://abdd.top/latentpilot/)

Click to view abstract

Abstract:Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on the R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot's superior understanding of environment-action dynamics in the scene. Project page: this https URL

109. 【2603.29163】SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2603.29163

Authors: Wenchao Sun,Xuewu Lin,Keyu Chen,Zixiang Pei,Xiang Li,Yining Shi,Sifa Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: action space, widely adopted, selecting the optimal, scoring candidate trajectories, scoring

Comments:

Click to view abstract

Abstract:End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at this https URL.
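
The two ideas above, a factorized vocabulary and coarse-to-fine scoring, can be sketched in a few lines. The example composes trajectories as (path, speed profile) pairs, scores each factor separately to build a shortlist, then rescores only the shortlisted combinations. Everything here is a toy assumption: the function name `coarse_to_fine_select`, the scalar representation of paths and speed profiles, and the hand-written scoring functions stand in for learned scorers and are not the released implementation.

```python
from itertools import product


def coarse_to_fine_select(paths, speed_profiles, path_score, speed_score,
                          joint_score, k_paths=2, k_speeds=2):
    """Score factors separately, then rescore only the composed shortlist."""
    top_paths = sorted(paths, key=path_score, reverse=True)[:k_paths]
    top_speeds = sorted(speed_profiles, key=speed_score, reverse=True)[:k_speeds]
    shortlist = list(product(top_paths, top_speeds))
    return max(shortlist, key=lambda pair: joint_score(*pair))


# Toy vocabulary (hypothetical): a path is a lateral offset in metres,
# a speed profile is a constant speed in m/s.
paths = [-2.0, -0.5, 0.2, 1.5, 3.0]
speed_profiles = [0.0, 3.0, 7.0, 12.0]

best = coarse_to_fine_select(
    paths,
    speed_profiles,
    path_score=lambda p: -abs(p),          # prefer staying near the lane centre
    speed_score=lambda v: -abs(v - 8.0),   # prefer a target speed of 8 m/s
    joint_score=lambda p, v: -(abs(p) + 0.1 * abs(v - 8.0)),
)
```

Here 5 paths and 4 speed profiles cover 20 composed trajectories while only 9 factor scores and 4 joint scores are computed, which is the combinatorial saving the factorized structure is meant to provide.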

110. 【2603.29133】Dual-Imbalance Continual Learning for Real-World Food Recognition

Link: https://arxiv.org/abs/2603.29133

Authors: Xiaoyan Zhang, Jiangpeng He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world dietary logging, dietary logging scenarios, logging scenarios naturally, Visual food recognition, severe data imbalance

Comments: Accepted to 3rd MetaFood at CVPR 2026. Code is available at [this https URL](https://github.com/xiaoyanzhang1/DIME)

Abstract:Visual food recognition in real-world dietary logging scenarios naturally exhibits severe data imbalance, where a small number of food categories appear frequently while many others occur rarely, resulting in long-tailed class distributions. In practice, food recognition systems often operate in a continual learning setting, where new categories are introduced sequentially over time. However, existing studies typically assume that each incremental step introduces a similar number of new food classes, which rarely happens in the real world, where the number of newly observed categories can vary significantly across steps, leading to highly uneven learning dynamics. As a result, continual food recognition exhibits a dual imbalance: imbalanced samples within each food class and imbalanced numbers of new food classes to learn at each incremental learning step. In this work, we introduce DIME, a Dual-Imbalance-aware Adapter Merging framework for continual food recognition. DIME learns lightweight adapters for each task using parameter-efficient fine-tuning and progressively integrates them through a class-count guided spectral merging strategy. A rank-wise threshold modulation mechanism further stabilizes the merging process by preserving dominant knowledge while allowing adaptive updates. The resulting model maintains a single merged adapter for inference, enabling efficient deployment without accumulating task-specific modules. Experiments on realistic long-tailed food benchmarks under our step-imbalanced setup show that the proposed method consistently improves by more than 3% over the strongest existing continual learning baselines. Code is available at this https URL.

111. 【2603.29101】Enhancing Box and Block Test with Computer Vision for Post-Stroke Upper Extremity Motor Evaluation

Link: https://arxiv.org/abs/2603.29101

Authors: David Robinson, Animesh Gupta, Elizabeth Clark, Olga Melnik, Qiushi Fu, Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: time-based task metrics, ordinal scoring, lacks sensitivity, task metrics, rely on ordinal

Comments: Submitted to EMBC 2026

Abstract:Standard clinical assessments of upper-extremity motor function after stroke either rely on ordinal scoring, which lacks sensitivity, or time-based task metrics, which do not capture movement quality. In this work, we present a computer vision-based framework for analysis of upper-extremity movement during the Box and Block Test (BBT) through world-aligned joint angles of fingers, arm, and trunk without depth sensors or calibration objects. We apply this framework to a dataset of 136 BBT recordings collected from 48 healthy individuals and 7 individuals post stroke. Using unsupervised dimensionality reduction of joint-angle features, we analyze movement patterns without relying on expert clinical labels. The resulting embeddings show separation between healthy movement patterns and stroke-related movement deviations. Importantly, some patients with the same BBT scores can be separated with different postural patterns. These results show that world-aligned joint angles can capture meaningful information of upper-extremity functions beyond standard time-based BBT scores, with no effort from the clinician other than monocular video recordings of the patient using a phone or camera. This work highlights the potential of a camera-based, calibration-free framework to measure movement quality in clinical assessments without changing the widely adopted clinical routine.
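
As a rough illustration of what a joint-angle feature is, the sketch below computes the angle at a joint from three 3D keypoints. The abstract does not specify how the paper derives its world-aligned angles, so `joint_angle` and the example keypoints are hypothetical.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    u = [p - q for p, q in zip(a, b)]
    v = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

# Hypothetical keypoints in metres: upper arm along x, forearm along y.
shoulder, elbow, wrist = (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.25, 0.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

Stacking such angles per frame for fingers, arm, and trunk would give the kind of feature vector that a dimensionality-reduction method can then embed.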

112. 【2603.29092】TrajectoryMover: Generative Movement of Object Trajectories in Videos

Link: https://arxiv.org/abs/2603.29092

Authors: Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intuitive editing operations, short video clips, Generative video editing, intuitive editing, editing operations

Comments: 24 pages, 8 figures. Project page: [this https URL](https://chhatrekiran.github.io/trajectorymover)

Abstract:Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object's 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video's plausibility and identity. Yet a method to move an object's 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair cannot easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: this https URL

113. 【2603.29090】HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

Link: https://arxiv.org/abs/2603.29090

Authors: Jaber Jaber, Osama Jaber

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: video remain limited, flat latent representations, ignore causal structure, predict future states, collapse temporal dynamics

Comments: 10 pages, 3 tables, 4 figures, 1 algorithm. Code: [this https URL](https://github.com/rightnow-ai/hclsm)

Abstract:World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three-level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: this https URL

114. 【2603.29089】WorldFlow3D: Flowing Through 3D Distributions for Unbounded World Generation

Link: https://arxiv.org/abs/2603.29089

Authors: Amogh Joshi, Julian Ost, Felix Heide

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: computer vision, modeling in computer, generation, foundational task, Unbounded

Comments:

Abstract:Unbounded 3D world generation is emerging as a foundational task for scene modeling in computer vision, graphics, and robotics. In this work, we present WorldFlow3D, a novel method capable of generating unbounded 3D worlds. Building upon a foundational property of flow matching - namely, defining a path of transport between two data distributions - we model 3D generation more generally as a problem of flowing through 3D data distributions, not limited to conditional denoising. We find that our latent-free flow approach generates causal and accurate 3D structure, and can use this as an intermediate distribution to guide the generation of more complex structure and high-quality texture - all while converging more rapidly than existing methods. We enable controllability over generated scenes with vectorized scene layout conditions for geometric structure control and visual texture control through scene attributes. We confirm the effectiveness of WorldFlow3D on both real outdoor driving scenes and synthetic indoor scenes, validating cross-domain generalizability and high-quality generation on real data distributions. We confirm favorable scene generation fidelity over existing approaches in all tested settings for unbounded scene generation. For more, see this https URL.

115. 【2603.29080】Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Link: https://arxiv.org/abs/2603.29080

Authors: Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: modern multi-modal models, modern multi-modal, embedding space, shared embedding space, multi-modal models

Comments:

Abstract:Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
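
The post-processing step mentioned at the end of this abstract, moving one modality towards the mean of the other, might look roughly like the sketch below. The function names, the `alpha` step size, and the absence of any re-normalization are assumptions for illustration, not details taken from the paper.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def shift_toward(source, target, alpha=1.0):
    """Translate every source embedding by alpha * (mean(target) - mean(source))."""
    gap = [t - s for s, t in zip(mean_vec(source), mean_vec(target))]
    return [[x + alpha * g for x, g in zip(v, gap)] for v in source]

# Hypothetical 3-d embeddings with a clear offset between the two modalities.
image_emb = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
text_emb = [[0.0, 1.0, -0.5], [0.2, 0.8, -0.5]]

shifted_text = shift_toward(text_emb, image_emb, alpha=1.0)
```

With `alpha=1.0` the two modality means coincide after the shift, while the relative geometry within the shifted modality is unchanged because every vector receives the same translation.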

116. 【2603.29057】LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

Link: https://arxiv.org/abs/2603.29057

Authors: Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: global body dynamics, Skeleton-based isolated sign, subtle finger movements, Skeleton-based isolated, demands fine-grained understanding

Comments:

Abstract:Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincare alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.
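
For readers unfamiliar with Poincare-ball geometry, the sketch below computes the standard geodesic distance on the unit ball with curvature -1. The abstract describes an adaptive hyperbolic space, so a fixed curvature is a simplifying assumption here, and `poincare_distance` is an illustrative name rather than the paper's API.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    # Both points must lie strictly inside the ball (squared norm < 1).
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

origin = [0.0, 0.0]
near_centre = [0.5, 0.0]
near_boundary = [0.99, 0.0]

d_centre = poincare_distance(origin, near_centre)
d_boundary = poincare_distance(origin, near_boundary)
```

Distances grow rapidly towards the boundary, which is why hyperbolic embeddings are often used for hierarchical or multi-scale structure. A contrastive objective could use the negative distance as a similarity score.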

117. 【2603.29045】Let the Abyss Stare Back: Adaptive Falsification for Autonomous Scientific Discovery

Link: https://arxiv.org/abs/2603.29045

Authors: Peiran Li, Fangzhou Lin, Shuo Xing, Jiashuo Sun, Dylan Zhang, Siyuan Yang, Chaoqun Ni, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sufficiently strong search, strong search process, Autonomous scientific discovery, dangerous regime, evaluator is frozen

Comments: 15 pages, 1 figure, 4 tables

Abstract:Autonomous scientific discovery is entering a more dangerous regime: once the evaluator is frozen, a sufficiently strong search process can learn to win the exam without learning the mechanism the task was meant to reveal. This is the idea behind our title. To let the abyss stare back is to make evaluation actively push against the candidate through adaptive falsification, rather than passively certify it through static validation. We introduce DASES, a falsification-driven framework in which an Innovator, an Abyss Falsifier, and a Mechanistic Causal Extractor co-evolve executable scientific artifacts and scientifically admissible counterexample environments under a fixed scientific contract. In a controlled loss-discovery problem with a single editable locus, DASES rejects artifacts that static validation would have accepted, identifies the first candidate that survives the admissible falsification frontier, and discovers FNG-CE, a loss that transfers beyond the synthetic discovery environment and consistently outperforms CE and CE+L2 under controlled comparisons across standard benchmarks, including ImageNet.

118. 【2603.29036】Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos

Link: https://arxiv.org/abs/2603.29036

Authors: Yujin Ham, Junho Kim, Vivek Boominathan, Guha Balakrishnan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: walking tour, source of image, image data, data to develop, walking tour videos

Comments:

Abstract:Egocentric "walking tour" videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives limits their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D/4D models of urban locations.

119. 【2603.29034】The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations

Link: https://arxiv.org/abs/2603.29034

Authors: Kushal Vyas, Alper Kayabasi, Daniel Kim, Vishwanath Saragadam, Ashok Veeraraghavan, Guha Balakrishnan

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: implicit neural representations, parameter initialization strategies, neural representations, approximation and convergence, convergence properties

Comments: Accepted to CVPR 2026. Project page: [this https URL](https://kushalvyas.github.io/noisepretraining.html)

Abstract:The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. While several data-driven initialization methods demonstrate significant improvements over standard random sampling, the reasons for their success -- specifically, whether they encode classical statistical signal priors or more complex features -- remain poorly understood. In this study, we explore this phenomenon through a series of experimental analyses leveraging noise pretraining. We pretrain INRs on diverse noise classes (e.g., Gaussian, Dead Leaves, Spectral) and measure their ability to both fit unseen signals and encode priors for an inverse imaging task (denoising). Our analyses on image and video data reveal a surprising finding: simply pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, we also find that noise with the classic $1/|f^\alpha|$ spectral structure of natural images achieves an excellent balance of signal fitting and inverse imaging capabilities, performing on par with the best data-driven initialization methods. This finding enables more efficient INR training in applications lacking sufficient prior domain-specific data. For more details, visit project page at this https URL
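
To make the $1/|f^\alpha|$ spectral structure concrete, here is a minimal 1D sketch that synthesizes such noise as a sum of sinusoids with random phases and amplitudes proportional to $1/f^\alpha$. The paper works with images and video; the 1D setting, the function name `spectral_noise`, and the default parameters are assumptions made for brevity.

```python
import math
import random

def spectral_noise(n, alpha=1.0, seed=0):
    """Return n samples whose amplitude spectrum falls off as 1 / f**alpha."""
    rng = random.Random(seed)
    freqs = range(1, n // 2)  # skip the DC term (f = 0) and the Nyquist bin
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in freqs]
    return [
        sum(k ** -alpha * math.cos(2.0 * math.pi * k * t / n + ph)
            for k, ph in zip(freqs, phases))
        for t in range(n)
    ]

signal = spectral_noise(64, alpha=1.0)       # spectrally structured noise
flat = spectral_noise(64, alpha=0.0)         # equal amplitude at every frequency
```

Setting `alpha=0.0` gives every frequency the same amplitude, which is closer to the unstructured noise classes in the abstract, while larger `alpha` concentrates energy in low frequencies.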

120. 【2603.29029】MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

Link: https://arxiv.org/abs/2603.29029

Authors: Bharath Krishnamurthy, Ajita Rattani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: augmenting text-based conditioning, Recent multimodal face, Recent multimodal, edge maps, augmenting text-based

Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026. 22 pages (Main Text + Supplementary), 14 figures, 5 tables, 4 algorithms. Project page: [this https URL](https://vcbsl.github.io/MMFace-DiT/) and Code Repository: [this https URL](https://github.com/Bharath-K3/MMFace-DiT)

Abstract:Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: this https URL

121. 【2603.29022】UltraG-Ray: Physics-Based Gaussian Ray Casting for Novel Ultrasound View Synthesis

Link: https://arxiv.org/abs/2603.29022

Authors: Felix Duelmer, Jakob Klaushofer, Magdalena Wysocki, Nassir Navab, Mohammad Farid Azampour

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anatomically plausible views, generating anatomically plausible, acquired frames, offering new capabilities, data augmentation

Comments: Accepted at MIDL 2026 / to appear in PMLR

Abstract:Novel view synthesis (NVS) in ultrasound has gained attention as a technique for generating anatomically plausible views beyond the acquired frames, offering new capabilities for training clinicians or data augmentation. However, current methods struggle with complex tissue and view-dependent acoustic effects. Physics-based NVS aims to address these limitations by including the ultrasound image formation process into the simulation. Recent approaches combine a learnable implicit scene representation with an ultrasound-specific rendering module, yet a substantial gap between simulation and reality remains. In this work, we introduce UltraG-Ray, a novel ultrasound scene representation based on a learnable 3D Gaussian field, coupled to an efficient physics-based module for B-mode synthesis. We explicitly encode ultrasound-specific parameters, such as attenuation and reflection, into a Gaussian-based spatial representation and realize image synthesis within a novel ray casting scheme. In contrast to previous methods, this approach naturally captures view-dependent attenuation effects, thereby enabling the generation of physically informed B-mode images with increased realism. We compare our method to state-of-the-art approaches and observe consistent gains in image quality metrics (up to 15% increase on MS-SSIM), demonstrating clear improvement in terms of realism of the synthesized ultrasound images.

122. 【2603.29009】MEDiC: Multi-objective Exploration of Distillation from CLIP

Link: https://arxiv.org/abs/2603.29009

Authors: Konstantinos Georgiou, Maofeng Tang, Hairong Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstructing masked patches, Masked image modeling, latent feature space, reconstructing masked, masked patches

Comments:

Abstract:Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher's inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.

123. 【2603.28997】GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates

Link: https://arxiv.org/abs/2603.28997

Authors: Youngjoong Kwon, Yao He, Heejung Choi, Chen Geng, Zhengmao Liu, Jiajun Wu, Ehsan Adeli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: monocular RGB stream, feed-forward human performance, human performance capture, performance capture method, RGB stream

Comments:

Abstract:We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.

124. 【2603.28995】Hybrid Quantum-Classical AI for Industrial Defect Classification in Welding Images

Link: https://arxiv.org/abs/2603.28995

Authors: Akshaya Srinivasan, Xiaoyin Cheng, Jianming Yi, Alexander Geng, Desislava Ivanova, Andreas Weinmann, Ali Moghiseh

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantum Physics (quant-ph)

Keywords: advancing automated quality, automated quality control, machine learning offers, Hybrid quantum-classical, offers a promising

Comments:

Abstract:Hybrid quantum-classical machine learning offers a promising direction for advancing automated quality control in industrial settings. In this study, we investigate two hybrid quantum-classical approaches for classifying defects in aluminium TIG welding images and benchmark their performance against a conventional deep learning model. A convolutional neural network is used to extract compact and informative feature vectors from weld images, effectively reducing the higher-dimensional pixel space to a lower-dimensional feature space. Our first quantum approach encodes these features into quantum states using a parameterized quantum feature map composed of rotation and entangling gates. We compute a quantum kernel matrix from the inner products of these states, defining a linear system in a higher-dimensional Hilbert space corresponding to the support vector machine (SVM) optimization problem and solving it using a Variational Quantum Linear Solver (VQLS). We also examine the effect of the quantum kernel condition number on classification performance. In our second method, we apply angle encoding to the extracted features in a variational quantum circuit and use a classical optimizer for model training. Both quantum models are tested on binary and multiclass classification tasks and the performance is compared with the classical CNN model. Our results show that while the CNN model demonstrates robust performance, hybrid quantum-classical models perform competitively. This highlights the potential of hybrid quantum-classical approaches for near-term real-world applications in industrial defect detection and quality assurance.
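
The idea of a quantum kernel built from state overlaps can be simulated classically for small inputs. The sketch below uses plain angle encoding with one RY rotation per feature and no entangling gates, which is a simplification of the feature map described in the abstract. `angle_encode` and `quantum_kernel` are illustrative names, not the authors' code.

```python
import math

def angle_encode(features):
    """Amplitudes of the product state built from RY(x_i)|0> on each qubit."""
    state = [1.0]
    for x in features:
        c, s = math.cos(x / 2.0), math.sin(x / 2.0)
        # Tensor the running state with the new qubit's amplitudes (c, s).
        state = [a * q for a in state for q in (c, s)]
    return state

def quantum_kernel(x, y):
    """Squared overlap |<phi(x)|phi(y)>|^2 between two encoded states."""
    overlap = sum(a * b for a, b in zip(angle_encode(x), angle_encode(y)))
    return overlap ** 2

# Hypothetical 3-feature vectors, e.g. CNN features rescaled to angles.
x = [0.3, 1.2, 2.0]
y = [0.5, 0.9, 1.4]
k_xy = quantum_kernel(x, y)
```

For this entanglement-free encoding the kernel has a closed form, the product of $\cos^2((x_i - y_i)/2)$ over features, which makes the simulation easy to check. A kernel matrix built this way could be passed to a classical SVM solver in place of the variational solver the paper uses.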

125. 【2603.28980】Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Link: https://arxiv.org/abs/2603.28980

Authors: Felix Wimbauer, Fabian Manhardt, Michael Oechsle, Nikolai Kalischek, Christian Rupprecht, Daniel Cremers, Federico Tombari

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rapidly maturing, world modeling, video generative models, text is rapidly, vast potential

Comments: Accepted at CVPR 2026 Findings; Find our project page under [this https URL](https://fwmb.github.io/stepper/)

Abstract:The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.

126. 【2603.28963】AutoWorld: Scaling Multi-Agent Traffic Simulation with Self-Supervised World Models

Link: https://arxiv.org/abs/2603.28963

Authors: Mozhgan Pourkeshavarz, Tianran Liu, Nicholas Rhinehart

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: autonomous driving systems, testing autonomous driving, traffic simulation, driving systems, central to developing

Comments:

Abstract:Multi-agent traffic simulation is central to developing and testing autonomous driving systems. Recent data-driven simulators have achieved promising results, but rely heavily on supervised learning from labeled trajectories or semantic annotations, making it costly to scale their performance. Meanwhile, large amounts of unlabeled sensor data can be collected at scale but remain largely unused by existing traffic simulation frameworks. This raises a key question: How can a method harness unlabeled data to improve traffic simulation performance? In this work, we propose AutoWorld, a traffic simulation framework that employs a world model learned from unlabeled occupancy representations of LiDAR data. Given world model samples, AutoWorld constructs a coarse-to-fine predictive scene context as input to a multi-agent motion generation model. To promote sample diversity, AutoWorld uses a cascaded Determinantal Point Process framework to guide the sampling processes of both the world model and the motion model. Furthermore, we designed a motion-aware latent supervision objective that enhances AutoWorld's representation of scene dynamics. Experiments on the WOSAC benchmark show that AutoWorld ranks first on the leaderboard according to the primary Realism Meta Metric (RMM). We further show that simulation performance consistently improves with the inclusion of unlabeled LiDAR data, and study the efficacy of each component with ablations. Our method paves the way for scaling traffic simulation realism without additional labeling. Our project page contains additional visualizations and released code.

127. 【2603.28931】Decoding Functional Networks for Visual Categories via GNNs

Link: https://arxiv.org/abs/2603.28931

Authors: Shira Karmi, Galia Avidan, Tammy Riklin Raviv

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale brain networks, brain networks represent, Natural Scenes Dataset, Understanding how large-scale, represent visual categories

Comments: Accepted for publication in IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Abstract:Understanding how large-scale brain networks represent visual categories is fundamental to linking perception and cortical organization. Using high-resolution 7T fMRI from the Natural Scenes Dataset, we construct parcel-level functional graphs and train a signed Graph Neural Network that models both positive and negative interactions, with a sparse edge mask and class-specific saliency. The model accurately decodes category-specific functional connectivity states (sports, food, vehicles) and reveals reproducible, biologically meaningful subnetworks along the ventral and dorsal visual pathways. This framework bridges machine learning and neuroscience by extending voxel-level category selectivity to a connectivity-based representation of visual processing.

128. 【2603.28896】Fisheye3R: Adapting Unified 3D Feed-Forward Foundation Models to Fisheye Lenses

Link: https://arxiv.org/abs/2603.28896

Authors: Ruxiao Duan, Erin Hong, Dongxu Zhao, Eric Turner, Alex Wong, Yunwen Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Feed-forward foundation models, Feed-forward foundation, perspective images, images, tested on wide

Comments:

Abstract:Feed-forward foundation models for multi-view 3-dimensional (3D) reconstruction have been trained on large-scale datasets of perspective images; when tested on wide field-of-view images, e.g., from a fisheye camera, their performance degrades. Their error arises from changes in spatial positions of pixels due to a non-linear projection model that maps 3D points onto the 2D image plane. While one may surmise that training on fisheye images would resolve this problem, there are far fewer fisheye images with ground truth than perspective images, which limits generalization. To enable inference on imagery exhibiting high radial distortion, we propose Fisheye3R, a novel adaptation framework that extends these multi-view 3D reconstruction foundation models to natively accommodate fisheye inputs without performance regression on perspective images. To address the scarcity of fisheye images and ground truth, we introduce flexible learning schemes that support self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. Extensive experiments across three foundation models, including VGGT, $\pi^3$, and MapAnything, demonstrate that our approach consistently improves camera pose, depth, point map, and field-of-view estimation on fisheye images.
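
To make the "non-linear projection model" in the abstract concrete, the sketch below compares how far from the image center a 3D point lands under the pinhole (perspective) model, r = f·tan(θ), and under the equidistant fisheye model, r = f·θ, where θ is the angle between the ray and the optical axis. The equidistant model is an assumption chosen for illustration; the abstract does not say which fisheye model the authors use, and the function names are hypothetical.

```python
import math


def perspective_radius(theta: float, f: float = 1.0) -> float:
    """Radial image distance under the pinhole model: r = f * tan(theta)."""
    return f * math.tan(theta)


def equidistant_fisheye_radius(theta: float, f: float = 1.0) -> float:
    """Radial image distance under the equidistant fisheye model: r = f * theta."""
    return f * theta


if __name__ == "__main__":
    # Print both radii for rays at increasing angles from the optical axis.
    for degrees in (10, 45, 80):
        theta = math.radians(degrees)
        print(degrees, perspective_radius(theta), equidistant_fisheye_radius(theta))
```

Near the optical axis the two radii almost coincide, but they diverge quickly at wide angles, which is consistent with the abstract's point that pixel positions shift relative to what a perspective-trained model expects.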

129. 【2603.28887】OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

Link: https://arxiv.org/abs/2603.28887

Authors: Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li, Nicholas Rhinehart

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Data-driven autonomous driving, pre-recorded driving logs, Data-driven autonomous, autonomous driving simulation, autonomous driving

Comments:

Abstract:Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents an 80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: a W-DiT-based static occupancy world model and the Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations in architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming the previous asset-based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.

130. 【2603.28776】DF-ACBlurGAN: Structure-Aware Conditional Generation of Internally Repeated Patterns for Biomaterial Microtopography Design

Link: https://arxiv.org/abs/2603.28776

Authors: Rongjun Dong, Xin Chen, Morgan R Alexander, Karthikeyan Sivakumar, Reza Omdivar, David A Winkler, Grazziela Figueredo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: periodic structures poses, local texture statistics, machine learning, computer vision models, generate images

Comments:

Abstract:Learning to generate images with internally repeated and periodic structures poses a fundamental challenge for machine learning and computer vision models, which are typically optimised for local texture statistics and semantic realism rather than global structural consistency. This limitation is particularly pronounced in applications requiring strict control over repetition scale, spacing, and boundary coherence, such as microtopographical biomaterial surfaces. In this work, biomaterial design serves as a use case to study conditional generation of repeated patterns under weak supervision and class imbalance. We propose DF-ACBlurGAN, a structure-aware conditional generative adversarial network that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity. Conditioning on experimentally derived biological response labels, the model synthesises designs aligned with target functional outcomes. Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches.

131. 【2603.29660】STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer

Link: https://arxiv.org/abs/2603.29660

Authors: Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi

Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Next-generation radio astronomy, Next-generation radio, robust morphology analysis, radio astronomy surveys, radio astronomy

Comments: 19 pages

Abstract:Next-generation radio astronomy surveys are producing millions of resolved sources, but robust morphology analysis remains difficult across heterogeneous telescopes and imaging pipelines. We present STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for transferable radio astronomy image encoders. STRADAViT combines a mixed-survey pretraining dataset, radio astronomy-aware view generation, and controlled continued pretraining through reconstruction-only, contrastive-only, and two-stage branches. Pretraining uses 512x512 radio astronomy cutouts from MeerKAT, ASKAP, LOFAR/LoTSS, and SKA data. We evaluate transfer with linear probing and fine-tuning on three morphology benchmarks: MiraBest, LoTSS DR2, and Radio Galaxy Zoo. Relative to the initialization used for continued pretraining, the best two-stage STRADAViT models improve Macro-F1 in all reported linear-probe settings and in most fine-tuning settings, with the largest gain on RGZ DR1. Relative to strong DINOv2 baselines, gains are selective but remain positive on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning. A targeted DINOv2-initialized HCL ablation further shows that the adaptation recipe is not specific to a single starting point. The released STRADAViT checkpoint remains the preferred model because it offers competitive transfer at lower token count and downstream cost than the DINOv2-based alternative. These results show that radio astronomy-aware view generation and staged continued pretraining provide a stronger starting point than out-of-the-box Vision Transformers for radio astronomy transfer.

132. 【2603.29438】Polyhedral Unmixing: Bridging Semantic Segmentation with Hyperspectral Unmixing via Polyhedral-Cone Partitioning

Link: https://arxiv.org/abs/2603.29438

Authors: Antoine Bottenmuller (CMM, PSL, STIM), Etienne Decencière (CMM, PSL, STIM), Petr Dokládal (CMM, PSL, STIM)

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: spectral image analysis, image analysis, central problems, Semantic segmentation, spectral image

Comments:

Abstract:Semantic segmentation and hyperspectral unmixing are two central problems in spectral image analysis. The former assigns each pixel a discrete label corresponding to its material class, whereas the latter estimates pure material spectra, called endmembers, and, for each pixel, a vector representing material abundances in the observed scene. Despite their complementarity, these two problems are usually addressed independently. This paper aims to bridge these two lines of work by formally showing that, under the linear mixing model, pixel classification by dominant materials induces polyhedral-cone regions in the spectral space. We leverage this fundamental property to propose a direct segmentation-to-unmixing pipeline that performs blind hyperspectral unmixing from any semantic segmentation by constructing a polyhedral-cone partition of the space that best fits the labeled pixels. Signed distances from pixels to the estimated regions are then computed, linearly transformed via a change of basis in the distance space, and projected onto the probability simplex, yielding an initial abundance estimate. This estimate is used to extract endmembers and recover final abundances via matrix pseudo-inversion. Because the segmentation method can be freely chosen, the user gains explicit control over the unmixing process, while the rest of the pipeline remains essentially deterministic and lightweight. Beyond improving interpretability, experiments on three real datasets demonstrate the effectiveness of the proposed approach when associated with appropriate clustering algorithms, and show consistent improvements over recent deep and non-deep state-of-the-art methods. The code is available at: this https URL
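
One step named in the abstract, projecting a vector onto the probability simplex, can be sketched in isolation. The code below uses the standard sort-based Euclidean projection; whether the authors use this exact variant is an assumption, and the function name and sample inputs are hypothetical. The result is a non-negative vector that sums to one, which is what an abundance estimate requires.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1}.

    Sort-based method: find the shift tau such that clipping (v - tau)
    at zero yields a vector summing to one.
    """
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, u_k in enumerate(sorted_desc, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of k values; keep the last one.
        if u_k - candidate > 0:
            tau = candidate
    return [max(x - tau, 0.0) for x in v]


if __name__ == "__main__":
    # Hypothetical transformed distances for one pixel and three materials.
    raw_scores = [0.9, 0.4, -0.2]
    abundances = project_to_simplex(raw_scores)
    print(abundances, sum(abundances))
```

The projection leaves a vector that already lies on the simplex unchanged and zeroes out strongly negative components, so materials far from a pixel's region contribute no abundance.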

133. 【2603.29181】Retinal Malady Classification using AI: A novel ViT-SVM combination architecture

Link: https://arxiv.org/abs/2603.29181

Authors: Shashwat Jha, Vishvaditya Luhach, Raju Poddar

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Central serous retinopathy, complete vision loss, Macular Holes, Diabetic Retinopathy, Central serous

Comments:

Abstract:Macular Holes, Central Serous Retinopathy and Diabetic Retinopathy are among the most widespread maladies of the eye and are responsible for partial or complete vision loss, which makes early detection of these defects essential for the well-being of the patient. This study introduces a hybrid architecture based on a Vision Transformer and a Support Vector Machine (ViT-SVM) and analyses its performance in classifying optical coherence tomography (OCT) scans, with the intention of automating the early detection of these retinal defects.

134. 【2603.29176】Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

Link: https://arxiv.org/abs/2603.29176

Authors: Siyuan Du, Siyi Li, Shuwei Bai, Ang Li, Haolin Li, Mingqing Xiao, Yang Pan, Dongsheng Li, Weidi Xie, Yanfeng Wang, Ya Zhang, Chencheng Zhang, Jiangchao Yao

Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)

Keywords: million people worldwide, ten million people, Parkinson disease, affects over ten, people worldwide

Comments:

Abstract:Parkinson's disease (PD) affects over ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection, increasing non-negligible surgical risk and cost. Previous explorations either resort to limited statistical biomarkers that are insufficient to characterize variability, or employ AI-driven methods that are prone to overfitting and opacity. We bridge this gap with a pretraining-finetuning framework to predict outcomes directly from resting-state fMRI. Critically, a generative virtual brain foundation model, pretrained on a collective dataset (2707 subjects, 5621 sessions) to capture universal disorder patterns, was finetuned on PD cohorts receiving TI (n=51) or DBS (n=55) to yield individualized virtual brains with high fidelity to empirical functional connectivity (r=0.935). By constructing counterfactual estimations between pathological and healthy neural states within these personalized models, we predicted clinical responses (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validations (n=14, n=11) highlight the feasibility of clinical translation. Moreover, our framework provides state-dependent regional patterns linked to response, offering hypothesis-generating mechanistic insights.

135. 【2603.29115】Schrödinger's Seed: Purr-fect Initialization for an Impurr-fect Universe

Link: https://arxiv.org/abs/2603.29115

Authors: Mi chen, Renhao Ye

Categories: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Context, Abstract, cats, seed, conventionally fixed

Comments: 3 pages, 1 figure, 21 cats

Abstract:Context. Random seed selection in deep learning is often arbitrary -- conventionally fixed to values such as 42, a number with no known feline endorsement. Aims. We propose that cats, as liminal beings with a historically ambiguous relationship to quantum mechanics, are better suited to this task than random integers. Methods. We construct a cat-driven seed generator inspired by the first Friedmann equation, and test it by mapping 21 domestic cats' physical properties -- mass, coat pattern, eye colour, and name entropy -- via a Monte "Catlo" sampling procedure. Results. Cat-driven seeds achieve a mean accuracy of 92.58%, outperforming the baseline seed of 42 by $\sim$2.5%. Cats from astrophysicist households perform marginally better, suggesting cosmic insight may be contagious. Conclusions. The Universe responds better to cats than to arbitrary integers. Whether cats are aware of this remains unknown.