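This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

705 papers were updated today, including:

Natural language processing: 104 papers
Information retrieval: 14 papers
Computer vision: 168 papers

Natural Language Processing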

1. 【2603.16867】Efficient Reasoning on the Edge

Link: https://arxiv.org/abs/2603.16867

Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: complex problem-solving tasks, context requirements make, large context requirements, Large language models, performance across complex

Comments: Project page: https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/

Abstract:Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

2. 【2603.16862】Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Link: https://arxiv.org/abs/2603.16862

Authors: Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Recent advances, extended multi-turn interactions, advances in Large

Abstract:Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
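To make the structured event calendar in the abstract above more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Event` class, its field names, and the `events_in_range` helper are hypothetical. The abstract only states that events are subject-verb-object tuples with resolved datetime ranges and entity aliases, indexed alongside the dialogue turns.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical subject-verb-object event with a resolved datetime range."""
    subject: str
    verb: str
    obj: str
    start: datetime
    end: datetime
    aliases: tuple = ()  # alternative names for the entities involved
    turn_id: int = -1    # assumed pointer back to the source dialogue turn


def events_in_range(calendar, start, end):
    """Return events whose time range overlaps [start, end], oldest first."""
    hits = [e for e in calendar if e.start <= end and e.end >= start]
    return sorted(hits, key=lambda e: e.start)


calendar = [
    Event("user", "adopted", "a dog", datetime(2025, 3, 2), datetime(2025, 3, 2),
          aliases=("Biscuit",), turn_id=14),
    Event("user", "visited", "Lisbon", datetime(2025, 6, 10), datetime(2025, 6, 17),
          turn_id=87),
]

# Time-filtered lookup for a question such as "What did I do in June 2025?"
june = events_in_range(calendar, datetime(2025, 6, 1), datetime(2025, 6, 30))
print([(e.subject, e.verb, e.obj) for e in june])
```

Keeping a `turn_id` on each event is one assumed way to link the event calendar back to a turn calendar, so that the full conversational context can be fetched once a matching event is found.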

3. 【2603.16856】Online Experiential Learning for Language Models

Link: https://arxiv.org/abs/2603.16856

Authors: Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei

Categories: Computation and Language (cs.CL)

Keywords: improving large language, rich experience accumulated, large language models, language models relies, leaving the rich

Abstract:The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.

4. 【2603.16848】Mediocrity is the key for LLM as a Judge Anchor Selection

Link: https://arxiv.org/abs/2603.16848

Authors: Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend

Categories: Computation and Language (cs.CL)

Keywords: evaluating open-ended generation, open-ended generation, anchor, anchor selection, evaluating open-ended

Abstract:The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
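The "quadratic scalability costs" mentioned in the abstract above can be illustrated with a short Python calculation. This is only an arithmetic sketch: the function names are hypothetical, the model counts are arbitrary examples, and the counts are judged pairs per prompt, with the anchor counted as one of the models.

```python
def all_pairs_comparisons(n_models: int) -> int:
    """Judged pairs per prompt when every model meets every other model."""
    return n_models * (n_models - 1) // 2


def anchor_comparisons(n_models: int) -> int:
    """Judged pairs per prompt when every non-anchor model meets one anchor."""
    return n_models - 1


for n in (10, 50, 100):
    print(n, all_pairs_comparisons(n), anchor_comparisons(n))
# 10 45 9
# 50 1225 49
# 100 4950 99
```

The anchor-based count grows linearly with the number of models, which is why the outcome of every comparison then depends on how informative that single anchor is.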

5. 【2603.16827】Prompt Programming for Cultural Bias and Alignment of Large Language Models

Link: https://arxiv.org/abs/2603.16827

Authors: Maksim Eren, Eric Michalak, Brian Cook, Johnny Seales Jr

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Culture shapes reasoning, large language models, exhibit cultural biases, strategic decision-making, shapes reasoning

Comments: 10 pages, pre-print

Abstract:Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social sciences survey based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce the use of prompt programming with DSPy for this problem, treating prompts as modular, optimizable programs, to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.

6. 【2603.16817】Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Link: https://arxiv.org/abs/2603.16817

Authors: Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, Caitlyn Heqi Yin, Ramya Korlakai Vinayak

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, Large language, frequently hallucinate, knowledge-intensive applications, conformal factuality

Comments: 56 pages

Abstract:Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over 100× fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and fragility of conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches for reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
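The filtering step described in the abstract above (score atomic claims, then keep only those above a threshold calibrated on held-out data) can be sketched in Python as follows. This is a generic split-conformal sketch under an exchangeability assumption, not the paper's pipeline: the function names, the toy scores, and the choice of nonconformity score (the highest score given to an incorrect claim) are assumptions made for illustration.

```python
import math


def calibrate_threshold(calibration, alpha):
    """Pick a score threshold from held-out examples.

    `calibration` is a list of examples, each a list of (score, is_correct)
    pairs for its atomic claims. The per-example nonconformity score is the
    highest score assigned to an incorrect claim (-inf if there is none).
    """
    scores = sorted(
        max((s for s, ok in example if not ok), default=float("-inf"))
        for example in calibration
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index (1-based)
    if k > n:
        return float("inf")  # too little data for this alpha: keep nothing
    return scores[k - 1]


def filter_claims(claims, threshold):
    """Keep only the claims whose score is strictly above the threshold."""
    return [(text, s) for text, s in claims if s > threshold]


calibration = [
    [(0.9, True), (0.4, False)],
    [(0.8, True), (0.7, True)],
    [(0.6, False), (0.95, True)],
]
threshold = calibrate_threshold(calibration, alpha=0.5)
kept = filter_claims([("A", 0.9), ("B", 0.35), ("C", 0.4)], threshold)
print(threshold, kept)  # 0.4 [('A', 0.9)]
```

A stricter `alpha` pushes the threshold up, and with too few calibration examples it becomes infinite, so every claim is removed. That is one way the "vacuous outputs" at high factuality levels noted in finding (i) can arise.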

7. 【2603.16783】SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Link: https://arxiv.org/abs/2603.16783

Authors: Jonggeun Lee, Junseong Pyo, Jeongmin Park, Yohan Jo

Categories: Computation and Language (cs.CL)

Keywords: spoken user, spoken user behaviors, agents require exposure, full diversity, people interact

Abstract:Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

8. 【2603.16761】SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit

Link: https://arxiv.org/abs/2603.16761

Authors: Yibo Li, Qiongxiu Li

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Orthogonal Matching Pursuit, private training text, reveal that private, private training, reconstructed from shared

Comments: 18 pages, 4 figures, 13 tables

Abstract:Gradient inversion attacks reveal that private training text can be reconstructed from shared gradients, posing a privacy risk to large language models (LLMs). While prior methods perform well in small-batch settings, scaling to larger batch sizes and longer sequences remains challenging due to severe signal mixing, high computational cost, and degraded fidelity. We present SOMP (Subspace-Guided Orthogonal Matching Pursuit), a scalable gradient inversion framework that casts text recovery from aggregated gradients as a sparse signal recovery problem. Our key insight is that aggregated transformer gradients retain exploitable head-wise geometric structure together with sample-level sparsity. SOMP leverages these properties to progressively narrow the search space and disentangle mixed signals without exhaustive search. Experiments across multiple LLM families, model scales, and five languages show that SOMP consistently outperforms prior methods in the aggregated-gradient regime. For long sequences at batch size B=16, SOMP achieves substantially higher reconstruction fidelity than strong baselines, while remaining computationally competitive. Even under extreme aggregation (up to B=128), SOMP still recovers meaningful text, suggesting that privacy leakage can persist in regimes where prior attacks become much less effective.

9. 【2603.16759】TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Link: https://arxiv.org/abs/2603.16759

Authors: Victoria Graf, Valentina Pyatkin, Nouha Dziri, Nathan Lambert, Hannaneh Hajishirzi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: language model interaction, common and critical, critical mode, mode of language, language model

Abstract:Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.

10. 【2603.16749】Probing Cultural Signals in Large Language Models through Author Profiling

Link: https://arxiv.org/abs/2603.16749

Authors: Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, Large language, societal impact, raising concerns, biases they encode

Abstract:Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models' prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (this https URL).

11. 【2603.16737】Retrieving Counterfactuals Improves Visual In-Context Learning

Link: https://arxiv.org/abs/2603.16737

Authors: Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: achieved impressive performance, disentangle fine-grained visual, underlying causal relationships, fine-grained visual attributes, multimodal reasoning tasks

Comments: CVPR 2026

Abstract:Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at this https URL.

12. 【2603.16733】IQuest-Coder-V1 Technical Report

Link: https://arxiv.org/abs/2603.16733

Authors: Jian Yang, Wei Zhang, Shawn Guo, Zhengmao Ye, Lin Jing, Shark Liu, Yizhi Li, Jiajun Wu, Cening Liu, X. Ma, Yuyang Song, Siwei Wu, Yuwen Li, L. Liao, T. Zheng, Ziling Huang, Zelong Huang, Che Liu, Yan Xing, Renyuan Li, Qingsong Cai, Hanxu Yan, Siyue Wang, Shikai Li, Jason Klein Liu, An Huang, Yongsheng Kang, Jinxing Zhang, Chuan Hao, Haowen Wang, Weicheng Gu, Ran Tao, Mingjie Tang, Peihao Wu, Jianzhou Wang, Xianglong Liu, Weifeng Lv, Bryan Dai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: code large language, large language models, large language, code large, code

Abstract:In this report, we introduce the IQuest-Coder-V1 series (7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.

13. 【2603.16718】Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

Link: https://arxiv.org/abs/2603.16718

Authors: Mohamed Adel, Bashar Alhafni, Nizar Habash

Categories: Computation and Language (cs.CL)

Keywords: Large language models, produce explicit linguistic, explicit linguistic structure, Large language, structure remains unclear

Abstract:Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.

14. 【2603.16672】CritiSense: Critical Digital Literacy and Resilience Against Misinformation

Link: https://arxiv.org/abs/2603.16672

Authors: Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: social media undermines, media undermines informed, undermines informed decision-making, public trust, social media

Comments: resilience, disinformation, misinformation, fake news, propaganda

Abstract:Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 3+ months, we have reached 300+ active users. It is freely available to all users on the Apple App Store (this https URL) and Google Play Store (this https URL). Demo Video: this https URL

15. 【2603.16660】Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?

Link: https://arxiv.org/abs/2603.16660

Authors: Aishwarya Ramasethu, Niyathi Allu, Rohin Garg, Harshwardhan Fartale, Dun Li Chan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: achieved strong performance, Large Language Models, translation remains limited, machine translation remains, extremely low-resource machine

Comments: 18 pages (9 main paper and 9 Appendix), 1 figure, 19 tables. Accepted at LoResMT 2026: EACL 2026 Workshop. OpenReview link: https://openreview.net/forum?id=mg0UfW2sdc#discussion

Abstract:Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model's vocabulary, the gains are often modest and sensitive to few shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.

16. 【2603.16654】Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Link: https://arxiv.org/abs/2603.16654

Authors: Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Reasoning-focused large language, large language models, lack step-level annotations, NLP tasks, diagnosing reasoning failures

Abstract:Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at this https URL and the code at this https URL.

17. 【2603.16643】Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy

Link: https://arxiv.org/abs/2603.16643

Authors: Zhaoxin Feng, Zheng Chen, Jianfei Ma, Yip Tin Po, Emmanuele Chersoni, Bo Li

Categories: Computation and Language (cs.CL)

Keywords: Alignment techniques, inadvertently induce sycophancy, techniques often inadvertently, inadvertently induce, Alignment

Abstract:Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies studied this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments etc. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority-bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.

18. 【2603.16642】When AI Navigates the Fog of War

Link: https://arxiv.org/abs/2603.16642

Authors: Ming Li, Xirui Li, Tianyi Zhou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: historically obvious, trajectory becomes historically, Middle East conflict, Middle East, models

Abstract:Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.

19. 【2603.16622】Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

Link: https://arxiv.org/abs/2603.16622

Authors: Ryo Kishino, Riku Shiomi, Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira

Categories: Computation and Language (cs.CL)

Keywords: fixed training recipe, target model, continued pretraining, directly distilling, distilling a language

Comments:

Abstract:Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.
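The abstract above reports alignment as a reduction in KL divergence to the target model. As a minimal sketch of that quantity only (not the paper's weighting method), the snippet below computes KL(p || q) for two discrete next-token distributions. The function name, the direction of the divergence, and the toy distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


# Hypothetical next-token distributions over a three-token vocabulary.
target = [0.7, 0.2, 0.1]
base = [0.4, 0.4, 0.2]
print(kl_divergence(target, base))
```

A value of zero means the two distributions are identical; larger values mean the base model is further from the target on this context.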

20. 【2603.16606】Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Link: https://arxiv.org/abs/2603.16606

Authors: Omnilingual SONAR Team: João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

Categories: Computation and Language (cs.CL)

Keywords: encoders typically cover, limiting their adoption, stronger alignment, sentence encoders typically, typically cover

Comments:

Abstract:Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousand language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

21. 【2603.16601】Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

Link: https://arxiv.org/abs/2603.16601

Authors: Mo El-Haj

Categories: Computation and Language (cs.CL)

Keywords: unified analytical framework, Arabic song lyrics, Modern Standard Arabic, open Arabic corpus, covers Classical Arabic

Comments: 10 pages

Abstract:We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at this https URL.

22. 【2603.16590】BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Link: https://arxiv.org/abs/2603.16590

Authors: Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multi-modal Large Language, deploying Multi-modal Large, Language Models, Large Language

Comments: 30 pages, 13 figures, 7 tables

Abstract:Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

23. 【2603.16578】When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

Link: https://arxiv.org/abs/2603.16578

Authors: Zelin Zhang, Fei Cheng, Chenhui Chu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, outcome-based reinforcement learning, severe scalability bottleneck, Large Language, computationally expensive ground-truth

Comments: work in progress

Abstract:Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model's foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.

24. 【2603.16574】Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects

Link: https://arxiv.org/abs/2603.16574

Authors: Titus von der Malsburg, Sebastian Padó

Categories: Computation and Language (cs.CL)

Keywords: processing remains disputed, sentence processing remains, computational linguistics, remains disputed, English agreement attraction

Comments:

Abstract:Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.
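The surprisal-based linking mechanism mentioned above rests on a simple quantity: the surprisal of a word is the negative log of the probability the model assigns to it in context. The sketch below shows only that arithmetic. The function name and the probabilities for the two verb forms are made-up illustrative values, not figures from the paper.

```python
import math


def surprisal_bits(prob):
    """Surprisal of an event with probability `prob`, measured in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical model probabilities for the verb after
# "The key to the cabinets ..." (an agreement attraction context).
p_grammatical = 0.5      # "is"
p_ungrammatical = 0.125  # "are"

print(surprisal_bits(p_grammatical))
print(surprisal_bits(p_ungrammatical))
```

Under this linking assumption, the higher-surprisal continuation is the one predicted to take longer to read.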

25. 【2603.16567】Characterizing Delusional Spirals through Human-LLM Chat Logs

Link: https://arxiv.org/abs/2603.16567

Authors: Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, Stevie Chancellor, Eric Lin, Nick Haber, Desmond C. Ong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: disturbing anecdotal reports, negative psychological effects, large language models, disturbing anecdotal, legal discourse

Comments: To appear at ACM FAccT 2026

Abstract:As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional "spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.

26. 【2603.16557】BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Link: https://arxiv.org/abs/2603.16557

Authors: Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim, Wooseok Seo, Heuiyeen Yeen, Albert No

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, increasingly store user, Large language, increasingly store, personalization across interactions

Comments:

Abstract:Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.

27. 【2603.16553】EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

Link: https://arxiv.org/abs/2603.16553

Authors: Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, require emotional intelligence, strong cognitive intelligence, demonstrate strong cognitive

Comments:

Abstract:Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user's needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.

28. 【2603.16546】DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Link: https://arxiv.org/abs/2603.16546

Authors: Lei Wang, Min Huang, Eduard Dragut

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Sentiment Intensity Analysis, Aspect-Based Sentiment Intensity, Intensity Analysis, Sentiment Intensity, garnered increasing attention

Comments:

Abstract:Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA--particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples--remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

29. 【2603.16544】How often do Answers Change? Estimating Recency Requirements in Question Answering

Link: https://arxiv.org/abs/2603.16544

Authors: Bhawna Piryani, Zehra Mert, Adam Jatowt

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, leading to confident, incorrect responses, outdated knowledge

Comments:

Abstract:Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.

30. 【2603.16500】From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

Link: https://arxiv.org/abs/2603.16500

Authors: Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Reinforcement Learning, received extensive attention, extensive attention due, voting-based TTS strategies, model internal information

Comments: 15 pages

Abstract:Leveraging the model's internal information as the self-reward signal in Reinforcement Learning (RL) has received extensive attention due to its label-free nature. While prior works have made significant progress in applying the Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training based on voting-based TTS strategies often suffers from reward hacking problems. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model's confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the phenomenon of consistent reward hacking caused by the voting-based TTS strategies through diversity-targeted penalties. Benefiting from this training mechanism where model capability and self-reward signals complement each other, and the mitigation of reward hacking, DistriTTRL has achieved significant performance improvements across multiple models and benchmarks.

31. 【2603.16496】AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Link: https://arxiv.org/abs/2603.16496

Authors: Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao

Categories: Computation and Language (cs.CL)

Keywords: Large language model, support long-horizon interaction, Large language, agents increasingly rely, personalized assistance

Comments:

Abstract:Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

32. 【2603.16483】On the Emotion Understanding of Synthesized Speech

Link: https://arxiv.org/abs/2603.16483

Authors: Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao

Categories: Computation and Language (cs.CL)

Keywords: SER models, voice interaction, Speech Emotion Recognition, speech, Emotion

Comments:

Abstract:Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.

33. 【2603.16459】DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning

Link: https://arxiv.org/abs/2603.16459

Authors: Yanyu Qian, Yue Tan, Yixin Liu, Wang Yu, Shirui Pan

Categories: Computation and Language (cs.CL)

Keywords: iterative refinement capabilities, Diffusion large language, large language models, auto-regressive models due, refinement capabilities

Comments: 15 pages, 8 figures, 5 tables

Abstract:Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD, which bridges these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.
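The token-level uncertainty signal named in the abstract (entropy) and its evolution over denoising steps can be illustrated with a few lines of code. This is only a sketch of the underlying quantity, not of DynHD itself: the function name and the three-step trajectory of token distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions for a single token position at three
# successive denoising steps, from fully uncertain to fully committed.
trajectory = [
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.25, 0.125, 0.125],
    [1.0, 0.0, 0.0, 0.0],
]

entropies = [token_entropy(step) for step in trajectory]
print(entropies)
```

A detector in the spirit of the abstract would compare such an observed trajectory against an expected one; how the reference trajectory is learned is not specified here.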

34. 【2603.16440】Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models

Link: https://arxiv.org/abs/2603.16440

Authors: Rishaank Gupta

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language model, made substantial progress, fundamental limitation persists, Large language, components functionally encode

Comments:

Abstract:Large language model compression has made substantial progress through pruning, quantization, and low-rank decomposition, yet a fundamental limitation persists across all existing methods: compression budgets are allocated without any representation of what individual model components functionally encode. We term this the capability-blind compression problem and argue it is a root cause of two well-documented failures -- the insensitivity of perplexity-based evaluation to reasoning capability loss, and the abrupt phase transitions in model performance recently characterized by Ma et al. (2026). We propose Capability-Guided Compression (CGC), a framework that addresses this by using Sparse Autoencoder (SAE)-derived capability density maps to allocate differential compression budgets across transformer components. Capability density is a formally defined scalar measure combining the feature breadth, activation entropy, and cross-input consistency of a component's SAE feature activation distribution. We prove theoretically that components with higher capability density exhibit lower structural redundancy and reach their individual phase transition points at lower compression ratios, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium confirm that capability density is statistically independent of Wanda importance scores (Spearman rho = -0.054, n = 384 heads), establishing it as a genuinely novel compression signal orthogonal to all existing importance metrics. We report a negative result on PPL-based compression comparison and provide a principled diagnosis identifying GPT-2 Medium as an insufficient test bed for the full CGC hypothesis. The theoretical framework, density formalism, and orthogonality finding constitute a foundation for capability-aware compression research.

35. 【2603.16435】VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Link: https://arxiv.org/abs/2603.16435

Authors: Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, growing context length, enlarges the Key-Value, Language Models

Comments:

Abstract:The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
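The core idea of vector quantization, replacing each floating-point vector with the integer index of its nearest codeword, can be sketched in a few lines. This is generic nearest-codeword VQ, not the VQKV algorithm: the hand-written codebook, the toy key vectors, and all function names are assumptions for illustration, and the abstract does not say how VQKV builds its codebooks.

```python
def nearest_index(vector, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def encode(vectors, codebook):
    """Compress vectors to a list of integer codebook indices."""
    return [nearest_index(v, codebook) for v in vectors]


def decode(indices, codebook):
    """Reconstruct approximate vectors by looking up each index."""
    return [codebook[i] for i in indices]


# Hypothetical 2-D codebook and cached key vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
keys = [[0.9, 1.1], [0.1, -0.2], [-0.8, 0.7]]

indices = encode(keys, codebook)
reconstructed = decode(indices, codebook)
print(indices)
print(reconstructed)
```

Only the indices and the shared codebook need to be stored, which is where the compression comes from; the reconstruction error depends on how well the codebook covers the data.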

36. 【2603.16430】EngGPT2: Sovereign, Efficient and Open Intelligence

Link: https://arxiv.org/abs/2603.16430

Authors: G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico, C. Fioroni, A. Fontana, M. R. Scoleri, M. I. Mone, D. Franchi, M. C. Del Gaudio, F. Picariello, M. Gabusi, S. Bonura, V. Morreale, I. Bailo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Engineering Group Italian, Efficient and Open, Engineering Group, Group Italian LLM, iteration of Engineering

Comments:

Abstract:EngGPT2-16B-A3B is the latest iteration of Engineering Group's Italian LLM and is built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3's 36T or Llama3's 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth and one-sixth of the training data and consequent needed training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.

37. 【2603.16415】IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Link: https://arxiv.org/abs/2603.16415

Authors: Zhenghua Bao, Yi Shi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: existing retrieval-augmented generation, Multi-hop question answering, iterative multi-step reasoning, question answering, retrieval-augmented generation

Comments:

Click to view abstract

Abstract:Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.
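
The notion of "bridge entities shared across documents" can be sketched in a few lines. This is only an illustration under strong assumptions: entity extraction is taken as already done, the document ids and entities are invented, and the LLM step that turns a bridge entity into a retrievable bridging fact is not shown. None of the names below come from IndexRAG's code.

```python
from collections import defaultdict

def find_bridge_entities(doc_entities):
    """Map each entity that occurs in two or more documents to the sorted ids of those documents."""
    docs_by_entity = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            docs_by_entity[entity].add(doc_id)
    return {e: sorted(ids) for e, ids in docs_by_entity.items() if len(ids) >= 2}

# Hypothetical corpus: entities per document, assumed to be extracted beforehand.
doc_entities = {
    "doc_a": {"Marie Curie", "Warsaw"},
    "doc_b": {"Marie Curie", "Nobel Prize"},
    "doc_c": {"Nobel Prize", "Stockholm"},
}

bridges = find_bridge_entities(doc_entities)
```

In an index-time pipeline of this kind, each entry in `bridges` would be a candidate for generating one bridging fact that links the listed documents.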

38. 【2603.16411】RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery

Link: https://arxiv.org/abs/2603.16411

Authors: Abhishek Kumar, Aashraya Sachdeva

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Automatic Speech Recognition, Automatic Speech, Speech Recognition, recognition in Automatic, domain-specific terms

Comments: Under review. Submitted to Interspeech 2026

Click to view abstract

Abstract:Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses as evidence from ASR, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are exploited through different strategies, namely 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.

39. 【2603.16410】PlotTwist: A Creative Plot Generation Framework with Small Language Models

Link: https://arxiv.org/abs/2603.16410

Authors: Abhinav Thorat, Ravi Kolla, Jyotin Goel, Niranjan Pedanekar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: sustains global structure, Large Language Models, Creative plot generation, language models, recent Large Language

Comments: 30 pages, 3 figures

Click to view abstract

Abstract:Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with $\leq$ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to $200\times$ larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.

40. 【2603.16406】Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Link: https://arxiv.org/abs/2603.16406

Authors: Finnur Ágúst Ingimundarson, Steinunn Rut Friðriksdóttir, Bjarki Ármannsson, Iris Edda Nowenstein, Steinþór Steingrímsson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Model, evaluates current Large, current Large Language, paper evaluates current, current Large

Comments: Accepted to LREC 2026

Click to view abstract

Abstract:This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly in low/medium-resource languages. We show that benchmarks including synthetic or machine-translated data that has not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against the use of such methods without verification in low/medium-resource settings, as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks and synthetic or machine-translated benchmarks.

41. 【2603.16397】Fanar 2.0: Arabic Generative AI Stack

Link: https://arxiv.org/abs/2603.16397

Authors: FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Qatar Arabic-centric Generative, Qatar Arabic-centric, Arabic-centric Generative, Hamad Bin Khalifa, Bin Khalifa University

Comments:

Click to view abstract

Abstract:We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

42. 【2603.16354】PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Link: https://arxiv.org/abs/2603.16354

Authors: Hanif Rahman

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: remains severely underrepresented, underrepresented in NLP, language spoken, people that remains, remains severely

Comments:

Click to view abstract

Abstract:We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (from 8.08 to 6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (from 19.0% to 21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at this https URL, this https URL, and this https URL.
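
The two headline figures in this abstract are relative changes, which can be checked from the rounded endpoint values it quotes. The sketch below assumes the bracketed number pairs are before and after values; the helper name is ours.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

perplexity_change = relative_change(8.08, 6.06)  # about -0.25, i.e. a 25% reduction
f1_change = relative_change(19.0, 21.0)          # about +0.105, i.e. roughly 10% relative
```

The perplexity figure comes out at 25.0% from the rounded endpoints. The reported 25.1% was presumably computed from unrounded values.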

43. 【2603.16335】Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

Link: https://arxiv.org/abs/2603.16335

Authors: Jia Qing Yap

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: stream of Qwen, SAE latent activations, sparse autoencoders, agentic behavioral traits, residual stream

Comments: 14 pages, 3 figures

Click to view abstract

Abstract:We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios times 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
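
The step of projecting probe weights back through the SAE decoder can be sketched as a weighted sum of decoder directions, followed by adding the scaled result to a hidden state. This is a toy illustration only: the 4-latent, 3-dimensional decoder, the probe weights, and the function names are invented, and the paper may apply normalization or layer selection that the abstract does not describe.

```python
def decode_probe(probe_weights, decoder):
    """Project probe weights (one per SAE latent) through the decoder rows into activation space."""
    dim = len(decoder[0])
    return [
        sum(w * row[j] for w, row in zip(probe_weights, decoder))
        for j in range(dim)
    ]

def steer(hidden, steering_vector, multiplier):
    """Add the scaled steering vector to a hidden state."""
    return [h + multiplier * s for h, s in zip(hidden, steering_vector)]

# Hypothetical SAE decoder: each row is one latent's direction in a 3-dim activation space.
decoder = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, -1.0],
]
probe_weights = [0.5, -1.0, 2.0, 0.0]  # hypothetical linear-probe weights over the 4 latents

steering_vector = decode_probe(probe_weights, decoder)
steered = steer([0.0, 1.0, 0.0], steering_vector, multiplier=2.0)
```

Because the steering vector is a dense combination of decoder rows, the intervention never passes through the SAE's top-k selection. That is the property the abstract highlights.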

44. 【2603.16309】Omnilingual MT: Machine Translation for 1,600 Languages

Link: https://arxiv.org/abs/2603.16309

Authors: Omnilingual MT Team: Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà

Categories: Computation and Language (cs.CL)

Keywords: High-quality machine translation, High-quality machine, high bar, machine translation, Omnilingual Machine Translation

Comments:

Click to view abstract

Abstract:High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the "understanding" part of the puzzle in MT for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.

45. 【2603.16299】PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics

Link: https://arxiv.org/abs/2603.16299

Authors: Sam Kirkham

Categories: Computation and Language (cs.CL)

Keywords: task dynamic simulations, implementing dynamical models, Python toolkit, dynamic neural fields, coupled dynamic neural

Comments: Submitted to Interspeech 2026

Click to view abstract

Abstract:We introduce PyPhonPlan, a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations. The toolkit provides modular components for defining planning, perception and memory fields, as well as between-field coupling, gestural inputs, and using field activation profiles to solve tract variable trajectories. We illustrate the toolkit's capabilities through an example application: simulating production/perception loops with a coupled memory field, which demonstrates the framework's ability to model interactive speech dynamics using representations that are temporally-principled, neurally-grounded, and phonetically-rich. PyPhonPlan is released as open-source software and contains executable examples to promote reproducibility, extensibility, and cumulative computational development for speech communication research.

46. 【2603.16292】Attention-guided Evidence Grounding for Spoken Question Answering

Link: https://arxiv.org/abs/2603.16292

Authors: Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Spoken Question Answering, Spoken Question, effectively aligning acoustic, Question Answering, aligning acoustic queries

Comments:

Click to view abstract

Abstract:Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

47. 【2603.16258】Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus

Link: https://arxiv.org/abs/2603.16258

Authors: Martina Simonotti, Ludovica Pannitto, Eleonora Zucchini, Silvia Ballarè, Caterina Mauri

Categories: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Speech Recognition, Automatic Speech, spoken Italian, implementation of Automatic

Comments:

Click to view abstract

Abstract:This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.

48. 【2603.16245】How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

Link: https://arxiv.org/abs/2603.16245

Authors: Jiancheng Dong, Pengyue Jia, Derong Xu, Jiawei Cheng, Jingyu Peng, Chao Zhang, Bowen Liu, Xin Sun, Lixin Su, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: weakens row-column adjacency, LLMs typically linearize, typically linearize, fit their autoregressive, weakens row-column

Comments: 16 pages, 5 figures

Click to view abstract

Abstract:LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision-text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.

49. 【2603.16244】More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Link: https://arxiv.org/abs/2603.16244

Authors: Song Tae-Eun

Categories: Computation and Language (cs.CL)

Keywords: improves LLM verification, improves LLM, LLM verification, Dynamic Cross-Context Review, Review

Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings when the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior QA exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount -- within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
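
The interplay of precision, recall, and F1 reported above can be checked with the standard formula F1 = 2PR / (P + R). The sketch assumes the reported figures are aggregate values that combine through this formula; if the paper averages per artifact, the numbers would differ slightly. The single-pass recall is derived here, not reported in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def recall_from_f1(f1, precision):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)

# Single-pass CCR: F1 = 0.376 at precision 0.30 implies a recall of about 0.50.
single_pass_recall = recall_from_f1(0.376, 0.30)

# Multi-turn review: recall rises by 0.08 while precision falls to 0.20.
multi_turn_recall = single_pass_recall + 0.08
multi_turn_f1 = f1_score(0.20, multi_turn_recall)  # roughly 0.30
```

Under these assumptions the multi-turn F1 lands near 0.30, close to the 0.303 reported for D-CCR-2b. The gain in recall is too small to offset the loss in precision.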

50. 【2603.16219】SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation

Link: https://arxiv.org/abs/2603.16219

Authors: Hang Lv, Sheng Liang, Hao Wang, Yongyue Zhang, Hongchao Gu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen

Categories: Computation and Language (cs.CL)

Keywords: centralized large language, raises privacy concerns, Realizing personalized intelligence, large language models, language models raises

Comments:

Click to view abstract

Abstract:Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft-Verify-Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.
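
For readers unfamiliar with the speculative decoding that SpecSteer repurposes, the sketch below shows the standard acceptance test from speculative sampling: a drafted token is accepted with probability min(1, p_target / p_draft). This is the textbook rule, not SpecSteer's ratio-based verifier, whose exact form the abstract does not give. The recovery step after a rejection is omitted, and all names and probabilities are illustrative.

```python
import random

def accept_draft_token(p_target, p_draft, rng):
    """Standard speculative-sampling test: accept with probability min(1, p_target / p_draft)."""
    return rng.random() < min(1.0, p_target / p_draft)

def verify_draft(draft_tokens, target_probs, draft_probs, rng):
    """Return the accepted prefix of the draft, stopping at the first rejection."""
    accepted = []
    for token, p_t, p_d in zip(draft_tokens, target_probs, draft_probs):
        if not accept_draft_token(p_t, p_d, rng):
            break
        accepted.append(token)
    return accepted

# Hypothetical per-token probabilities under the large (target) and small (draft) models.
accepted = verify_draft(
    ["a", "b", "c"],
    target_probs=[0.5, 0.9, 0.0],
    draft_probs=[0.4, 0.3, 0.5],
    rng=random.Random(0),
)
```

The first two tokens are always accepted because the target assigns them at least as much probability as the draft. The third is always rejected because the target assigns it zero probability.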

51. 【2603.16206】Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

Link: https://arxiv.org/abs/2603.16206

Authors: Yongyu Mu, Jiali Zeng, Fandong Meng, JingBo Zhu, Tong Xiao

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, encouraging self-exploration, reinforcement learning, verifiable rewards, language models

Comments: Work in progress

Click to view abstract

Abstract:Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.

52. 【2603.16197】Are Large Language Models Truly Smarter Than Humans?

Link: https://arxiv.org/abs/2603.16197

Authors: Eshwar Reddy M, Sourav Karmakar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: surpass human experts, spanning academic knowledge, leaderboards increasingly suggest, Public leaderboards increasingly, benchmarks spanning academic

Comments: 15 pages, 2 figures, 7 tables

Click to view abstract

Abstract:Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.

53. 【2603.16192】Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Link: https://arxiv.org/abs/2603.16192

Authors: Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen

Categories: Computation and Language (cs.CL)

Keywords: Modern LLMs employ, recover obfuscated malicious, jailbreak attacks ineffective, Structured Semantic Cloaking, surface-level input filtering

Comments: 15 pages

Click to view abstract

Abstract:Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

54. 【2603.16184】Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Link: https://arxiv.org/abs/2603.16184

Authors: Quy-Anh Dang, Chris Ngo

Categories: Computation and Language (cs.CL)

Keywords: automatic speech recognition, covering English, landscape of Singapore, compact multilingual automatic, multilingual automatic speech

Comments:

Click to view abstract

Abstract:We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of \$81 on a single RTX PRO 6000 GPU compared to \$18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
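
A balanced sampling strategy that equalizes the number of training utterances per language could look like the sketch below. The abstract does not say whether balancing is done by downsampling, upsampling, or weighted sampling. Downsampling every language to the size of the smallest one is our assumption, and the function name and toy data are hypothetical.

```python
import random

def balance_by_language(utterances, seed=0):
    """Downsample every language to the size of the smallest one (assumed strategy)."""
    by_lang = {}
    for lang, utt in utterances:
        by_lang.setdefault(lang, []).append(utt)
    target = min(len(items) for items in by_lang.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for lang in sorted(by_lang):
        for utt in rng.sample(by_lang[lang], target):
            balanced.append((lang, utt))
    return balanced

# Toy corpus with an uneven language distribution (language codes are illustrative).
utterances = (
    [("en", f"en_{i}") for i in range(5)]
    + [("zh", f"zh_{i}") for i in range(4)]
    + [("ms", f"ms_{i}") for i in range(3)]
    + [("ta", f"ta_{i}") for i in range(2)]
)

balanced = balance_by_language(utterances)
```

Note that the returned pairs keep the language tag only for bookkeeping. The abstract states that the models themselves are trained without language-tag conditioning.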

55. 【2603.16169】Open-Source Reproduction and Explainability Analysis of Corrective Retrieval Augmented Generation

Link: https://arxiv.org/abs/2603.16169

Authors: Surya Vardhan Yalavarthi

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval Augmented Generation, Corrective Retrieval Augmented, triggering corrective actions, Augmented Generation, evaluating retrieved document

Comments: 13 pages, 4 figures

Abstract:Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open-source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini-4k-instruct. We evaluate on PopQA and ARC-Challenge, demonstrating that our open-source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG's T5-based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at this https URL.

56. 【2603.16163】STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Link: https://arxiv.org/abs/2603.16163

Authors: Suvajit Patra, Soumitra Samanta

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Continuous Sign Language, Sign Language Recognition, Continuous Sign, Language Recognition, Sign Language

Comments:

Abstract:Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.

57. 【2603.16152】HIPO: Instruction Hierarchy via Constrained Reinforcement Learning

Link: https://arxiv.org/abs/2603.16152

Authors: Keru Chen, Jun Luo, Sen Lin, Yingbin Liang, Alvaro Velasquez, Nathaniel Bastian, Shaofeng Zou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Hierarchical Instruction, prompting large language, large language models, prompting large, large language

Comments: 9 pages + appendix. Under review

Abstract:Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail in this problem since they mainly optimize for a single objective, failing to explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce HIPO, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. HIPO elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that HIPO significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.
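
The primal-dual idea mentioned in this abstract can be sketched on a toy scalar problem. This is not the paper's algorithm: the "policy" is a single number `theta`, the utility `theta - 0.5 * theta**2` and the violation cost `theta` are made-up stand-ins for user utility and system-prompt violation, and the budget of 0.2 is arbitrary. The sketch only shows the generic pattern of ascending the Lagrangian in the policy while a multiplier rises whenever the constraint is violated.

```python
def primal_dual(steps=2000, lr_theta=0.1, lr_lambda=0.1, budget=0.2):
    """Toy primal-dual ascent: maximize utility(theta) s.t. cost(theta) <= budget.

    Assumed toy functions: utility(theta) = theta - 0.5 * theta**2 and
    cost(theta) = theta, so the unconstrained optimum (theta = 1) is infeasible.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on L = utility - lam * (cost - budget).
        grad_theta = (1.0 - theta) - lam
        theta += lr_theta * grad_theta
        # Dual step: raise lam while the constraint is violated, keep lam >= 0.
        lam = max(0.0, lam + lr_lambda * (theta - budget))
    return theta, lam


theta, lam = primal_dual()
print(round(theta, 4), round(lam, 4))
```

For this toy problem the constrained optimum sits on the boundary, `theta = 0.2`, and the stationarity condition `1 - theta - lam = 0` gives a multiplier of 0.8, which is where the iteration settles.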

58. 【2603.16142】Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Link: https://arxiv.org/abs/2603.16142

Authors: Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, slow human surveys, language models, offering a promising

Comments: 16 pages, 9 figures

Abstract:Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at this https URL.
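
The abstract reports fidelity as KL divergence between simulated and real survey answer distributions. A minimal sketch of that measurement follows; the three distributions are invented numbers for a single three-option question, and the direction KL(survey || simulated) is an assumption, since the abstract does not say which direction is used.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same options."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up answer shares for one three-option survey question.
survey = [0.5, 0.3, 0.2]        # real respondents
collapsed = [0.9, 0.05, 0.05]   # overly homogeneous simulation
diverse = [0.45, 0.35, 0.2]     # simulation closer to the survey

kl_collapsed = kl_divergence(survey, collapsed)
kl_diverse = kl_divergence(survey, diverse)
print(kl_collapsed, kl_diverse)
```

A lower value means the simulated distribution is closer to the survey, so the homogeneous simulation scores worse than the more varied one.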

59. 【2603.16138】Answer Bubbles: Information Exposure in AI-Mediated Search

Link: https://arxiv.org/abs/2603.16138

Authors: Michelle Huang, Agam Goyal, Koustuv Saha, Eshwar Chandrasekharan

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: increasingly replacing link-based, replacing link-based retrieval, Generative search systems, increasingly replacing, replacing link-based

Comments: Preprint: 12 pages, 2 figures, 6 tables

Abstract:Generative search systems are increasingly replacing link-based retrieval with AI-generated summaries, yet little is known about how these systems differ in sources, language, and fidelity to cited material. We examine responses to 11,000 real search queries across four systems -- vanilla GPT, Search GPT, Google AI Overviews, and traditional Google Search -- at three levels: source diversity, linguistic characterization of the generated summary, and source-summary fidelity. We find that generative search systems exhibit significant source-selection biases in their citations, favoring certain sources over others. Incorporating search also selectively attenuates epistemic markers, reducing hedging by up to 60% while preserving confidence language in the AI-generated summaries. At the same time, AI summaries further compound the citation biases: Wikipedia and longer sources are disproportionately overrepresented, whereas cited social media content and negatively framed sources are substantially underrepresented. Our findings highlight the potential for answer bubbles, in which identical queries yield structurally different information realities across systems, with implications for user trust, source visibility, and the transparency of AI-mediated information access.

60. 【2603.16137】SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Link: https://arxiv.org/abs/2603.16137

Authors: Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang

Categories: Computation and Language (cs.CL)

Keywords: enabling intent-aware recommendations, Large language models, models offer transformative, offer transformative potential, language models offer

Comments:

Abstract:Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA, a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes a high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware this http URL then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at this http URL, China's largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.

61. 【2603.16131】SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

Link: https://arxiv.org/abs/2603.16131

Authors: Han Jang, Junhyeok Lee, Kyu Sung Choi

Categories: Computation and Language (cs.CL)

Keywords: unprecedented information overload, created unprecedented information, information overload, increasing the demand, explosive growth

Comments: 12 pages, 7 figures, Submitted to KDD 2026

Abstract:The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (this https URL) and Hugging Face (this https URL), respectively.

62. 【2603.16128】Social Simulacra in the Wild: AI Agent Communities on Moltbook

Link: https://arxiv.org/abs/2603.16128

Authors: Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

Categories: Computation and Language (cs.CL)

Keywords: populate social platforms, increasingly populate social, autonomous LLM-based agents, LLM-based agents increasingly, agents increasingly populate

Comments: Preprint: 12 pages, 4 figures, 5 tables

Abstract:As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.
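
Participation inequality in this abstract is summarized with a Gini coefficient. The sketch below computes it from per-author post counts using the standard sorted-rank formula; the example counts are invented, and the abstract does not state which estimator the authors used.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, ranks starting at 1.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Made-up posts-per-author vectors for two four-author communities.
even_community = [1, 1, 1, 1]
skewed_community = [0, 0, 0, 10]
print(gini(even_community), gini(skewed_community))
```

With four authors the coefficient cannot exceed 0.75, which is the value reached when a single author writes everything.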

63. 【2603.16127】Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Link: https://arxiv.org/abs/2603.16127

Authors: Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: supervised fine-tuning, learning rate scheduling, learning rate, investigate the role, large language

Comments: 25 pages, accepted by ICLR 2026 as a conference paper

Abstract:We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
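
The difference between Warmup-Stable-Only and a decay-based schedule can be made concrete with a small sketch. Linear warmup, cosine decay as the comparison schedule, and all numeric values below are assumptions chosen for illustration; the abstract does not specify the warmup shape or which decay schedules were compared.

```python
import math


def wso_lr(step, peak_lr, warmup_steps):
    """Warmup-Stable-Only: linear warmup, then a constant learning rate."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Decay-based comparison: linear warmup followed by cosine decay."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical settings: peak 1e-3, 10 warmup steps, 1000 total steps.
for step in (0, 9, 500, 1000):
    print(step, wso_lr(step, 1e-3, 10), cosine_lr(step, 1e-3, 10, 1000))
```

The two schedules agree during warmup and diverge afterwards: the WSO rate stays at the peak until the final step, while the cosine rate falls toward `min_lr`.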

64. 【2603.16124】SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Link: https://arxiv.org/abs/2603.16124

Authors: Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: repository-level code understanding, field lacks reliable, lacks reliable benchmarks, software engineering tasks, Agentic repository-level code

Comments:

Abstract:Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.

65. 【2603.16120】Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Link: https://arxiv.org/abs/2603.16120

Authors: Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Eunsol Choi, Jordan Lee Boyd-Graber, Aakanksha Naik

Categories: Computation and Language (cs.CL)

Keywords: ballooning publishing counts, Deep Research, publishing counts, researchers cope, cope with ballooning

Comments: Under Review

Abstract:Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

66. 【2603.16112】ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Link: https://arxiv.org/abs/2603.16112

Authors: Tik Yu Yim, Wenting Tan, Sum Yee Chan, Tak-Wah Lam, Siu Ming Yiu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Keywords: Adapting large language, produces model-locked expertise, typically requires expensive, requires expensive fine-tuning, Adapting large

Comments:

Abstract:Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model's failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.

67. 【2603.16105】Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Link: https://arxiv.org/abs/2603.16105

Authors: Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Post-training model compression, portability of Large, Post-training model

Comments:

Abstract:Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while ZipCal is on average about 240x faster due to its tractable linear complexity. We make the code and the experiments available at this https URL.

68. 【2603.16091】CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

Link: https://arxiv.org/abs/2603.16091

Authors: Tianyi Huang, Ying Kai Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: system retrieves relevant, factual question answering, retrieves relevant evidence, failures of commitment, retrieval-grounded question answering

Comments:

Abstract:In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.
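
The draft, counterevidence, and KEEP/REVISE steps described in this abstract can be sketched as a small control loop. Everything here is hypothetical scaffolding: `retrieve`, `draft_answer`, `propose_revision`, and `validate` are stand-in callables, the follow-up query is formed by simple concatenation, and the validation rule (the revised answer must appear verbatim in the evidence) is only one possible deterministic check, not the one used by the authors.

```python
def counter_refine(question, retrieve, draft_answer, propose_revision, validate):
    """Draft an answer, retrieve again conditioned on it, then keep or revise."""
    evidence = retrieve(question)
    draft = draft_answer(question, evidence)
    # Follow-up retrieval is conditioned on the draft answer (assumed format).
    extra = retrieve(f"{question} {draft}")
    all_evidence = evidence + extra
    decision, candidate = propose_revision(question, draft, all_evidence)
    # A revision is accepted only if it passes the deterministic check.
    if decision == "REVISE" and validate(candidate, all_evidence):
        return candidate
    return draft


# Toy stand-ins for the retriever and the model calls (all made up).
docs = {
    "capital of Australia": ["Sydney is the largest city in Australia."],
    "capital of Australia Sydney": ["Canberra is the capital of Australia, not Sydney."],
}

def retrieve(query):
    return docs.get(query, [])

def draft_answer(question, evidence):
    return "Sydney"

def propose_revision(question, draft, evidence):
    if any("Canberra is the capital" in text for text in evidence):
        return "REVISE", "Canberra"
    return "KEEP", draft

def validate(candidate, evidence):
    return any(candidate in text for text in evidence)

answer = counter_refine("capital of Australia", retrieve, draft_answer,
                        propose_revision, validate)
print(answer)
```

Returning the draft whenever validation fails keeps the refinement step restricted: the second pass can overturn the first answer only when the evidence supports the change.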

69. 【2603.16073】ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

Link: https://arxiv.org/abs/2603.16073

Authors: Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: ACL Anthology papers, claims, NLP, claim

Comments:

Abstract:Scientific papers do more than report results: they advance claims that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce ClaimFlow, a claim-centric view of the NLP literature, built from 304 ACL Anthology papers (1979-2025) that are manually annotated with 1,084 claims and 832 cross-paper claim relations, indicating whether a citing paper supports, extends, qualifies, refutes, or references a claim as background. Using ClaimFlow, we define a new task, Claim Relation Classification, which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of 0.78 macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to about 13k NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that 63.5% of claims are never reused; only 11.1% are ever challenged; meanwhile, widely propagated claims are more often reshaped through qualification and extension than directly confirmed or refuted. Overall, ClaimFlow offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.

70. 【2603.16070】SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia

Link: https://arxiv.org/abs/2603.16070

Authors: Ri Chi Ng, Aditi Kumaresan, Yujia Hu, Roy Ka-Wei Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: English and Chinese, diverse socio-linguistic contexts, socio-linguistic contexts complicate, Southeast Asia, platforms developing tools

Comments: TALLIP Accepted

Abstract:Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck's functional testing framework and refining SGHateCheck's methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models' struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

71. 【2603.16068】Resource Consumption Threats in Large Language Models

Link: https://arxiv.org/abs/2603.16068

Authors: Yuanhe Zhang, Xinyue Wang, Zhican Chen, Weiliu Wang, Zilu Zhang, Zhengshuo Gong, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: costly computational infrastructure, large language models, computational infrastructure, limited and costly, costly computational

Comments:

Abstract:Given limited and costly computational infrastructure, resource efficiency is a key requirement for large language models (LLMs). Efficient LLMs increase service capacity for providers and reduce latency and API costs for users. Recent resource consumption threats induce excessive generation, degrading model efficiency and harming both service availability and economic sustainability. This survey presents a systematic review of threats to resource consumption in LLMs. We further establish a unified view of this emerging area by clarifying its scope and examining the problem along the full pipeline from threat induction to mechanism understanding and mitigation. Our goal is to clarify the problem landscape for this emerging area, thereby providing a clearer foundation for characterization and mitigation.

72. 【2603.16039】Residual Stream Duality in Modern Transformer Architectures

Link: https://arxiv.org/abs/2603.16039

Authors: Yifan Zhang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: mere optimization plumbing, model representational machinery, optimization plumbing, representational machinery, work has made

Comments: Project Page: [this https URL](https://github.com/yifanzhang-pro/residual-stream-duality)

Abstract:Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.

73. 【2603.16017】Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Link: https://arxiv.org/abs/2603.16017

Authors: Fan Huang, Haewoon Kwak, Jisun An

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, morally sensitive decision-making, Large language, increasingly participate, sensitive decision-making

Abstract:Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce \textit{moral reasoning trajectories}, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4--57.7\% of consecutive steps involve framework switches, and only 16.4--17.8\% of trajectories remain framework-consistent. Unstable trajectories remain 1.29$\times$ more susceptible to persuasive attacks ($p=0.015$). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8--22.6\% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7--8.9\% drift reduction) and amplifies the stability--accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly ($r=0.715$, $p<0.0001$) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity $= 0.859$).
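
To make the two trajectory statistics in this abstract concrete, here is a minimal sketch of how a switch rate and a consistency check could be computed for a single trajectory. The framework labels and function names are hypothetical, and the paper's exact definitions and aggregation across trajectories may differ.

```python
def switch_rate(trajectory):
    """Fraction of consecutive step pairs whose framework label changes."""
    pairs = list(zip(trajectory, trajectory[1:]))
    if not pairs:
        return 0.0  # a trajectory with fewer than two steps has no switches
    return sum(a != b for a, b in pairs) / len(pairs)


def is_framework_consistent(trajectory):
    """True when every step invokes the same framework."""
    return len(set(trajectory)) <= 1


# Hypothetical framework labels for one reasoning trace.
example = ["deontology", "deontology", "utilitarianism", "virtue"]
print(switch_rate(example), is_framework_consistent(example))
```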

74. 【2603.16011】Evaluating Agentic Optimization on Large Codebases

Link: https://arxiv.org/abs/2603.16011

Authors: Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language model, coding agents increasingly, language model, repository level, agents increasingly operate

Comments: Preprint version

Abstract:Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of the ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: this https URL

75. 【2603.16002】RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation

Link: https://arxiv.org/abs/2603.16002

Authors: Saisha Pradeep Shetty, Roger Eric Goldman, Vladimir Filkov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Radiology report annotation, clinical NLP, Radiology report, slow and costly, annotation is essential

Comments: 10 pages, 3 figures. Accepted at AMIA Amplify Informatics Summit 2026

Abstract:Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
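
The confidence-based routing described in this abstract can be illustrated with a small sketch. The abstract does not say how the thresholds are learned, so the selection rule below (lowest threshold whose auto-accepted set meets a target accuracy on held-out data) is an assumption, as are all names; the paper reports an entity match score rather than plain accuracy.

```python
def select_threshold(confidences, is_correct, target_accuracy):
    """Return the lowest confidence threshold whose auto-accepted set
    reaches the target accuracy, or None if no threshold qualifies.
    (Assumed selection rule, not taken from the paper.)"""
    for threshold in sorted(set(confidences)):
        accepted = [ok for c, ok in zip(confidences, is_correct) if c >= threshold]
        if sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return None


def route(confidence, threshold):
    """Auto-annotate confident predictions; send the rest to an expert."""
    if threshold is not None and confidence >= threshold:
        return "auto"
    return "expert_review"


# Hypothetical validation data for one entity type.
confs = [0.95, 0.90, 0.80, 0.60, 0.40]
correct = [True, True, False, True, False]
threshold = select_threshold(confs, correct, target_accuracy=0.9)
print(threshold, [route(c, threshold) for c in confs])
```

Running the selection separately per entity type would yield the entity-specific thresholds the abstract mentions.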

76. 【2603.16001】Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

Link: https://arxiv.org/abs/2603.16001

Authors: Sijie Li, Biao Qian, Jungong Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Vision-Language Models, enabling lightweight Large, lightweight Large Vision-Language, Vision-Language Models, lightweight Large

Comments: CVPR 2026. Code available here: [this https URL](https://github.com/LezJ/ATV-Pruning)

Abstract:Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.

77. 【2603.15998】NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time -- A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026

Link: https://arxiv.org/abs/2603.15998

Authors: David Nordfors

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: systems can track, form and evolve, evolve faster, faster than classification, classification systems

Comments: 37 pages, 5 figures

Abstract:Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an "AI Engineer" occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.

78. 【2603.15997】Visual Set Program Synthesizer

Link: https://arxiv.org/abs/2603.15997

Authors: Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun

Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Symbolic Computation (cs.SC)

Keywords: poses a difficult, user pointing, pointing their phone, supermarket shelf, difficult challenge

Comments: 10 pages, IEEE International Conference on Multimedia and Expo 2026

Abstract:A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.

79. 【2603.15981】Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Link: https://arxiv.org/abs/2603.15981

Authors: Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze, Ahmed A Aly, Anuj Kumar, Ariya Rastrow, Zhaojiang Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, Speech large language, non-verbal sounds, observe paralinguistic cues, large language

Abstract:Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

80. 【2603.15969】Robust Language Identification for Romansh Varieties

Link: https://arxiv.org/abs/2603.15969

Authors: Charlotte Model, Sina Ahmadi, Jannis Vamvas

Categories: Computation and Language (cs.CL)

Keywords: limited mutual intelligibility, regional varieties, mutual intelligibility, limited mutual, recognize Rumantsch Grischun

Abstract:The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.

81. 【2603.15968】MAC: Multi-Agent Constitution Learning

Link: https://arxiv.org/abs/2603.15968

Authors: Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: control LLMs based, natural language, oversee and control, control LLMs, LLMs based

Comments: Code: [this https URL](https://github.com/rushil-thareja/MAC-Multi-Agent-Constitution-Learning) | PyPI: [this https URL](https://pypi.org/project/mac-prompt/) | Website: [this https URL](https://www.mac-prompt.com/)

Abstract:Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.

82. 【2603.15965】MoLoRA: Composable Specialization via Per-Token Adapter Routing

Link: https://arxiv.org/abs/2603.15965

Authors: Shrey Shah, Justin Wagle

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multi-adapter serving systems, Multi-adapter serving, span multiple domains, serving systems route, systems route entire

Abstract:Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like "write code to solve this equation," which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work $N$ for $N$ tokens versus $K \cdot N$ for per-sequence routing with $K$ adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.
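
As an illustration of the vocabulary-based variant of per-token routing, the sketch below sends each token to exactly one low-rank adapter, so $N$ tokens cost $N$ adapter applications. It is a toy under stated assumptions, not MoLoRA's implementation: the adapter names, the `image_vocab_start` boundary, and the choice to add the low-rank update directly to the hidden vector (rather than to a base layer's output) are all hypothetical simplifications.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRAAdapter:
    """Low-rank update delta(x) = scale * B (A x)."""

    def __init__(self, A, B, scale=1.0):
        self.A, self.B, self.scale = A, B, scale

    def delta(self, x):
        return [self.scale * y for y in matvec(self.B, matvec(self.A, x))]


def route_by_vocab(token_id, image_vocab_start):
    """Assumed vocabulary layout: ids at or above the boundary are image tokens."""
    return "image" if token_id >= image_vocab_start else "text"


def apply_per_token(hidden, token_ids, adapters, image_vocab_start):
    """Apply exactly one adapter per token, chosen from the token id."""
    out = []
    for h, token_id in zip(hidden, token_ids):
        adapter = adapters[route_by_vocab(token_id, image_vocab_start)]
        out.append([a + b for a, b in zip(h, adapter.delta(h))])
    return out


adapters = {
    "text": LoRAAdapter(A=[[1, 0]], B=[[1], [0]]),   # rank-1, touches dimension 0
    "image": LoRAAdapter(A=[[0, 1]], B=[[0], [2]]),  # rank-1, touches dimension 1
}
print(apply_per_token([[1, 2], [3, 4]], [5, 100], adapters, image_vocab_start=50))
```

A learned gate would replace `route_by_vocab` with a small classifier over the hidden state; the rest of the loop would stay the same.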

83. 【2603.15953】A Family of LLMs Liberated from Static Vocabularies

Link: https://arxiv.org/abs/2603.15953

Authors: Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: natural language processing, current large language, convert raw text, processable units, central component

Abstract:Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

84. 【2603.15950】POLAR: A Per-User Association Test in Embedding Space

Link: https://arxiv.org/abs/2603.15950

Authors: Pedro Bento, Arthur Buzelin, Arthur Chagas, Yan Aquino, Victoria Estanislau, Samira Malaquias, Pedro Robles Dutenhefner, Gisele L. Pappa, Virgilio Almeida, Wagner Meira Jr

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)

Keywords: obscuring author-level variation, intrinsic association probes, association probes operate, Lexical Association Report, On-axis Lexical Association

Comments: Accepted paper at ICWSM 2026

Abstract:Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini--Hochberg control. On a balanced bot--human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at this https URL.
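
The two statistical ingredients named in this abstract, permutation p-values and Benjamini-Hochberg control, can be sketched generically as follows. This is the textbook form of both procedures, not POLAR's code; the test statistic (absolute difference in means) and all names are assumptions for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_pvalue(group_a, group_b, n_perm, rng):
    """Two-sided permutation p-value for a difference in means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the pooled scores at random
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/keep flag per hypothesis under BH step-up control."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff_rank
    return rejected


# Hypothetical axis projections for two groups of authors.
rng = random.Random(0)
p = permutation_pvalue([5.0, 5.1, 4.9, 5.2, 5.0, 5.1],
                       [1.0, 1.2, 0.9, 1.1, 1.0, 0.8], n_perm=999, rng=rng)
print(p, benjamini_hochberg([p, 0.5, 0.8]))
```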

85. 【2603.15949】BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction

Link: https://arxiv.org/abs/2603.15949

Authors: Tanvir Ahmed Sijan, S. M Golam Rifat, Pankaj Chowdhury Partha, Md. Tanjeed Islam, Md. Musfique Anwar

Categories: Computation and Language (cs.CL)

Keywords: strong multilingual fluency, demonstrated strong multilingual, Large Language Models, Large Language, multilingual fluency

Comments: Under Review

Abstract:Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

86. 【2603.15936】CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses

Link: https://arxiv.org/abs/2603.15936

Authors: Jeffery L. Painter, François Haguinet, Andrew Bate

Categories: Computation and Language (cs.CL)

Keywords: heterogeneous adverse event, limit systematic pharmacovigilance, largest publicly accessible, publicly accessible registry, terminology limit systematic

Comments: 10 pages, 2 figures. Submitted to the 2026 AMIA Annual Symposium

Abstract:ClinicalTrials.gov (CT.gov) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete CT.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.
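
A deterministic exact-then-fuzzy normalization step of the kind this abstract mentions might look like the sketch below. It uses Python's standard `difflib` as a stand-in matcher; the toy vocabulary, the 0.85 cutoff, and the function name are assumptions, and the actual CTG-DB matching rules are not described in the abstract.

```python
import difflib

# Hypothetical toy vocabulary; real MedDRA terms are licensed and not reproduced here.
STANDARD_TERMS = ["headache", "nausea", "dizziness", "fatigue"]


def normalize_term(raw, vocabulary=STANDARD_TERMS, cutoff=0.85):
    """Map free-text adverse event wording to a standard term.

    Returns (term, method), where method records how the match was made
    so that every mapping stays auditable.
    """
    cleaned = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    if cleaned in vocabulary:
        return cleaned, "exact"
    candidates = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "fuzzy"
    return None, "unmatched"


for text in ["  Headache ", "headach", "xyzzy"]:
    print(repr(text), normalize_term(text))
```

Recording the match method alongside the mapped term is one simple way to keep fuzzy matches reviewable.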

87. 【2603.15922】Machine Translation in the Wild: User Reaction to Xiaohongshu's Built-In Translation Feature

Link: https://arxiv.org/abs/2603.15922

Authors: Sui He

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: social media platforms, linguistic boundaries, growing integration, integration of machine, social media

Abstract:The growing integration of machine translation into social media platforms is transforming how users interact with each other across cultural and linguistic boundaries. This paper examines user reactions to the launch of Xiaohongshu's built-in translation feature in January 2025. Drawing on a dataset of 6,723 comments collected from 11 official posts promoting the translation function, this paper combines sentiment analysis with thematic analysis to investigate how users perceived and experimented with the function. Results show that reactions were generally positive, particularly for translating posts and comments, although concerns regarding functionality, accessibility, and translation accuracy were also expressed. In addition to evaluative feedback, users actively tested the function with diverse inputs, including words and phrases in English and Chinese, abbreviations in pinyin, internet slang, and other language forms such as emoji, kaomoji, and coded texts. The findings highlight the importance of closer collaboration among computer scientists, translation scholars, and platform designers to better understand and improve translation technologies in real-world communicative contexts.

88. 【2603.15909】Prompt Engineering for Scale Development in Generative Psychometrics

Link: https://arxiv.org/abs/2603.15909

Authors: Lara Lee Russell-Lasalandra, Hudson Golino

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Monte Carlo simulation, Carlo simulation examines, Monte Carlo, generated personality assessment, Carlo simulation

Comments: 22 pages, 7 figures

Abstract:This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)--generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pools, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model--prompt interactions in generative psychometric pipelines.

89. 【2603.15903】Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies

Link: https://arxiv.org/abs/2603.15903

Authors: Nathaniel Imel, Richard Futrell, Michael Franke, Noga Zaslavsky

Categories: Computation and Language (cs.CL)

Keywords: Information Bottleneck, efficiently compress meanings, optimizing the Information, Natural languages, argued to evolve

Abstract:Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language's vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model -- namely, those that regulate precision in these games, as well as players' tendency to confuse similar states -- lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.

90. 【2603.15897】COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives

Link: https://arxiv.org/abs/2603.15897

Authors: Azwad Anjum Islam, Tisa Islam Erana

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Likert scale, Spearman Rank Correlation, Rank Correlation, requires rating, rating the plausibility

Comments: System description paper in SemEval-2026, Task 5

Abstract:We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman's rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
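
The ensemble and scoring setup described in this abstract can be sketched in a few lines: average the per-item ratings of several models, then score with accuracy (prediction within one standard deviation of the mean human rating) and Spearman's rho. This is a minimal illustration with made-up numbers; the official task scorer may handle edge cases such as zero standard deviation differently.

```python
from statistics import mean


def ensemble_average(predictions_per_model):
    """Average ratings item by item across models."""
    return [mean(column) for column in zip(*predictions_per_model)]


def accuracy_within_std(predictions, human_mean, human_std):
    """Share of predictions within one std of the mean human rating."""
    hits = sum(abs(p - m) <= s
               for p, m, s in zip(predictions, human_mean, human_std))
    return hits / len(predictions)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ratings from two models on three items (1-5 Likert scale).
ensemble = ensemble_average([[1, 3, 5], [3, 3, 3]])
acc = accuracy_within_std(ensemble, [2.0, 4.5, 4.0], [0.5, 1.0, 0.5])
rho = spearman(ensemble, [2.0, 4.5, 4.0])
print(ensemble, acc, rho, (acc + rho) / 2)
```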

91. 【2603.15892】Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN

Link: https://arxiv.org/abs/2603.15892

Authors: Ritajit Dey, Iadh Ounis, Graham McDonald, Yashar Moshfeghi

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, fact conflicts due, training data, temporal facts

Abstract:Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model's output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Importantly, using an LLM, we synthetically generate realistic natural language contexts to replace MULAN's programmatically constructed statements when reproducing the findings of DYNAMICQA. Our analysis reveals strong dataset dependence: MULAN's findings generalise under both methodological frameworks, whereas applying MULAN's evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies only considered 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how model size influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts.

92. 【2603.15854】FlashSampling: Fast and Memory-Efficient Exact Sampling

Link: https://arxiv.org/abs/2603.15854

Authors: Tomas Ruiz,Zhen Qin,Yifan Zhang,Xuyang Shen,Yiran Zhong,Mengdi Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: triggers extra memory, extra memory traffic, triggers extra, extra memory, large-vocabulary decoding

Comments: Project Page: [this https URL](https://github.com/FlashSampling/FlashSampling)

Abstract:Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because $\arg\max$ decomposes over a partition; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel-level decode workloads, and in end-to-end vLLM experiments, it reduces time per output token by up to 19% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.
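
The tile-wise Gumbel-max procedure described in this abstract can be illustrated with a small CPU sketch. This is not the authors' fused GPU kernel: `sample_tiled`, `gumbel`, and the list-of-lists weight layout are hypothetical names and choices made for illustration only. It shows why keeping one noisy maximizer per vocabulary tile and then reducing over tiles still yields an exact sample from the softmax distribution.

```python
import math
import random


def gumbel(rng):
    # Standard Gumbel(0, 1) noise: -log(E) with E ~ Exp(1).
    # The max() guard avoids log(0) in the (vanishingly rare) case E == 0.
    return -math.log(max(rng.expovariate(1.0), 1e-300))


def sample_tiled(hidden, weight, tile_size, rng):
    """Draw one token id from softmax(weight @ hidden), one vocabulary tile at a time.

    hidden: list of floats (one hidden state); weight: list of rows, one per token.
    Only one (noisy_logit, token_id) pair is kept per tile, so the full logits
    vector is never stored.
    """
    tile_winners = []
    for start in range(0, len(weight), tile_size):
        best = (-math.inf, -1)
        for token_id in range(start, min(start + tile_size, len(weight))):
            logit = sum(w * h for w, h in zip(weight[token_id], hidden))
            best = max(best, (logit + gumbel(rng), token_id))
        tile_winners.append(best)
    # Final small reduction: the max over tile winners is the global argmax,
    # because argmax decomposes over a partition of the vocabulary.
    return max(tile_winners)[1]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.3, 0.15, 0.05]
    weight = [[math.log(p)] for p in probs]  # logits equal log-probabilities
    print(sample_tiled([1.0], weight, tile_size=3, rng=rng))
```

Repeated draws from `sample_tiled` should reproduce the target probabilities regardless of `tile_size`, which is the exactness property the abstract refers to.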

93. 【2603.15840】When Stability Fails: Hidden Failure Modes Of LLMS in Data-Constrained Scientific Decision-Making

Link: https://arxiv.org/abs/2603.15840

Authors: Nazia Riasat

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: Large language models, Large language, data-constrained scientific workflows, language models, decision-support tools

Comments: 13 pages, 5 figures. Accepted at ICLR 2026 Workshop: I Can't Believe It's Not Better (ICBINB 2026). OpenReview: [this https URL](https://openreview.net/pdf?id=vf8vs2ibso)

Abstract:Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our experiments show that LLMs can exhibit near-perfect run-to-run stability while systematically diverging from statistical ground truth, over-selecting under relaxed thresholds, responding sharply to minor prompt wording changes, or producing syntactically plausible gene identifiers absent from the input table. Although stability reflects robustness across repeated runs, it does not guarantee agreement with statistical ground truth in structured scientific decision tasks. These findings highlight the importance of explicit ground-truth validation and output validity checks when deploying LLMs in automated or semi-automated scientific workflows.

94. 【2603.15831】Persona-Conditioned Risk Behavior in Large Language Models: A Simulated Gambling Study with GPT-4.1

Link: https://arxiv.org/abs/2603.15831

Authors: Sankalp Dubedy

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: sequential decision-making contexts, sequential decision-making, decision-making contexts, increasingly deployed, deployed as autonomous

Comments: 21 pages, 13 figures, 9 tables. Independent research. Submitted to arXiv for open dissemination

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents in uncertain, sequential decision-making contexts. Yet it remains poorly understood whether the behaviors they exhibit in such environments reflect principled cognitive patterns or simply surface-level prompt mimicry. This paper presents a controlled experiment in which GPT-4.1 was assigned one of three socioeconomic personas (Rich, Middle-income, and Poor) and placed in a structured slot-machine environment with three distinct machine configurations: Fair (50%), Biased Low (35%), and Streak (dynamic probability increasing after consecutive losses). Across 50 independent iterations per condition and 6,950 recorded decisions, we find that the model reproduces key behavioral signatures predicted by Kahneman and Tversky's Prospect Theory without being instructed to do so. The Poor persona played a mean of 37.4 rounds per session (SD=15.5) compared to 1.1 rounds for the Rich persona (SD=0.31), a difference that is highly significant (Kruskal-Wallis H=393.5, p < 2.2e-16). Risk scores by persona show large effect sizes (Cohen's d=4.15 for Poor vs Rich). Emotional labels appear to function as post-hoc annotations rather than decision drivers (chi-square=3205.4, Cramer's V=0.39), and belief-updating across rounds is negligible (Spearman rho=0.032 for Poor persona, p=0.016). These findings carry implications for LLM agent design, interpretability research, and the broader question of whether classical cognitive economic biases are implicitly encoded in large-scale pretrained language models.

95. 【2603.15800】Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Link: https://arxiv.org/abs/2603.15800

Authors: Ce Zhang,Jinxi He,Junyi He,Katia Sycara,Yaqi Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Large Language Models, Multi-modal Large Language, Large Language, Language Models, visual reasoning tasks

Comments: Accepted at CVPR 2026. Project page: [this https URL](https://echosafe-mllm.github.io)

Abstract:Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at this https URL.

96. 【2603.15773】Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

Link: https://arxiv.org/abs/2603.15773

Authors: Yara Alakeel,Chatrine Qwaider,Hanan Aldarmaki,Sawsan Alqahtani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, effectively large language, tokenization schemes represent, Arabic root-pattern morphology, capture genuine morphological

Comments: Accepted at LREC 2026

Abstract:This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. The Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, which questions the role of morphological tokenization in downstream performance.

97. 【2603.15726】MiroThinker-1.7 H1: Towards Heavy-Duty Research Agents via Verification

Link: https://arxiv.org/abs/2603.15726

Authors: MiroMind Team:S. Bai,L. Bing,L. Lei,R. Li,X. Li,X. Lin,E. Min,L. Su,B. Wang,L. Wang,L. Wang,S. Wang,X. Wang,Y. Zhang,Z. Zhang,G. Chen,L. Chen,Z. Cheng,Y. Deng,Z. Huang,D. Ng,J. Ni,Q. Ren,X. Tang,B.L. Wang,H. Wang,N. Wang,C. Wei,Q. Wu,J. Xia,Y. Xiao,H. Xu,X. Xu,C. Xue,Z. Yang,Z. Yang,F. Ye,H. Ye,J. Yu,C. Zhang,W. Zhang,H. Zhao,P. Zhu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: research agent designed, agent designed, reasoning, complex long-horizon reasoning, long-horizon reasoning tasks

Comments: 23 pages

Abstract:We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.

98. 【2603.15677】MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

Link: https://arxiv.org/abs/2603.15677

Authors: Eric Wu,Kevin Wu,Jason Hom,Paul H. Yi,Angela Zhang,Alejandro Lozano,Jeff Nirschl,Jeff Tangney,Kevin Byram,Braydon Dymm,Narender Annapureddy,Eric Topol,David Ouyang,James Zou

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, spanning clinical decision, clinical decision support, decision support

Comments:

Abstract:Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
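
As an illustration of how pairwise preferences like those collected by MedArena can be turned into a Bradley-Terry rating, the sketch below fits strengths with the classic minorization-maximization update. The abstract does not describe the authors' fitting procedure, so the function name `bradley_terry`, the win-matrix layout, and the iteration count are assumptions made here for illustration.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of times model i was preferred over model j.
    Assumes every model has at least one win, so all strengths stay positive.
    Returns strengths normalised to sum to 1 (higher means more preferred).
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


if __name__ == "__main__":
    # Hypothetical counts for three models; not data from the paper.
    print(bradley_terry([[0, 8, 9], [2, 0, 7], [1, 3, 0]]))
```

Under this model the probability that model i is preferred over model j is `strength[i] / (strength[i] + strength[j])`, so a ranking follows directly from sorting the fitted strengths.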

99. 【2603.15658】Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents

Link: https://arxiv.org/abs/2603.15658

Authors: Madhava Gaikwad

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: maintain multiple specialized, agents maintain multiple, introducing irrelevant context, multiple specialized stores, Memory-augmented agents maintain

Comments: accepted in ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems

Abstract:Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducing irrelevant context. We formulate memory retrieval as a store-routing problem and evaluate it using coverage, exact match, and token efficiency metrics. On downstream question answering, an oracle router achieves higher accuracy while using substantially fewer context tokens compared to uniform retrieval, demonstrating that selective retrieval improves both efficiency and performance. Our results show that routing decisions are a first-class component of memory-augmented agent design and motivate learned routing mechanisms for scalable multi-store systems. We additionally formalize store selection as a cost-sensitive decision problem that trades answer accuracy against retrieval cost, providing a principled interpretation of routing policies.

100. 【2603.15655】Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking

Link: https://arxiv.org/abs/2603.15655

Authors: Liu Hung Ming

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Multiagent Systems (cs.MA)

Keywords: develop private protocols, agents develop private, presents a critical, develop private, private protocols

Comments: 38 pages, includes 5 figures and 8 tables, preliminary version, AI safety / multi-agent reinforcement learning

Abstract:In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion -- where agents develop private protocols to evade monitoring -- presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function $A^\pi$, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner's Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms "Semantic Degradation," where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a "Transparency Paradox" where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart's Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems.

101. 【2603.15653】Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Link: https://arxiv.org/abs/2603.15653

Authors: Keivan Alizadeh,Parshin Shojaee,Minsik Cho,Mehrdad Farajtabar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Long-context handling remains, Long-context handling, Recursive Language Models, reliably extract, handling remains

Comments: preprint

Abstract:Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge agentically, decomposing long contexts into recursive sub-calls through programmatic interaction at inference. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective in semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.

102. 【2603.15646】Alternating Reinforcement Learning with Contextual Rubric Rewards

Link: https://arxiv.org/abs/2603.15646

Authors: Guangchen Lan

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: contextual rubric-based evaluations, conventional reinforcement learning, extends conventional reinforcement, Alternating Reinforcement Learning, Reinforcement Learning

Comments:

Abstract:Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).

103. 【2603.15644】Tokenization Tradeoffs in Structured EHR Foundation Models

Link: https://arxiv.org/abs/2603.15644

Authors: Lin Lawrence Guo,Santiago Eduardo Arciniegas,Joseph Jihyung Lee,Adam Paul Yan,George Tomlinson,Jason Fries,Lillian Sung

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: electronic health records, adaptable patient representations, structured electronic health, learn adaptable patient, health records

Comments:

Abstract:Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization -- how these timelines are converted into discrete model inputs -- determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. Here, we pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along event encoding, time encoding, and workflow annotation. We evaluated area-under-the-receiver-operating-characteristic curve across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives (73/74 and 71/74 tasks) while requiring 39.5% and 9.6% fewer pretraining floating-point operations, respectively. Targeted ablations traced the joint encoding advantage to local binding efficiency, that is, code-attribute pairs are combined into single tokens, rather than split across tokens that the model must learn to associate during pretraining. External evaluation on an adult intensive care unit cohort demonstrated that this advantage generalizes despite substantial vocabulary mismatch, while temporal and workflow effects remain institution-specific. These results establish tokenization as a tractable lever for improving both the performance and efficiency of EHR foundation models.

104. 【2603.14761】BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Link: https://arxiv.org/abs/2603.14761

Authors: Yuzhe Tang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, routinely fail questions, routinely fail, human would answer

Comments:

Abstract:Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints ("Should I walk or drive my rental car to the return lot?") to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models -- four from the Claude family and four from the GPT family -- using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.

Information Retrieval

1. 【2603.16415】IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Link: https://arxiv.org/abs/2603.16415

Authors: Zhenghua Bao,Yi Shi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: existing retrieval-augmented generation, Multi-hop question answering, iterative multi-step reasoning, question answering, retrieval-augmented generation

Comments:

2. 【2603.16354】PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Link: https://arxiv.org/abs/2603.16354

Authors: Hanif Rahman

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: remains severely underrepresented, underrepresented in NLP, language spoken, people that remains, remains severely

Comments:

3. 【2603.16236】ReFORM: Review-aggregated Profile Generation via LLM with Multi-Factor Attention for Restaurant Recommendation

Link: https://arxiv.org/abs/2603.16236

Authors: Moonsoo Park,Seulbeen Je,Donghyeon Park

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Graph Convolution Networks, Convolution Networks, Graph Convolution, large language models, generating descriptive summarization

Comments:

Abstract:In recommender systems, large language models (LLMs) have gained popularity for generating descriptive summarization to improve recommendation robustness, along with Graph Convolution Networks. However, existing LLM-enhanced recommendation studies mainly rely on the internal knowledge of LLMs about item titles while neglecting the importance of various factors influencing users' decisions. Although information reflecting various decision factors of each user is abundant in reviews, few studies have actively exploited such insights for recommendation. To address these limitations, we propose a ReFORM: Review-aggregated Profile Generation via LLM with Multi-FactOr Attentive RecoMmendation framework. Specifically, we first generate factor-specific user and item profiles from reviews using LLM to capture a user's preference by items and an item's evaluation by users. Then, we propose a Multi-Factor Attention to highlight the most influential factors in each user's decision-making process. In this paper, we conduct experiments on two restaurant datasets of varying scales, demonstrating its robustness and superior performance over state-of-the-art baselines. Furthermore, in-depth analyses validate the effectiveness of the proposed modules and provide insights into the sources of personalization. Our source code and datasets are available at this https URL.

4. 【2603.16171】MemX: A Local-First Long-Term Memory System for AI Assistants

Link: https://arxiv.org/abs/2603.16171

Authors: Lizheng Sun

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Reciprocal Rank Fusion, long-term memory system, local-first long-term memory, OpenAI-compatible embedding API, long-term memory

Comments: 18 pages, 2 figures, 13 tables

Abstract:We present MemX, a local-first long-term memory system for AI assistants with stability-oriented retrieval design. MemX is implemented in Rust on top of libSQL and an OpenAI-compatible embedding API, providing persistent, searchable, and explainable memory for conversational agents. Its retrieval pipeline applies vector recall, keyword recall, Reciprocal Rank Fusion (RRF), four-factor re-ranking, and a low-confidence rejection rule that suppresses spurious recalls when no answer exists in the memory store. We evaluate MemX on two axes. First, two custom Chinese-language benchmark suites (43 queries, ≤1,014 records) validate pipeline design: Hit@1=91.3% on a default scenario and 100% under high confusion, with conservative miss-query suppression. Second, the LongMemEval benchmark (500 queries, up to 220,349 records) quantifies system boundaries across four ability types and three storage granularities. At fact-level granularity the system reaches Hit@5=51.6% and MRR=0.380, doubling session-level performance, while temporal and multi-session reasoning remain challenging (≤43.6% Hit@5). FTS5 full-text indexing reduces keyword search latency by 1,100x at 100k-record scale, keeping end-to-end search under 90 ms. Unlike Mem0 and related work that targets end-to-end agent benchmarks, MemX focuses on a narrower, reproducible baseline: local-first deployment, structural simplicity, explainable retrieval, and stability-oriented design.
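
The Reciprocal Rank Fusion step named in this abstract can be sketched in a few lines. MemX itself is written in Rust and its exact constants are not given here, so this Python version, the function name `reciprocal_rank_fusion`, and the smoothing constant `k=60` (a commonly used default for RRF) are assumptions for illustration rather than the system's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of record ids into one ranking.

    Each record scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by record id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, record_id in enumerate(ranking, start=1):
            scores[record_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda record_id: (-scores[record_id], record_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical vector-recall results
    keyword_hits = ["b", "d", "a"]  # hypothetical keyword-recall results
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
    # prints ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it can combine vector and keyword recall without calibrating their raw scores against each other; a later re-ranking stage can then refine the fused list.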

5. 【2603.16169】Open-Source Reproduction and Explainability Analysis of Corrective Retrieval Augmented Generation

Link: https://arxiv.org/abs/2603.16169

Authors: Surya Vardhan Yalavarthi

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval Augmented Generation, Corrective Retrieval Augmented, triggering corrective actions, Augmented Generation, evaluating retrieved document

Comments: 13 pages, 4 figures

6. 【2603.16138】Answer Bubbles: Information Exposure in AI-Mediated Search

Link: https://arxiv.org/abs/2603.16138

Authors: Michelle Huang,Agam Goyal,Koustuv Saha,Eshwar Chandrasekharan

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: increasingly replacing link-based, replacing link-based retrieval, Generative search systems, increasingly replacing, replacing link-based

Comments: Preprint: 12 pages, 2 figures, 6 tables

7. 【2603.16088】RecBundle: A Next-Generation Geometric Paradigm for Explainable Recommender Systems

Link: https://arxiv.org/abs/2603.16088

Authors: Hui Wang,Tianzhu Hu,Mingming Li,Xi Zhou,Chun Gan,Jiao Dai,Jizhong Han,Songlin Hu,Tao Guo

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: macroscopic structural degradation, prolonged local interactions, local interactions accumulate, inherently dynamic feedback, dynamic feedback loops

Comments:

Abstract:Recommender systems are inherently dynamic feedback loops where prolonged local interactions accumulate into macroscopic structural degradation such as information cocoons. Existing representation learning paradigms are universally constrained by the assumption of a single flat space, forcing topologically grounded user associations and semantically driven historical interactions to be fitted within the same vector space. This excessive coupling of heterogeneous information renders it impossible for researchers to mechanistically distinguish and identify the sources of systemic bias. To overcome this theoretical bottleneck, we introduce Fiber Bundle from modern differential geometry and propose a novel geometric analysis paradigm for recommender systems. This theory naturally decouples the system space into two hierarchical layers: the base manifold formed by user interaction networks, and the fibers attached to individual user nodes that carry their dynamic preferences. Building upon this, we construct RecBundle, a framework oriented toward next-generation recommender systems that formalizes user collaboration as geometric connection and parallel transport on the base manifold, while mapping content evolution to holonomy transformations on fibers. From this foundation, we identify future application directions encompassing quantitative mechanisms for information cocoons and evolutionary bias, geometric meta-theory for adaptive recommendation, and novel inference architectures integrating large language models (LLMs). Empirical analysis on real-world MovieLens and Amazon Beauty datasets validates the effectiveness of this geometric framework.

8. 【2603.15892】Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN

Link: https://arxiv.org/abs/2603.15892

Authors: Ritajit Dey, Iadh Ounis, Graham McDonald, Yashar Moshfeghi

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, fact conflicts due, training data, temporal facts

9. 【2603.15726】MiroThinker-1.7 H1: Towards Heavy-Duty Research Agents via Verification

Link: https://arxiv.org/abs/2603.15726

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: research agent designed, agent designed, reasoning, complex long-horizon reasoning, long-horizon reasoning tasks

Comments: 23 pages

10. 【2603.15713】Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences

Link: https://arxiv.org/abs/2603.15713

Authors: Artem Sakhno, Ivan Sergeev, Alexey Shestov, Omar Zoloev, Elizaveta Kovtun, Gleb Gusev, Andrey Savchenko, Maksim Makarenko

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: financial systems operate, Industrial financial systems, user actions, operate on temporal, temporal event sequences

Abstract:Industrial financial systems operate on temporal event sequences such as transactions, user actions, and system logs. While recent research emphasizes representation learning and large language models, production systems continue to rely heavily on handcrafted statistical features due to their interpretability, robustness under limited supervision, and strict latency constraints. This creates a persistent disconnect between learned embeddings and feature-based pipelines. We introduce Embedding-Aware Feature Discovery (EAFD), a unified framework that bridges this gap by coupling pretrained event-sequence embeddings with a self-reflective LLM-driven feature generation agent. EAFD iteratively discovers, evaluates, and refines features directly from raw event sequences using two complementary criteria: *alignment*, which explains information already encoded in embeddings, and *complementarity*, which identifies predictive signals missing from them. Across both open-source and industrial transaction benchmarks, EAFD consistently outperforms embedding-only and feature-based baselines, achieving relative gains of up to +5.8% over state-of-the-art pretrained embeddings, resulting in new state-of-the-art performance across event-sequence datasets.

11. 【2603.15711】Knowledge Graph Extraction from Biomedical Literature for Alkaptonuria Rare Disease

Link: https://arxiv.org/abs/2603.15711

Authors: Giang Pham, Rebecca Finetti, Caterina Graziani, Bianca Roncaglia, Asma Bendjeddou, Linda Brodo, Sara Brunetti, Moreno Falaschi, Stefano Forti, Silvia Giulia Galfré, Paolo Milazzo, Corrado Priami, Annalisa Santucci, Ottavia Spiga, Alina Sîrbu

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)

Keywords: ultra-rare autosomal recessive, autosomal recessive metabolic, homogentisic acid, fluids and tissues, autosomal recessive

Abstract:Alkaptonuria (AKU) is an ultra-rare autosomal recessive metabolic disorder caused by mutations in the HGD (Homogentisate 1,2-Dioxygenase) gene, leading to a pathological accumulation of homogentisic acid (HGA) in body fluids and tissues. This leads to systemic manifestations, including premature spondyloarthropathy, renal and prostatic stones, and cardiovascular complications. Being ultra-rare, the amount of data related to the disease is limited, both in terms of clinical data and literature. Knowledge graphs (KGs) can help connect the limited knowledge about the disease (basic mechanisms, manifestations and existing therapies) with other knowledge; however, AKU is frequently underrepresented or entirely absent in existing biomedical KGs. In this work, we apply a text-mining methodology based on PubTator3 for large-scale extraction of biomedical relations. We construct two KGs of different sizes, validate them using existing biochemical knowledge and use them to extract genes, diseases and therapies possibly related to AKU. This computational framework reveals the systemic interactions of the disease, its comorbidities, and potential therapeutic targets, demonstrating the efficacy of our approach in analyzing rare metabolic disorders.

12. 【2603.15658】Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents

Link: https://arxiv.org/abs/2603.15658

Authors: Madhava Gaikwad

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: maintain multiple specialized, agents maintain multiple, introducing irrelevant context, multiple specialized stores, Memory-augmented agents maintain

Comments: Accepted at the ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems

13. 【2603.15634】NextMem: Towards Latent Factual Memory for LLM-based Agents

Link: https://arxiv.org/abs/2603.15634

Authors: Zeyu Zhang, Rui Li, Xiaoyan Zhao, Yang Zhang, Wenjie Wang, Xu Chen, Tat-Seng Chua

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: preserve past observations, factual memory serves, future decision-making, foundational part, factual memory

Comments: 17 pages, 7 figures, 4 tables

Abstract:Memory is critical for LLM-based agents to preserve past observations for future decision-making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two-stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at this https URL.

14. 【2603.15623】Finder: A Multimodal AI-Powered Search Framework for Pharmaceutical Data Retrieval

Link: https://arxiv.org/abs/2603.15623

Authors: Suyash Mishra, Srikanth Patil, Satyanarayan Pati, Sagar Sahu, Baddu Narendra

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: transforming pharmaceutical search, traditional systems struggle, manual curation, transforming pharmaceutical, struggle with multimodal

Abstract:AI is transforming pharmaceutical search, where traditional systems struggle with multimodal content and manual curation. Finder is a scalable AI-powered framework that unifies retrieval across text, images, audio, and video using hybrid vector search, combining sparse lexical and dense semantic models. Its modular pipeline ingests diverse formats, enriches metadata, and stores content in a vector-native backend. Finder supports reasoning-aware natural language search, improving precision and contextual relevance. The system has processed over 291,400 documents, 31,070 videos, and 1,192 audio files in 98 languages. Techniques like hybrid fusion, chunking, and metadata-aware routing enable intelligent access across regulatory, research, and commercial domains.

Computer Vision

1. 【2603.16871】WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Link: https://arxiv.org/abs/2603.16871

Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video diffusion transformers, explore generated environments, Recent advances, extended horizons, advances in video

Comments: Project page is available at [this https URL](https://cvlab-kaist.github.io/WorldCam/)

Abstract:Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.

2. 【2603.16870】Demystifing Video Reasoning

Link: https://arxiv.org/abs/2603.16870

Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, non-trivial reasoning capabilities, models exhibit non-trivial, exhibit non-trivial reasoning, diffusion-based video models

Comments: Homepage: [this https URL](https://www.wruisi.com/demystifying_video_reasoning)

Abstract:Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

3. 【2603.16869】SegviGen: Repurposing 3D Generative Model for Part Segmentation

Link: https://arxiv.org/abs/2603.16869

Authors: Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: part segmentation, segmentation, repurposes native, part, interactive part segmentation

Comments: Project page: [this https URL](https://fenghora.github.io/SegviGen-Page/)

Abstract:We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at this https URL.

4. 【2603.16868】MessyKitchens: Contact-rich object-level 3D scene reconstruction

Link: https://arxiv.org/abs/2603.16868

Authors: Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: scene reconstruction, object-level scene reconstruction, Monocular, reconstruction, scene

Abstract:Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate MessyKitchens to significantly improve previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: this https URL.

5. 【2603.16864】SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Link: https://arxiv.org/abs/2603.16864

Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: correct unexpected artifacts, existing VSR approaches, VSR approaches behave, reliably correct unexpected, restore high-quality video

Abstract:Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: this https URL

6. 【2603.16858】SOMA: Unifying Parametric Human Body Models

Link: https://arxiv.org/abs/2603.16858

Authors: Jun Saito, Jiefeng Li, Michael de Ruyter, Miguel Guerrero, Edy Lim, Ehsan Hassani, Roger Blanco Ribera, Hyejin Moon, Magdalena Dadela, Marco Di Lucca, Qiao Wang, Xueting Li, Jan Kautz, Simon Yuen, Umar Iqbal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Parametric human body, remain mutually incompatible, Parametric human, human reconstruction, mutually incompatible

Abstract:Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology abstraction maps any source model's identity to a shared canonical mesh in constant time per vertex. Skeletal abstraction recovers a full set of identity-adapted joint transforms from any body shape, whether in rest pose or an arbitrary posed configuration, in a single closed-form pass, with no iterative optimization or per-model training. Pose abstraction inverts the skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model, enabling heterogeneous motion datasets to be consumed without custom retargeting. Together, these layers reduce the $O(M^2)$ per-pair adapter problem to $O(M)$ single-backend connectors, letting practitioners freely mix identity sources and pose data at inference time. The entire pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.
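The reduction from $O(M^2)$ per-pair adapters to $O(M)$ connectors in the SOMA abstract can be made concrete with a small count. The sketch below only illustrates the arithmetic; the function names are hypothetical, and it assumes one directed converter per ordered pair of models in the pairwise case and one connector per model in the shared-backend case.

```python
def pairwise_adapters(num_models):
    """Directed converters needed when every model maps to every other model."""
    return num_models * (num_models - 1)


def hub_connectors(num_models):
    """Connectors needed when every model maps only into one shared backend."""
    return num_models


# Compare the two designs as the number of supported body models grows.
for m in (2, 4, 8):
    print(m, pairwise_adapters(m), hub_connectors(m))
```

With 8 body models the pairwise design needs 56 directed converters, while the shared-backend design needs 8 connectors, and adding a ninth model costs one new connector instead of sixteen new converters.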

7. 【2603.16844】M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Link: https://arxiv.org/abs/2603.16844

Authors: Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video remains challenging, computationally efficient online, efficient online refinement, uncalibrated monocular video, monocular video remains

Comments: Project page: [this https URL](https://city-super.github.io/M3/)

Abstract:Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.

8. 【2603.16840】What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Link: https://arxiv.org/abs/2603.16840

Authors: Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper, Ronan Docherty

Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)

Keywords: learn rich representations, Vision transformers, learn rich, downstream tasks, rich representations

Abstract:Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
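ALiBi replaces learned or sinusoidal position embeddings with a distance-proportional penalty added to the attention logits, so the bias depends only on the relative offset between two tokens. The sketch below shows one way this could look on a patch grid. The per-head slopes follow the original ALiBi recipe for language models; the use of Euclidean distance between patch coordinates and the function names are assumptions, since the abstract does not specify the 2D variant used.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/H) for h = 1..H, as in the original ALiBi recipe."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias_2d(grid_h, grid_w, num_heads):
    """Return bias[head][i][j] = -slope * distance(patch i, patch j).

    Patches are indexed row by row; distance is Euclidean in grid units
    (an assumption made for this sketch).
    """
    coords = [(y, x) for y in range(grid_h) for x in range(grid_w)]
    dist = [[math.hypot(yi - yj, xi - xj) for (yj, xj) in coords]
            for (yi, xi) in coords]
    return [[[-slope * d for d in row] for row in dist]
            for slope in alibi_slopes(num_heads)]


# Bias for a 2x2 patch grid and 8 attention heads; it would be added to the
# pre-softmax attention logits of the corresponding head.
bias = alibi_bias_2d(grid_h=2, grid_w=2, num_heads=8)
```

Because the bias is a function of the offset between patches only, it carries no notion of absolute position, which is the property the abstract relies on for images with no preferred direction.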

9. 【2603.16835】An assessment of data-centric methods for label noise identification in remote sensing data sets

Link: https://arxiv.org/abs/2603.16835

Authors: Felix Kröber, Genc Hoxha, Ribana Roscher

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep learning models, Label noise, noise, data-centric label noise, Label

Comments: Accepted for publication in International Society for Photogrammetry and Remote Sensing (ISPRS) Annals 2026

Abstract:Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise with noise levels ranging from 10 to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performances. With our analyses, we clearly prove the value of data-centric methods for both parts - label noise identification and task performance improvements. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.
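The evaluation protocol in this abstract, injecting label noise at a known rate and then measuring how well a method identifies it, can be sketched as follows. This is an illustrative setup only: it covers uniform (symmetric) noise, the function names and the seed are hypothetical, and the paper's actual noise types and detection methods are not reproduced here.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fixed fraction of labels to a different, uniformly chosen class.

    Returns the noisy labels and the sorted indices that were corrupted.
    """
    rng = random.Random(seed)
    num_noisy = round(noise_rate * len(labels))
    noisy_idx = sorted(rng.sample(range(len(labels)), num_noisy))
    noisy = list(labels)
    for i in noisy_idx:
        alternatives = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy, noisy_idx


def precision_recall(flagged, truly_noisy):
    """Score a set of flagged indices against the truly corrupted indices."""
    flagged, truly_noisy = set(flagged), set(truly_noisy)
    true_pos = len(flagged & truly_noisy)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall


# 100 samples over 5 classes, 30% of labels corrupted.
clean = [i % 5 for i in range(100)]
noisy, flipped = inject_symmetric_noise(clean, num_classes=5, noise_rate=0.3)
```

Since the corrupted indices are known, any noise-identification method can be scored by passing the indices it flags to `precision_recall`.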

10. 【2603.16823】Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

Link: https://arxiv.org/abs/2603.16823

Authors: Sourya Saha (City University of New York), Saptarshi Debroy (City University of New York)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: applications introduce latency-critical, nearby edge servers, Immersive extended reality, satisfy stringent real-time, stringent real-time responsiveness

Comments: Accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)

Abstract:Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.

11. 【2603.16816】WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Link: https://arxiv.org/abs/2603.16816

Authors: Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jiaxing Jhong, Xinyu Hou, Amir Patel, Andrew Markham

Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: computer vision, extensively studied, studied as core, core topics, topics in computer

Abstract:Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. However, for the animal, in particular, the majority of existing models are trained based on datasets without metric scale, which can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.

12. 【2603.16797】Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Link: https://arxiv.org/abs/2603.16797

Authors: Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Carla P. Gomes

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Guided diffusion sampling, Guided diffusion, diffusion sampling relies, intractable likelihood scores, introduces significant noise

Abstract:Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
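Adaptive moment estimation is the mechanism behind the Adam optimizer: it keeps exponential moving averages of a gradient and of its square, and uses their ratio as a smoothed, scale-normalized direction. The sketch below applies that recipe to a stream of noisy guidance gradients. How the paper wires this into a diffusion sampler is not stated in the abstract, so the class name, the hyperparameter defaults (Adam's usual ones), and the idea of using the output as the guidance direction are all assumptions.

```python
import math


class AdaptiveMomentGuidance:
    """Adam-style smoothing of a noisy guidance gradient across sampling steps."""

    def __init__(self, dim, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * dim  # running mean of gradients (first moment)
        self.v = [0.0] * dim  # running mean of squared gradients (second moment)
        self.t = 0            # number of gradients seen so far

    def step(self, grad):
        """Consume one noisy gradient and return the stabilized direction."""
        self.t += 1
        out = []
        for i, g in enumerate(grad):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            out.append(m_hat / (math.sqrt(v_hat) + self.eps))
        return out


# Hypothetical usage: one estimator per sample, fed at every sampling step.
guide = AdaptiveMomentGuidance(dim=2)
direction = guide.step([2.0, -4.0])
```

On the first step the output is approximately the sign of each gradient component; over later steps, components whose gradients fluctuate in sign are damped because their first moment averages toward zero while their second moment does not.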

13. 【2603.16792】V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Link: https://arxiv.org/abs/2603.16792

Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Pixel-space diffusion, enabling high-quality generation, recently re-emerged, alternative to latent, high-quality generation

Comments: code: [this https URL](https://github.com/HL-hanlin/V-Co)

Abstract:Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
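Two of the ingredients named in the V-Co abstract build on standard operations that are easy to state in code: classifier-free guidance (CFG) mixes a conditional and an unconditional prediction, and RMS-based rescaling brings one feature stream to a chosen magnitude. The sketch below shows only these generic building blocks. It does not implement the paper's structurally defined unconditional prediction or its specific calibration scheme, and the function names are hypothetical.

```python
import math


def classifier_free_guidance(uncond, cond, scale):
    """Standard CFG mix: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def rms(values):
    """Root mean square of a flat feature vector."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def rms_rescale(features, target_rms, eps=1e-8):
    """Scale features so their RMS matches target_rms."""
    factor = target_rms / (rms(features) + eps)
    return [x * factor for x in features]


# Guidance scale 1.0 returns the conditional prediction unchanged;
# larger scales push further away from the unconditional one.
guided = classifier_free_guidance([0.0, 0.0], [1.0, 2.0], scale=2.0)

# Bring a semantic-feature stream to unit RMS before mixing it with pixels.
calibrated = rms_rescale([3.0, 4.0], target_rms=1.0)
```

Rescaling by RMS keeps the direction of the feature vector and changes only its magnitude, which is one simple way to stop one stream from dominating the other when two streams interact.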

14. 【2603.16781】IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Link: https://arxiv.org/abs/2603.16781

Authors: Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: routine dentistry due, abundant geometric evidence, IOS diagnosis VQA, documentation and communication, unified multi-disease diagnosis

Comments:

Abstract:3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

15. 【2603.16769】GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

Link: https://arxiv.org/abs/2603.16769

Authors: Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Lei Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Direct Preference Optimization, one-step generative ISR, generative ISR, Group Direct Preference, Relative Policy Optimization

Comments:

Abstract:Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: this https URL.
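The group-relative advantage that the abstract borrows from GRPO can be illustrated with the usual normalization: each sample's reward is compared with the mean and standard deviation of its own group. The sketch below shows that generic GRPO-style computation only. How GDPO combines it with the DPO objective and the attribute-aware reward is not described in enough detail here to reproduce, and the reward values are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples (GRPO-style).

    `rewards` holds one scalar per output generated for the same input.
    Using the population standard deviation is an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four super-resolved outputs of one input image.
rewards = [0.62, 0.71, 0.55, 0.80]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so no separate offline preference pairs are needed to rank them.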

16. 【2603.16760】Dual Stream Independence Decoupling for True Emotion Recognition under Masked Expressions

Link: https://arxiv.org/abs/2603.16760

Authors: Jinsheng Wei, Xiguang Zhang, Zheng Shi, Guanming Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recognizing true emotions, Recognizing true, extremely challenging due, deliberate concealment, true emotions

Comments:

Abstract:Recognizing true emotions from masked expressions is extremely challenging due to deliberate concealment. Existing paradigms recognize true emotions from masked-expression clips that contain onsetframes, in which the disguise is only just beginning. However, this paradigm may not reflect the actual disguised state, as the onsetframe leaks true emotional information before a stable disguise state is reached. Thus, this paper introduces a novel apexframe-based paradigm that classifies true emotions from the apexframe, which shows a stable disguised state. Furthermore, this paper proposes a novel dual stream independence decoupling framework that decouples true and disguised emotion features, avoiding the interference of disguised emotions with true emotions. For efficient decoupling, we design a decoupling loss group, comprising two classification losses that learn true emotion and disguised expression features, respectively, and a Hilbert-Schmidt Independence loss that enhances the independence of the two features. Experiments demonstrate that the apexframe-based paradigm is challenging, and that the proposed decoupling framework improves recognition performance.
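For readers unfamiliar with the Hilbert-Schmidt Independence Criterion (HSIC), the sketch below computes the standard biased empirical estimator, $\mathrm{tr}(KHLH)/(n-1)^2$, with Gaussian kernels. It is meant only to show what quantity an HSIC-style loss drives toward zero. The kernel choice, bandwidth, and toy features are assumptions, and the paper's exact loss may differ.

```python
import math


def gaussian_kernel(xs, sigma=1.0):
    """Pairwise Gaussian kernel matrix for a list of feature vectors."""
    n = len(xs)
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])) / (2 * sigma**2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def center(K):
    """Double-center a symmetric kernel matrix, i.e. compute H K H."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_kernel(xs, sigma))
    L = gaussian_kernel(ys, sigma)
    # tr(H K H L) equals the elementwise sum because L is symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


# Toy one-dimensional "features": one set depends on xs, one carries no information.
xs = [[i / 5] for i in range(10)]
ys_dependent = [[2 * x[0]] for x in xs]
ys_constant = [[0.0] for _ in xs]

print(hsic(xs, ys_dependent), hsic(xs, ys_constant))
```

Minimizing such a term between the two feature streams penalizes statistical dependence between them, which matches the stated goal of keeping true and disguised emotion features independent.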

17. 【2603.16758】SuCor: Susceptibility Distortion Correction via Parameter-Free and Self-Regularized Optimal Transport

Link: https://arxiv.org/abs/2603.16758

Authors: Sreekar Chigurupati, Eleftherios Garyfallidis

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: echo planar imaging, correcting susceptibility induced, susceptibility induced geometric, phase encoding direction, induced geometric distortions

Comments:

Abstract:We present SuCor, a method for correcting susceptibility-induced geometric distortions in echo planar imaging (EPI) using optimal transport (OT) along the phase encoding direction. Given a pair of reversed phase encoding EPI volumes, we model each column of the distortion field as a Wasserstein-2 barycentric displacement between the opposing-polarity intensity profiles. Regularization is performed in the spectral domain using a bending-energy penalty whose strength is selected automatically via the Morozov discrepancy principle, requiring no manual tuning. On a Human Connectome Project (HCP) dataset with left-right/right-left b0 EPI pairs and a co-registered T1 structural reference, SuCor achieves a mean volumetric mutual information of 0.341 with the T1 image, compared to 0.317 for FSL TOPUP, while running in approximately 12 seconds on a single CPU core.
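In one dimension, optimal transport between two intensity profiles reduces to matching their quantile functions, and the Wasserstein-2 barycenter sits halfway between matched quantiles. The sketch below illustrates that idea on a single toy column: it estimates how far each quantile of one profile must move to reach the midpoint. It is a simplified illustration under stated assumptions (integer pixel positions, no regularization, made-up profiles, hypothetical function names) and not SuCor's implementation.

```python
from bisect import bisect_left
from itertools import accumulate


def quantile_positions(profile, quantiles):
    """Generalized inverse CDF of a non-negative 1D intensity profile.

    Returns, for each quantile q, the first pixel index whose cumulative
    normalized intensity reaches q.
    """
    total = sum(profile)
    cdf = [c / total for c in accumulate(profile)]
    return [min(bisect_left(cdf, q), len(profile) - 1) for q in quantiles]


def half_displacement(profile_up, profile_down, n_quantiles=64):
    """Per-quantile shift that moves `profile_up` to the W2 midpoint."""
    quantiles = [(k + 0.5) / n_quantiles for k in range(n_quantiles)]
    pos_up = quantile_positions(profile_up, quantiles)
    pos_down = quantile_positions(profile_down, quantiles)
    return [(d - u) / 2 for u, d in zip(pos_up, pos_down)]


# Toy column: the same bump displaced by +2 and -2 pixels around index 4..8.
bump = [1, 3, 5, 3, 1]
profile_up = [0] * 6 + bump + [0] * 9
profile_down = [0] * 2 + bump + [0] * 13

shifts = half_displacement(profile_up, profile_down)
print(shifts[:4])
```

For this symmetric toy case every quantile of the first profile needs to move two pixels toward the second, which is the behavior one would expect from a pair of opposing-polarity acquisitions distorted by equal and opposite amounts.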

18. 【2603.16747】Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation

Link: https://arxiv.org/abs/2603.16747

Authors: Chenggong Hu, Yi Wang, Mengqi Xue, Haofei Zhang, Jie Song, Li Sun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Textile pattern generation, pattern images based, synthesize fine-grained textile, aims to synthesize, textile pattern images

Comments: 9 pages, 7 figures, accepted by AAAI 2026, the code is available at [this https URL](https://github.com/Cg-Hu/SLDDM-TPG)

Abstract:Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset.

19. 【2603.16742】When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Link: https://arxiv.org/abs/2603.16742

Authors: Zhen Xu, Jinsu Yoo, Cristian Bautista, Zanming Huang, Tai-Yu Pan, Zhenzhen Liu, Katie Z Luo, Mark Campbell, Bharath Hariharan, Wei-Lun Chao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Building robust, large-scale data collection, self-driving still relies, relies heavily, heavily on large-scale

Comments: Project Page: [this https URL](https://jinsuyoo.info/civet/)

Abstract:Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

20. 【2603.16737】Retrieving Counterfactuals Improves Visual In-Context Learning

Link: https://arxiv.org/abs/2603.16737

Authors: Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: achieved impressive performance, disentangle fine-grained visual, underlying causal relationships, fine-grained visual attributes, multimodal reasoning tasks

Comments: CVPR 2026

21. 【2603.16736】World Reconstruction From Inconsistent Views

Link: https://arxiv.org/abs/2603.16736

Authors: Lukas Höllein, Matthias Nießner

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion models generate, models generate high-quality, output sequence, Video diffusion models, generate high-quality

Comments: project website: [this https URL](https://lukashoel.github.io/video_to_world), video: [this https URL](https://www.youtube.com/watch?v=E4AU7G-WyMI), code: [this https URL](https://github.com/lukasHoel/video_to_world)

Abstract:Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.

22. 【2603.16719】Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

Link: https://arxiv.org/abs/2603.16719

Authors: Hai Nguyen, Hieu Dao, Hung Nguyen, Nam Vu, Cong Tran

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: study presents high-throughput, multi-agent affective computing, real-time multi-agent affective, Classroom Emotion Dataset, computing framework designed

Comments:

Abstract:This study presents a high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher-student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students' emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the 'Classroom Emotion Dataset' to facilitate further validation and research.

23. 【2603.16711】Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Link: https://arxiv.org/abs/2603.16711

Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object-level motion editing, training-free framework, framework for object-level, object, motion

Comments: 14 pages, 9 figures

Abstract:We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

24. 【2603.16685】vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots

Link: https://arxiv.org/abs/2603.16685

Authors: Adam Zahir, Michele Gucciardo, Falk Selker, Anastasios Nanos, Kostis Papazafeiropoulos, Carlos J. Bernardos, Nicolas Weber, Roberto Gonzalez

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deployed for inspection, autonomous decision-making, Mobile robots, increasingly deployed, computer vision

Comments:

Abstract:Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.

25. 【2603.16679】HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture

Link: https://arxiv.org/abs/2603.16679

Authors: Aojie Yuan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: coarse classification labels, existing systems suffer, uniform feature encoding, ambiguous similarity metrics, fine-grained region-specific retrieval

Comments: 8 pages, 7 figures, 1 table

Abstract:Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.
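Hamming-distance search over binary hash codes, which the abstract relies on for efficient retrieval, can be sketched in a few lines: XOR two codes and count the differing bits. The example below uses 8-bit codes and made-up image identifiers purely for readability. The paper uses 64-bit and 128-bit codes produced by its KAN layers, which are not reproduced here.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")


def search(query, database, k=3):
    """Return the identifiers of the k codes closest to `query`."""
    ranked = sorted(
        database.items(), key=lambda item: hamming_distance(query, item[1])
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical 8-bit hash codes keyed by illustrative image identifiers.
database = {
    "ct_001": 0b1011_0010,
    "ct_002": 0b1011_0110,
    "ct_003": 0b0100_1101,
    "ct_004": 0b1111_1010,
}
query = 0b1011_0011

top = search(query, database, k=2)
print(top)
```

Because the comparison is a single XOR plus a bit count per candidate, ranking a large database this way is far cheaper than comparing floating-point embeddings, which is the usual motivation for hashing in retrieval.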

26. 【2603.16671】$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

Link: https://arxiv.org/abs/2603.16671

Authors: Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang Gong, Jingao Xu, Xinlei Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating dense, dynamic scene understanding, essential for dynamic, Event Edge Space

Comments: This version is the camera-ready version accepted at CVPR 2026

Abstract:Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared space. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

27. 【2603.16669】Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

Link: https://arxiv.org/abs/2603.16669

Authors: Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Simulating robot-world interactions, Simulating robot-world, Simulating, robot, robot-world interactions

Comments: Project page: [this https URL](https://mutianxu.github.io/Kinema4D-project-page/)

Abstract:Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.

28. 【2603.16666】Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Link: https://arxiv.org/abs/2603.16666

Authors: Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: World Action Models, explicitly model, Action Models, VLA, promising alternative

Comments:

Abstract:World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4$\times$ faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: this https URL

29. 【2603.16664】Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Link: https://arxiv.org/abs/2603.16664

Authors: Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie, Yuyin Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large vision-language models, Large vision-language, multimodal tasks, narrows their deployment, remain prone

Comments: 16 pages, 11 figures, 5 tables

Abstract:Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., the integrated self-refinement module and the grounding agent each contribute an average +2.0% gain on POPE.

30. 【2603.16662】Spectral Property-Driven Data Augmentation for Hyperspectral Single-Source Domain Generalization

Link: https://arxiv.org/abs/2603.16662

Authors: Taiqin Chen, Yifeng Wang, Xiaochen Feng, Zhilin Zhu, Hao Sha, Yingjian Li, Yongbing Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: affect classification performance, provide rich information, sensor variability make, classification performance, affect classification

Comments:

Abstract:While hyperspectral images (HSI) benefit from numerous spectral channels that provide rich information for classification, the increased dimensionality and sensor variability make them more sensitive to distributional discrepancies across domains, which in turn can affect classification performance. To tackle this issue, hyperspectral single-source domain generalization (SDG) typically employs data augmentation to simulate potential domain shifts and enhance model robustness under the condition of single-source domain training data availability. However, blind augmentation may produce samples misaligned with real-world scenarios, while excessive emphasis on realism can suppress diversity, highlighting a tradeoff between realism and diversity that limits generalization to target domains. To address this challenge, we propose a spectral property-driven data augmentation (SPDDA) that explicitly accounts for the inherent properties of HSI, namely the device-dependent variation in the number of spectral channels and the mixing of adjacent channels. Specifically, SPDDA employs a spectral diversity module that resamples data from the source domain along the spectral dimension to generate samples with varying spectral channels, and constructs a channel-wise adaptive spectral mixer by modeling inter-channel similarity, thereby avoiding fixed augmentation patterns. To further enhance the realism of the augmented samples, we propose a spatial-spectral co-optimization mechanism, which jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint. Moreover, the weight of the spectral self-constraint is adaptively adjusted based on the spatial counterpart, thus preventing over-smoothing in the spectral dimension and preserving spatial structure. Extensive experiments conducted on three remote sensing benchmarks demonstrate that SPDDA outperforms state-of-the-art methods.

31. 【2603.16653】HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Link: https://arxiv.org/abs/2603.16653

Authors: Md Jahidul Islam

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adapting large-scale Vision-Language, Adapting large-scale, large-scale Vision-Language Models, CLIP to downstream, uniformly by wide

Comments:

Abstract:Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D - D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at this https URL.

32. 【2603.16652】Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

Link: https://arxiv.org/abs/2603.16652

Authors: Chenchang Liu,Felix Fornoff,Annika Grasreiner,Patrick Maeder,Henri Greil,Marco Seeland

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monitoring cavity-nesting wild, cavity-nesting wild bees, Monitoring cavity-nesting, research and conservation, brood cells

Comments:

Abstract:Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. We evaluate our approach on a dataset of 712 LTN images collected over one season, covering 28 fine-grained classes describing the taxonomy and status of brood cells. To minimize labeling effort, we limit the training set to a maximum of 300 labels per class. Experimental results demonstrate that deep learning can be effectively used to detect brood cells in LTNs. Our CFPL method further improves performance and balances model accuracy and labeling effort while also mitigating class imbalance.
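
The masking idea can be illustrated with a binary cross-entropy that simply skips predictions falling in unlabeled regions. This is only a sketch of loss masking in general, not the paper's CFPL: the function name `masked_bce` and the `labeled_mask` input are hypothetical, and the abstract does not say how the mask is derived dynamically during training.

```python
import math


def masked_bce(probs, targets, labeled_mask, eps=1e-7):
    """Mean binary cross-entropy over predictions whose mask entry is True.

    probs        -- predicted probabilities in (0, 1)
    targets      -- 0/1 targets (ignored where the mask is False)
    labeled_mask -- True where a prediction is covered by an annotation
    """
    total, count = 0.0, 0
    for p, t, keep in zip(probs, targets, labeled_mask):
        if not keep:
            continue  # unlabeled region: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0
```

Without the mask, a confident detection of an unlabeled brood cell would be penalized as a false positive; with it, only annotated regions drive the classification loss.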

33. 【2603.16649】Mixture of Style Experts for Diverse Image Stylization

Link: https://arxiv.org/abs/2603.16649

Authors: Shihao Zhu,Ziheng Ouyang,Yijia Kang,Qilong Wang,Mi Zhou,Bo Li,Ming-Ming Cheng,Qibin Hou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion-based stylization, semantic-aware framework based, neglecting complex semantics, advanced significantly

Comments:

Abstract:Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: this https URL.
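
As a rough illustration of similarity-based routing in a Mixture of Experts, the sketch below scores a style embedding against one key vector per expert with cosine similarity and normalizes the top-k scores with a softmax. The names `route` and `expert_keys`, the top-k selection, and the `temperature` value are assumptions for illustration; the abstract does not specify how StyleExpert's gate is actually parameterized.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(style_embedding, expert_keys, top_k=2, temperature=0.1):
    """Return {expert_index: weight} for the top_k most similar experts."""
    sims = [cosine(style_embedding, key) for key in expert_keys]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
    # Softmax over the selected experts only, so the weights sum to 1.
    peak = max(sims[i] for i in ranked)
    exps = {i: math.exp((sims[i] - peak) / temperature) for i in ranked}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}
```

For example, `route([1.0, 0.1], [[1, 0], [0, 1], [-1, 0]])` selects experts 0 and 1 and gives expert 0 most of the weight, because the embedding is nearly parallel to its key.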

34. 【2603.16645】BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

Link: https://arxiv.org/abs/2603.16645

Authors: Melissa Schween,Mathis Kruse,Bodo Rosenhahn

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Universal Scene-Specific Anomalous, propose Bijective Universal, Bijective Universal Scene-Specific, detecting anomalous relations, Scene-Specific Anomalous Relationship

Comments: CVPR 2026 Main Track

Abstract:We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at this https URL .
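
The likelihood-based scoring described here can be sketched with the simplest possible bijection, a per-dimension affine map to a standard Gaussian. This is a toy stand-in, not BUSSARD itself: practical normalizing flows stack learned invertible layers, and the `samples` below are hypothetical placeholders for the language-model embeddings of object-relation-object triplets.

```python
import math


def fit_affine_flow(samples):
    """Fit z = (x - mean) / std per dimension; returns (means, stds)."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((s[d] - means[d]) ** 2 for s in samples) / n
        stds.append(math.sqrt(var) + 1e-6)  # small offset avoids division by zero
    return means, stds


def log_likelihood(x, means, stds):
    """Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|."""
    total = 0.0
    for xi, m, s in zip(x, means, stds):
        z = (xi - m) / s
        total += -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)  # base density
        total += -math.log(s)                                   # log|dz/dx|
    return total


def anomaly_score(x, means, stds):
    """Higher score means lower likelihood under the fitted flow."""
    return -log_likelihood(x, means, stds)
```

A triplet whose embedding lies far from the training data maps to a point deep in the tail of the base distribution and therefore receives a high anomaly score.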

35. 【2603.16641】FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Link: https://arxiv.org/abs/2603.16641

Authors: Zhenqi He,Lin Li,Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Compositional zero-shot learning, recognize unseen attribute-object, Compositional zero-shot, unseen attribute-object compositions, recombining primitives learned

Comments: Accepted to CVPR 2026

Abstract:Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
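
A toy sketch of the underlying mechanics, under strong simplifications: for the straight-line path x_t = (1 - t) * x0 + t * x1 commonly used in flow matching, the target velocity is x1 - x0, and two such velocity fields can be fused before integrating. The convex combination in `compose` and its weight `alpha` are hypothetical placeholders for the paper's learnable Composer, and the velocities here are constants rather than learned networks.

```python
def straight_velocity(source, target):
    """Flow-matching target for the linear path x_t = (1 - t) * x0 + t * x1."""
    return [t - s for s, t in zip(source, target)]


def compose(v_attr, v_obj, alpha=0.5):
    """Hypothetical composer: convex combination of two velocity fields."""
    return [alpha * a + (1.0 - alpha) * o for a, o in zip(v_attr, v_obj)]


def euler_transport(x, velocity, steps=10):
    """Integrate dx/dt = velocity from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    for _ in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x
```

Starting from a visual feature at the origin with an attribute embedding at (2, 0) and an object embedding at (0, 2), the fused velocity is (1, 1), so transport ends near (1, 1), between the two primitives.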

36. 【2603.16629】MLLM-based Textual Explanations for Face Comparison

Link: https://arxiv.org/abs/2603.16629

Authors: Redwan Sony,Anil K Jain,Ross Arun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Multimodal Large, Language Models, Large Language

Comments: Accepted at 14th International Workshop on Biometrics and Forensics (IWBF)

Abstract:Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at this https URL.

37. 【2603.16620】TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation

Link: https://arxiv.org/abs/2603.16620

Authors: Qiang He,Wentian Qu,Jiajia Dai,Changsong Lei,Shaofeng Wang,Feifei Zuo,Yajie Wang,Yaqian Liang,Xiaoming Deng,Cuixia Ma,Yong-Jin Liu,Hongan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital dentistry applications, essential for digital, digital dentistry, dentistry applications, dental implants

Comments: 6 pages, 4 figures, ICASSP 2026

Abstract:Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre-orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state-of-the-art approaches.

38. 【2603.16616】ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

Link: https://arxiv.org/abs/2603.16616

Authors: Weiqin Jiao,Hao Cheng,George Vosselman,Claudio Persello

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complete vector map, vector map representation, tackle the problem, problem of generating, generating a complete

Comments: Accepted to CVPR 2026. The supplementary material is available in the conference proceedings

Abstract:We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: this https URL.

39. 【2603.16600】Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Link: https://arxiv.org/abs/2603.16600

Authors: Weijie Qiu,Dai Guan,Junxin Wang,Zhihang Li,Yongbo Gai,Mengyu Zhou,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative reward models, criterion-based scoring, Generative reward, three-stage pipeline, final verdict

Comments: 25 pages, 10 figures

Abstract:Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at this https URL.

40. 【2603.16596】FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation

Link: https://arxiv.org/abs/2603.16596

Authors: Fangjing Li,Zhihai Wang,Xinxin Ding,Haiyang Liu,Ronghua Gao,Rong Wang,Yao Zhu,Ming Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: important visual indicator, Frequency Enhancement Block, Receptive Aggregation Block, important visual, visual indicator

Comments: 10 pages, 6 figures. Accepted by CVPR 2026 Findings

Abstract:Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, and a multiscale self-calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial-Channel Self-Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at this https URL.

41. 【2603.16592】On the Transfer of Collinearity to Computer Vision

Link: https://arxiv.org/abs/2603.16592

Authors: Frederik Beuth,Danny Kowerko

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual perception phenomenon, amplifies spatially aligned, spatially aligned edges, aligned edges arranged, Collinearity

Comments:

Abstract:Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it is unclear what purpose this principle serves for humans in the real world, and its use in computer vision and engineering applications is largely unexplored. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore potential uses of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically and benchmarked it in the context of four use cases. Our cases are selected to span a broad range of potential applications and scenarios: sketching the combination of collinearity with deep learning (case I and II), using collinearity with saliency models (case II), and as a feature detector (case I). In the first use case, we found that collinearity improves the fault detection of wafers, yielding a performance increase by a factor of 1.24 (decrease of the error rate from 6.5% to 5.26%). In the second use case, we test defect recognition in nanotechnology materials and achieve a performance increase of 3.2x via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As a third experiment, we cover occlusions, while as a fourth experiment, we test ImageNet and observe that collinearity might not be very beneficial for ImageNet. Therefore, we can assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions) and for which it is not (ImageNet). Hence, we infer that collinearity might be suitable for industry applications, as it helps if the image structures of interest are man-made, because they often consist of lines. Our work provides another tool for CV, in the hope of capturing the power of human processing.

42. 【2603.16576】REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Link: https://arxiv.org/abs/2603.16576

Authors: Yong Zou,Haoran Li,Fanxiao Li,Shenyang Wei,Yunyun Dong,Li Tang,Wei Zhou,Renyang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: enables high-fidelity content, high-fidelity content creation, Recent progress, image generation models, Image Generation Model

Comments: Accepted by ICME 2026

Abstract:Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: this https URL.

43. 【2603.16570】Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Link: https://arxiv.org/abs/2603.16570

Authors: Amirhossein Kazerouni,Maitreya Suin,Tristan Aumentado-Armstrong,Sina Honari,Amanpreet Walia,Iqbal Mohomed,Konstantinos G. Derpanis,Babak Taati,Alex Levinshtein

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabled high-fidelity recovery, Recent advances, enabled high-fidelity, high-fidelity recovery, inputs using reference-based

Comments: Accepted at CVPR 2026

Abstract:Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

44. 【2603.16566】VideoMatGen: PBR Materials through Joint Generative Modeling

Link: https://arxiv.org/abs/2603.16566

Authors: Jon Hasselgren,Zheng Zeng,Milos Hasan,Jacob Munkberg

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: diffusion transformer architecture, video diffusion transformer, generating physically-based materials, transformer architecture, generating physically-based

Comments:

Abstract:We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.

45. 【2603.16562】Understanding Cell Fate Decisions with Temporal Attention

Link: https://arxiv.org/abs/2603.16562

Authors: Florian Bürger,Martim Dias Gomes,Adrián E. Granada,Noémie Moreau,Katarzyna Bozek

Categories: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)

Keywords: improving cancer therapies, Understanding non-genetic determinants, genetically identical cells, cell fate, exhibit divergent outcomes

Comments: 10 pages, 6 figures

Abstract:Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model's predictions. We demonstrate that prediction of cell outcomes is possible based on the video only: our model achieves a balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals a distinct temporal distribution of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at this https URL.

46. 【2603.16558】Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Link: https://arxiv.org/abs/2603.16558

Authors: Jiale Song,Jiaxin Luo,Xue-song Tang,Kuangrong Hao,Mingbo Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, achieve strong performance, hallucinations severely undermine

Comments:

Abstract:Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
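
One plausible reading of an object-level attention entropy is sketched below: attention weights over image patches are summed per segmentation label, normalized into a distribution over objects, and scored with Shannon entropy. The function name and the per-patch inputs are hypothetical, and the paper's exact SAE definition and its reliability score may differ.

```python
import math


def segment_attention_entropy(attention, segment_ids):
    """Shannon entropy (in nats) of attention mass aggregated per segment.

    attention   -- non-negative attention weight for each image patch
    segment_ids -- segmentation label for each patch (same length)
    """
    mass = {}
    for weight, seg in zip(attention, segment_ids):
        mass[seg] = mass.get(seg, 0.0) + weight
    total = sum(mass.values())
    if total == 0.0:
        return 0.0  # no attention mass at all
    entropy = 0.0
    for m in mass.values():
        p = m / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy
```

Attention concentrated on a single object gives an entropy of 0, while attention spread evenly over four objects gives log(4), about 1.386 nats; under this reading, a high value signals diffuse visual grounding.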

47. 【2603.16551】CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Link: https://arxiv.org/abs/2603.16551

Authors: Mahmoud Ibrahim,Bart Elen,Chang Sun,Gokhan Ertaylan,Michel Dumontier

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generative models, augment medical imaging, medical imaging datasets, imaging datasets, datasets for fairer

Comments:

Abstract:Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at this https URL.

48. 【2603.16549】Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

Link: https://arxiv.org/abs/2603.16549

Authors: Jilles S. van Hulst,W.P.M.H. (Maurice) Heemels,Duarte J. Antunes

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Transmission Electron Microscopes, Scanning Transmission Electron, multiple fields, enabled many scientific, scientific breakthroughs

Comments: This work has been submitted to the IEEE for possible publication

Abstract:Electron microscopy has enabled many scientific breakthroughs across multiple fields. A key challenge is the tuning of microscope parameters based on images to overcome optical aberrations that deteriorate image quality. This calibration problem is challenging due to the high-dimensional and noisy nature of the diagnostic images, and the fact that optimal parameters cannot be identified from a single image. We tackle the calibration problem for Scanning Transmission Electron Microscopes (STEM) by employing variational autoencoders (VAEs), trained on simulated data, to learn low-dimensional representations of images, whereas most existing methods extract only scalar values. We then simultaneously estimate the model that maps calibration parameters to encoded representations and the optimal calibration parameters using an expectation maximization (EM) approach. This joint estimation explicitly addresses the simulation-to-reality gap inherent in data-driven methods that train on simulated data from a digital twin. We leverage the known symmetry property of the optical system to establish global identifiability of the joint estimation problem, ensuring that a unique optimum exists. We demonstrate that our approach is substantially faster and more consistent than existing methods on a real STEM, achieving a 2x reduction in estimation error while requiring fewer observations. This represents a notable advance in automated STEM calibration and demonstrates the potential of VAEs for information compression in images. Beyond microscopy, the VAE-EM framework applies to inverse problems where simulated training data introduces a reality gap and where non-injective mappings would otherwise prevent unique solutions.

49. 【2603.16548】SAMSEM -- A Generic and Scalable Approach for IC Metal Line Segmentation

Link: https://arxiv.org/abs/2603.16548

Authors: Christian Gehrmann,Jonas Ricker,Simon Damm,Deruo Cheng,Julian Speith,Yiqiong Shi,Asja Fischer,Christof Paar

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: hardware supply chains, globalized hardware supply, gained significant interest, globalized hardware, hardware supply

Comments:

Abstract:In light of globalized hardware supply chains, the assurance of hardware components has gained significant interest, particularly in cryptographic applications and high-stakes scenarios. Identifying metal lines on scanning electron microscope (SEM) images of integrated circuits (ICs) is one essential step in verifying the absence of malicious circuitry in chips manufactured in untrusted environments. Due to varying manufacturing processes and technologies, such verification usually requires tuning parameters and algorithms for each target IC. Often, a machine learning model trained on images of one IC fails to accurately detect metal lines on other ICs. To address this challenge, we create SAMSEM by adapting Meta's Segment Anything Model 2 (SAM2) to the domain of IC metal line segmentation. Specifically, we develop a multi-scale segmentation approach that can handle SEM images of varying sizes, resolutions, and magnifications. Furthermore, we deploy a topology-based loss alongside pixel-based losses to focus our segmentation on electrical connectivity rather than pixel-level accuracy. Based on a hyperparameter optimization, we then fine-tune the SAM2 model to obtain a model that generalizes across different technology nodes, manufacturing materials, sample preparation methods, and SEM imaging technologies. To this end, we leverage an unprecedented dataset of SEM images obtained from 48 metal layers across 14 different ICs. When fine-tuned on seven ICs, SAMSEM achieves an error rate as low as 0.72% when evaluated on other images from the same ICs. For the remaining seven unseen ICs, it still achieves error rates as low as 5.53%. Finally, when fine-tuned on all 14 ICs, we observe an error rate of 0.62%. Hence, SAMSEM proves to be a reliable tool that significantly advances the frontier in metal line segmentation, a key challenge in post-manufacturing IC verification.

50. 【2603.16538】Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty

Link: https://arxiv.org/abs/2603.16538

Authors: Mangyu Kong,Jaewon Lee,Seongwon Lee,Euntai Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, powerful scene representation, recently emerged, powerful scene, scene representation

Comments: 17 pages, 11 figures, CVPR 2026

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.

51. 【2603.16524】An approximate graph elicits detonation lattice

Link: https://arxiv.org/abs/2603.16524

Authors: Vansh Sharma,Venkat Raman

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)

Keywords: edge detection methods, detection methods prevalent, termed detonation lattices, pressure traces, addressing the limitations

Comments:

Click to view abstract

Abstract:This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonations research. First, the efficacy of segmentation on generated data is shown with a prediction error of 2%. Next, 3D simulation data is used to establish performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.

52. 【2603.16506】VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

Link: https://arxiv.org/abs/2603.16506

Authors: Fucai Ke,Zhixi Cai,Boying Li,Long Chen,Beibei Lin,Weiqing Wang,Pari Delir Haghighi,Gholamreza Haffari,Hamid Rezatofighi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understand complex environments, temporally dense video, dense video settings, discrete viewpoints, essential for intelligent

Comments:

Click to view abstract

Abstract:Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.

53. 【2603.16489】Unlearning for One-Step Generative Models via Unbalanced Optimal Transport

Link: https://arxiv.org/abs/2603.16489

Authors: Hyundo Choi,Junhyeong An,Jinseong Park,Jaewoong Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: single forward pass, Recent advances, flow map models, learning direct, forward pass

Comments: 27 pages, 10 figures

Click to view abstract

Abstract:Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an $f$-divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.
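
For orientation, the trade-off described in this abstract has the shape of the generic unbalanced optimal transport problem, sketched below in its textbook form. This is not necessarily the exact objective used by UOT-Unlearn. All symbols are assumptions introduced for illustration: $c$ is a transport cost (the role played by the forget cost), $\mu$ and $\nu$ are the source and target distributions, $\pi_1$ and $\pi_2$ are the marginals of the coupling $\pi$, $D_f$ is an $f$-divergence, and $\tau_1, \tau_2 > 0$ control how strongly the marginal constraints are relaxed.

```latex
\inf_{\pi \ge 0} \; \int c(x, y)\, \mathrm{d}\pi(x, y)
  \;+\; \tau_1\, D_f(\pi_1 \,\|\, \mu)
  \;+\; \tau_2\, D_f(\pi_2 \,\|\, \nu)
```

As $\tau_1, \tau_2 \to \infty$ the penalties force the marginals to match exactly and the problem reduces to classical optimal transport. Finite values let mass move away from high-cost regions, which fits the abstract's description of mass being redistributed to the remaining classes.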

54. 【2603.16482】DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2603.16482

Authors: Yicui Shi,Yuhan Chen,Xiangfei Huang,Zhenguo Wang,Wenxuan Yu,Ying Fang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: inherent signal degradations, structural corruption, aims to restore, restore the visibility, sensors in dim

Comments:

Click to view abstract

Abstract:Low-light image enhancement aims to restore the visibility of images captured by visual sensors in dim environments by addressing their inherent signal degradations, such as luminance attenuation and structural corruption. Although numerous algorithms attempt to improve image quality, existing methods often cause a severe loss of intrinsic signal priors. To overcome these challenges, we propose a Dual-Stream Transformer Network (DST-Net) based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions. First, to address the loss of critical signal features under low-light conditions, we design a feature extraction module. This module integrates Difference of Gaussians (DoG), LAB color space transformations, and VGG-16 for texture extraction, utilizing decoupled illumination-agnostic features as signal priors to continuously guide the enhancement process. Second, we construct a dual-stream interaction architecture. By employing a cross-modal attention mechanism, the network leverages the extracted priors to dynamically rectify the deteriorated signal representation of the enhanced image, ultimately achieving iterative enhancement through differentiable curve estimation. Furthermore, to overcome the inability of existing methods to preserve fine structures and textures, we propose a Multi-Scale Spatial Fusion Block (MSFB) featuring pseudo-3D and 3D gradient operator convolutions. This module integrates explicit gradient operators to recover high-frequency edges while capturing inter-channel spatial correlations via multi-scale spatial convolutions. Extensive evaluations and ablation studies demonstrate that DST-Net achieves superior performance in subjective visual quality and objective metrics. Specifically, our method achieves a PSNR of 25.64 dB on the LOL dataset. Subsequent validation on the LSRW dataset further confirms its robust cross-scene generalization.

55. 【2603.16461】GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.16461

Authors: Jiaxin Zhang,Junjun Jiang,Haijie Li,Youyu Chen,Kui Jiang,Dave Zhenyu Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, pure RGB inputs, Multimodal Large, Large Language

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLM to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently improves performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.

56. 【2603.16455】Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Link: https://arxiv.org/abs/2603.16455

Authors: Weiqing Li,Jinyue Guo,Yaqi Wang,Haiyang Xiao,Yuewei Zhang,Guohua Liu,Hao Henry Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world document heterogeneity, Visual-language models, excel at data, data mappings, real-world document

Comments: Accepted by CVPR2026

Click to view abstract

Abstract:Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
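
For readers unfamiliar with the nDCG@5 metric quoted above, here is a minimal sketch of how it is commonly computed. It uses the linear-gain variant and assumes the ranked list contains every judged document for the query. The relevance labels are hypothetical and are not taken from ViDoRe V2 or MMEB.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of the documents returned for one query, in rank order.
ranked_relevances = [1, 0, 1, 0, 0, 1]
score = ndcg_at_k(ranked_relevances, k=5)
```

A benchmark score such as the ones reported is typically the mean of this per-query value over all queries.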

57. 【2603.16451】TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection

Link: https://arxiv.org/abs/2603.16451

Authors: Pietro Bonazzi,Rafael Sutter,Luigi Capogrosso,Mischa Buob,Michele Magno

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: labeled faulty samples, industrial quality control, Anomaly detection plays, quality control, faulty samples

Comments:

Click to view abstract

Abstract:Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony's Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
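
The profiling figures in this abstract can be combined with simple arithmetic. The sketch below assumes that all three numbers describe the same operating point and that the energy figure covers one full inference. Neither assumption is confirmed by the abstract.

```python
# Figures quoted in the abstract above.
energy_per_inference_j = 4.0e-3   # 4.0 mJ per inference
frames_per_second = 20.0          # 20 FPS
efficiency_gmac_per_j = 470.0     # 470 GMAC/J

# Average power if the sensor runs continuously at the quoted frame rate.
average_power_w = energy_per_inference_j * frames_per_second

# Compute per inference implied by the efficiency figure.
gmac_per_inference = efficiency_gmac_per_j * energy_per_inference_j

# Time budget per frame at the quoted frame rate.
frame_budget_ms = 1000.0 / frames_per_second
```

Under these assumptions the numbers imply about 80 mW of average power, roughly 1.88 GMAC per inference, and a 50 ms budget per frame.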

Submission history: from Pietro Bonazzi, [v1] Tue, 17 Mar 2026 12:31:34 UTC (5,900 KB)

58. 【2603.16447】ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars

Link: https://arxiv.org/abs/2603.16447

Authors: Kaiwen Song,Jinkai Cui,Juyong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: resources fluctuate frequently, computing resources fluctuate, telepresence applications, fluctuate frequently, practical real-time

Comments: Accepted to CVPR 2026, Project page: [this https URL](https://ustc3dv.github.io/ProgressiveAvatars/)

Click to view abstract

Abstract:In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.
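
The incremental loading idea can be illustrated with a toy sketch: order primitives by importance, then load the longest prefix that fits the current budget, so a larger budget only ever adds to what was already loaded. The `Gaussian` record, its `importance` score, and the byte sizes are hypothetical. The abstract does not specify how importance is computed or how data is packed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gaussian:
    gid: int
    importance: float   # hypothetical per-Gaussian importance score
    size_bytes: int     # hypothetical payload size

def transmission_order(gaussians):
    """Most important Gaussians first; ties broken by id for determinism."""
    return sorted(gaussians, key=lambda g: (-g.importance, g.gid))

def loaded_prefix(ordered, budget_bytes):
    """Return the longest prefix of the ordered stream that fits the byte budget."""
    loaded, used = [], 0
    for g in ordered:
        if used + g.size_bytes > budget_bytes:
            break
        loaded.append(g)
        used += g.size_bytes
    return loaded

gaussians = [
    Gaussian(0, 0.9, 64),
    Gaussian(1, 0.2, 64),
    Gaussian(2, 0.7, 64),
    Gaussian(3, 0.4, 64),
]
stream = transmission_order(gaussians)
low = loaded_prefix(stream, budget_bytes=128)    # constrained bandwidth
high = loaded_prefix(stream, budget_bytes=256)   # more bandwidth available
```

Because both results are prefixes of the same ordered stream, the higher-budget set always contains the lower-budget set, which is the property that lets a renderer keep previous content while new Gaussians arrive.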

59. 【2603.16446】Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline

Link: https://arxiv.org/abs/2603.16446

Authors: Xingyu Liu,Zewei He,Yu Chen,Chunyu Zhu,Zixuan Chen,Xing Luo,Zhe-Ming Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reflections frequently co-occur, rainy days, glass surfaces, surfaces or windshields, windshields on rainy

Comments:

Click to view abstract

Abstract:When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur and significantly reduce the visibility of captured images. This practical problem has received little attention and urgently needs to be resolved. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we formally define the unified removal of raindrops and reflections (UR$^3$) task for the first time and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR$^3$) with several targeted designs to address this challenging task. By leveraging the powerful generative prior, DiffUR$^3$ successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the code will be made public upon acceptance.

60. 【2603.16444】Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

Link: https://arxiv.org/abs/2603.16444

Authors: Hunain Ahmed Jillani,Ahmed Tawfik Aboukhadra,Ahmed Elhayek,Jameel Malik,Nadia Robertini,Didier Stricker

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fast and accurate, human-computer interaction, essential for real-time, hand reconstruction, Fast

Comments:

Click to view abstract

Abstract:Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that lightweight backbones only 35% of the size of the original achieve 1.5x faster inference while preserving similar performance, with a minimal accuracy difference of only 0.4 mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under this https URL.
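
The three distillation strategies named in this abstract can be sketched as loss functions. This is a generic illustration using mean squared error, not the losses used in Fast-HaMeR. The tensor shapes, the linear `projection` that maps student features to the teacher's width, and the `weight` parameter are all assumptions.

```python
import numpy as np

def output_level_loss(student_out, teacher_out):
    """Match the student's final predictions to the teacher's."""
    return float(np.mean((student_out - teacher_out) ** 2))

def feature_level_loss(student_feat, teacher_feat, projection):
    """Match intermediate features after projecting the student to the teacher's width."""
    projected = student_feat @ projection
    return float(np.mean((projected - teacher_feat) ** 2))

def hybrid_loss(student_out, teacher_out, student_feat, teacher_feat,
                projection, weight=0.5):
    """Weighted sum of the output-level and feature-level terms."""
    return (weight * output_level_loss(student_out, teacher_out)
            + (1.0 - weight) * feature_level_loss(student_feat, teacher_feat, projection))

# Hypothetical shapes: a batch of 4 samples, 21 3D keypoints, 64-d teacher
# features and 16-d student features.
rng = np.random.default_rng(0)
teacher_out = rng.standard_normal((4, 21, 3))
student_out = teacher_out + 0.1 * rng.standard_normal((4, 21, 3))
teacher_feat = rng.standard_normal((4, 64))
student_feat = rng.standard_normal((4, 16))
projection = rng.standard_normal((16, 64))

loss = hybrid_loss(student_out, teacher_out, student_feat, teacher_feat, projection)
```

In practice the distillation term is usually added to the ordinary task loss, and the projection is trained jointly with the student.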

61. 【2603.16439】CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection

Link: https://arxiv.org/abs/2603.16439

Authors: Junseok Lee,Sungho Shin,Seongju Lee,Kyoobin Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Single-domain generalization, single source domain, unseen target domains, Feature Knowledge Distillation, source domain

Comments: Accepted to ICRA 2026

Click to view abstract

Abstract:Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.

62. 【2603.16432】IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video

Link: https://arxiv.org/abs/2603.16432

Authors: Rasul Khanbayov,Mohamed Rayan Barhdadi,Erchin Serpedin,Hasan Kurban

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Unsupervised physical parameter, existing methods evaluate, non-overlapping synthetic data, physical parameter estimation, addresses governing-equation identification

Comments:

Click to view abstract

Abstract:Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60 fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.

63. 【2603.16427】Cross-modal learning for plankton recognition

Link: https://arxiv.org/abs/2603.16427

Authors: Joona Kareinen,Veikka Immonen,Tuomas Eerola,Lumi Haraguchi,Lasse Lensu,Kaisa Kraft,Sanna Suikkanen,Heikki Kälviäinen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strategy enabling utilization, self-supervised cross-modal coordination, plankton, cross-modal coordination, strategy enabling

Comments:

Click to view abstract

Abstract:This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at this https URL.
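
The two ingredients described in this abstract, a CLIP-style pairing objective and a gallery-based $k$-NN classifier, can be sketched as follows. This is a generic symmetric contrastive loss over precomputed embeddings, not the authors' training code. The temperature value and the function names are assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, profile_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each input comes from the same particle."""
    logits = l2_normalize(image_emb) @ l2_normalize(profile_emb).T / temperature
    idx = np.arange(len(logits))
    image_to_profile = -log_softmax(logits, axis=1)[idx, idx].mean()
    profile_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((image_to_profile + profile_to_image) / 2)

def knn_predict(query_emb, gallery_emb, gallery_labels, k=3):
    """Majority vote over the k most similar labeled gallery embeddings."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    nearest = np.argsort(-sims, axis=1)[:, :k]
    votes = gallery_labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])
```

The loss only needs to know which image and profile belong together, which matches the binary same-particle supervision described above. Class labels enter only through the small gallery used at recognition time.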

64. 【2603.16426】3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2603.16426

Authors: Muhammad Ahmad

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Hyperspectral image classification, rich spatial-spectral correlations, exploit rich spatial-spectral, Fourier Transform, deep learning methods

Comments:

Click to view abstract

Abstract:Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spatial-Spectral Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.
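
The three transforms listed in this abstract map directly onto standard FFT calls. The sketch below uses NumPy on a hypothetical patch laid out as (bands, height, width). The axis layout and the `global_filter` helper are assumptions. In a GFNet-style block the frequency-domain filter would be a learned parameter, and here it is simply passed in.

```python
import numpy as np

# Hypothetical hyperspectral patch with layout (bands, height, width).
rng = np.random.default_rng(0)
cube = rng.standard_normal((16, 8, 8))

spectral_fft = np.fft.fft(cube, axis=0)         # 1D FFT along the spectral axis
spatial_fft = np.fft.fft2(cube, axes=(1, 2))    # 2D FFT over the spatial dimensions
joint_fft = np.fft.fftn(cube, axes=(0, 1, 2))   # 3D FFT over spectral and spatial axes

def global_filter(x, weights):
    """Transform, multiply elementwise by a frequency-domain filter, transform back.

    Taking the real part is a simplification. A learned complex filter would
    normally be paired with a real-input FFT so that the output stays real.
    """
    return np.fft.ifftn(np.fft.fftn(x) * weights).real

identity = global_filter(cube, np.ones(cube.shape))
```

Because the multidimensional FFT is separable, applying the spectral 1D FFT to the spatial 2D result reproduces the joint 3D transform. Each frequency coefficient depends on every input element, which is why a single elementwise multiplication acts as a global operation.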

65. 【2603.16423】SF-Mamba: Rethinking State Space Model for Vision

Link: https://arxiv.org/abs/2603.16423

Authors: Masakazu Yoshimura,Teruaki Hayashi,Yuki Hoshino,Wei-Yao Wang,Takeshi Ohashi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision Transformers, quadratic complexity, recent years, years to strike, Mamba

Comments: 21 pages

Click to view abstract

Abstract:Mamba-based vision models have advanced in recent years as alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under a unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.
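
One plausible reading of "batch folding with periodic state reset" is sketched below with a toy scalar recurrence: several short sequences are concatenated into one long sequence, and the hidden state is reset at each sequence boundary so that nothing leaks between samples. The abstract does not describe the mechanism in detail, so this is an assumption. A real Mamba layer is a selective state space model with input-dependent parameters and a fused GPU kernel, neither of which is modeled here.

```python
def scan(xs, a=0.9, b=1.0, reset_every=None):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t with an optional periodic reset."""
    h, out = 0.0, []
    for t, x in enumerate(xs):
        if reset_every is not None and t % reset_every == 0:
            h = 0.0  # start of a new folded sequence: drop the carried state
        h = a * h + b * x
        out.append(h)
    return out

batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
seq_len = len(batch[0])

# Reference: scan every sequence of the batch separately.
per_sequence = [scan(seq) for seq in batch]

# Folded: one long scan over the concatenated batch, resetting every seq_len steps.
folded = [x for seq in batch for x in seq]
folded_out = scan(folded, reset_every=seq_len)
unfolded = [folded_out[i * seq_len:(i + 1) * seq_len] for i in range(len(batch))]
```

With the reset in place the folded scan reproduces the per-sequence results exactly. Without it, the state carried over from the first sequence would contaminate the second.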

66. 【2603.16421】HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction

Link: https://arxiv.org/abs/2603.16421

Authors: Jing Dai,Chen Wu,Ming Wu,Qibin Zhang,Zexi Wu,Jingdong Zhang,Hongming Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, survival risk prediction, significantly improved cancer, risk prediction, learning have significantly

Comments: Accepted at IEEE ICME 2026. This arXiv version includes additional supplementary experiments and extended discussions beyond the conference version

Click to view abstract

Abstract:Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological features with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capturing complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at this https URL.

67. 【2603.16404】Near-light Photometric Stereo with Symmetric Lights

Link: https://arxiv.org/abs/2603.16404

Authors: Lilika Makabe,Heng Guo,Hiroaki Santo,Fumio Okura,Yasuyuki Matsushita

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: light source arrangements, exploiting symmetric light, linear solution method, paper describes, describes a linear

Comments:

Abstract:This paper describes a linear solution method for near-light photometric stereo by exploiting symmetric light source arrangements. Unlike conventional non-convex optimization approaches, by arranging multiple sets of symmetric nearby light source pairs, our method derives a closed-form solution for surface normal and depth without requiring initialization. In addition, our method works as long as the light sources are symmetrically distributed about an arbitrary point even when the entire spatial offset is uncalibrated. Experiments showcase the shape recovery accuracy of our method, achieving comparable results to the state-of-the-art calibrated near-light photometric stereo method while significantly reducing requirements of careful depth initialization and light calibration.

68. 【2603.16392】DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Link: https://arxiv.org/abs/2603.16392

Authors: Stathis Galanakis,Alexandros Koliousis,Stefanos Zafeiriou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems remain constrained, deep generative modeling, reduced generalization performance, classification systems remain, well-annotated clinical datasets

Comments:

Abstract:Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.
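The Low-Rank Adaptation (LoRA) step mentioned in the abstract can be illustrated with a minimal sketch. This is not the DermaFlux training code; it only shows the general LoRA idea that a frozen weight matrix W is augmented by a trainable low-rank product, W + (alpha / r) * B A. The helper names and the toy shapes are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def count_params(d_out, d_in, r):
    """Trainable parameter counts as (full fine-tuning, LoRA adapter)."""
    return d_out * d_in, r * (d_in + d_out)


if __name__ == "__main__":
    full, lora = count_params(1024, 1024, 8)
    print(f"full: {full} trainable weights, LoRA (r=8): {lora}")
```

For a 1024 x 1024 layer with rank 8, the adapter trains 16,384 values instead of 1,048,576, which is why the approach is called parameter-efficient.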

69. 【2603.16385】Unpaired Cross-Domain Calibration of DMSP to VIIRS Nighttime Light Data Based on CUT Network

Link: https://arxiv.org/abs/2603.16385

Authors: Zhan Tong,ChenXu Zhou,Fei Tang,Yiming Tu,Tianyu Qin,Kaihao Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Defense Meteorological Satellite, Meteorological Satellite Program, National Polar-orbiting Partnership, Suomi National Polar-orbiting, Defense Meteorological

Comments: 16 pages, 10 figures, 8 tables. Submitted to Remote Sensing of Environment. Code and data available at: [this https URL](https://github.com/) [your-repo-link]

Abstract:Defense Meteorological Satellite Program (DMSP-OLS) and Suomi National Polar-orbiting Partnership (SNPP-VIIRS) nighttime light (NTL) data are vital for monitoring urbanization, yet sensor incompatibilities hinder long-term analysis. This study proposes a cross-sensor calibration method using a Contrastive Unpaired Translation (CUT) network to transform DMSP data into a VIIRS-like format, correcting DMSP defects. The method employs multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches, preserving content consistency while learning cross-domain similarity. Utilizing 2012-2013 overlapping data for training, the network processes 1992-2013 DMSP imagery to generate enhanced VIIRS-style raster data. Validation results demonstrate that the generated VIIRS-like data exhibit high consistency with actual VIIRS observations (R-squared greater than 0.87) and socioeconomic indicators. This approach effectively resolves cross-sensor data fusion issues and calibrates DMSP defects, providing a reliable basis for extended NTL time series.
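The patch-wise contrastive objective described above can be sketched as an InfoNCE loss over a single query patch. This is a simplified illustration rather than the authors' implementation: the feature vectors are plain lists, the function names are hypothetical, and the temperature of 0.07 is an assumed value. The query is a feature of an output patch, the positive is the input patch at the same location, and the negatives are other patches from the same input image.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def patch_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query patch feature.

    The loss is the cross-entropy of picking the positive among
    (1 + len(negatives)) candidates, using cosine similarity as the logit.
    """
    q = normalize(query)
    candidates = [positive] + list(negatives)
    logits = [dot(q, normalize(k)) / temperature for k in candidates]
    # Numerically stable log-sum-exp; the positive sits at index 0.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    aligned = patch_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    misaligned = patch_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
    print(aligned, misaligned)
```

Minimizing this loss pulls corresponding patches together and pushes non-corresponding patches apart, which is one way to encourage content consistency between the DMSP input and the VIIRS-like output.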

70. 【2603.16373】Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Link: https://arxiv.org/abs/2603.16373

Authors: Yunpeng Qu,Kaidong Zhang,Yukang Ding,Ying Chen,Jian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved great success, generative models based, great success, underscoring the significance, Visual generative models

Comments: 18 pages, 12 figures

Abstract:Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose SemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.

71. 【2603.16372】InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Link: https://arxiv.org/abs/2603.16372

Authors: Zhisong Wang,Ziyang Chen,Zanting Ye,Hongze Zhu,Yefeng Zheng,Yong Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinically relevant questions, relevant questions grounded, visual question answering, answer clinically relevant, Medical visual question

Comments: 10 pages, 2 figures

Abstract:Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues combined with bottlenecked training are a practical and effective strategy for more trustworthy Med-VQA.
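The two-stage masking idea can be sketched as follows. The abstract does not give the exact token layout or mask definition, so this sketch assumes a sequence ordered as [visual tokens | cue tokens | text tokens] and assumes that in Stage I every cue and text position is blocked from attending to raw visual positions. The function name and layout are hypothetical.

```python
def build_attention_mask(num_visual, num_cue, num_text, stage):
    """Return a boolean matrix where mask[i][j] is True if query i may attend to key j.

    Assumed layout: [visual | cue | text].
    Stage 1: causal mask, plus cue/text queries cannot see raw visual keys.
    Stage 2: plain causal mask.
    """
    n = num_visual + num_cue + num_text
    # Standard causal mask: each position sees itself and earlier positions.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    if stage == 1:
        # Bottleneck: hide raw visual tokens from everything after them,
        # so visual evidence can only reach the text through the cue tokens.
        for i in range(num_visual, n):
            for j in range(num_visual):
                mask[i][j] = False
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(2, 1, 2, stage=1):
        print("".join("x" if allowed else "." for allowed in row))
```

In an actual model such a boolean mask would typically be converted to additive attention biases (0 for allowed pairs, a large negative value for blocked pairs) before the softmax.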

72. 【2603.16363】Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions

Link: https://arxiv.org/abs/2603.16363

Authors: Yiqiang Zhou,Yifan Chen,Zhe Sun,Jijun Lu,Ye Zheng,Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: providing reliable visual, reliable visual information, water-related environments generally, environments generally lead, image quality degradation

Comments:

Abstract:Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.
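Structural re-parameterization of dilated convolutions, as mentioned in the abstract, rests on linearity: parallel convolution branches whose outputs are summed can be merged into one dense kernel. The sketch below shows this in 1D with plain Python lists. It is not the paper's module; batch normalization, channels, and biases are omitted, and all function names are hypothetical.

```python
def conv1d_same(signal, kernel, dilation=1):
    """Cross-correlate with zero padding so the output length equals the input length."""
    half = dilation * (len(kernel) // 2)
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + t * dilation] for t, k in enumerate(kernel))
            for i in range(len(signal))]


def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between taps to get an equivalent dense kernel."""
    dense = []
    for idx, k in enumerate(kernel):
        dense.append(k)
        if idx < len(kernel) - 1:
            dense.extend([0.0] * (dilation - 1))
    return dense


def pad_kernel(kernel, size):
    """Zero-pad an odd-length kernel symmetrically to the given odd size."""
    extra = (size - len(kernel)) // 2
    return [0.0] * extra + list(kernel) + [0.0] * extra


def merge_branches(branches):
    """Merge (kernel, dilation) branches into a single dense kernel."""
    dense = [dilate_kernel(k, d) for k, d in branches]
    size = max(len(k) for k in dense)
    padded = [pad_kernel(k, size) for k in dense]
    return [sum(taps) for taps in zip(*padded)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    branches = [([1.0, 0.0, -1.0], 1), ([0.5, 1.0, 0.5], 2)]
    print(merge_branches(branches))
    print(conv1d_same(signal, merge_branches(branches)))
```

At training time each branch runs separately and the outputs are summed; at inference time only the merged kernel is applied, which gives the same output with a single convolution.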

73. 【2603.16362】$D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Link: https://arxiv.org/abs/2603.16362

Authors: Ruizhi Wang,Weihan Li,Zunlei Feng,Haofei Zhang,Mingli Song,Jiayu Wang,Jie Song,Li Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: existing methods face, monocular depth estimation, high-fidelity monocular depth, remote sensing imagery, Remote Sensing Monocular

Comments:

Abstract:Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although Vision Transformer (ViT) backbones for dense prediction are fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.

74. 【2603.16351】Automated identification of Ichneumonoidea wasps via YOLO-based deep learning: Integrating HiresCam for Explainable AI

Link: https://arxiv.org/abs/2603.16351

Authors: Joao Manoel Herrera Pinheiro,Gabriela Do Nascimento Herrera,Alvaro Doria Dos Santos,Luciana Bueno Dos Reis Fernandes,Ricardo V. Godoy,Eduardo A. B. Almeida,Helena Carolina Onody,Marcelo Andrade Da Costa Vieira,Angelica Maria Penteado-Dias,Marcelo Becker

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Accurate taxonomic identification, biological control programs, ecological monitoring, Class Activation Mapping, control programs

Comments: 14 pages, 20 figures

Abstract:Accurate taxonomic identification of parasitoid wasps within the superfamily Ichneumonoidea is essential for biodiversity assessment, ecological monitoring, and biological control programs. However, morphological similarity, small body size, and fine-grained interspecific variation make manual identification labor-intensive and expertise-dependent. This study proposes a deep learning-based framework for the automated identification of Ichneumonoidea wasps using a YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) to enhance interpretability. The proposed system simultaneously identifies wasp families from high-resolution images. The dataset comprises 3556 high-resolution images of Hymenoptera specimens. The taxonomic distribution is primarily concentrated among the families Ichneumonidae (n = 786), Braconidae (n = 648), Apidae (n = 466), and Vespidae (n = 460). Extensive experiments were conducted using a curated dataset, with model performance evaluated through precision, recall, F1 score, and accuracy. The results demonstrate high accuracy of over 96% and robust generalization across morphological variations. HiResCAM visualizations confirm that the model focuses on taxonomically relevant anatomical regions, such as wing venation, antennal segmentation, and metasomal structures, thereby validating the biological plausibility of the learned features. The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.

75. 【2603.16343】Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds

Link: https://arxiv.org/abs/2603.16343

Authors: Daniel Sungho Jung,Dohee Cho,Kyoung Mu Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving due, Understanding humans, diverse human-object interactions, human-object interactions, human-object interaction

Comments: Project page: [this https URL](https://hoil-release.github.io/)

Abstract:Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationships with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Nevertheless, existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. There are two major challenges that motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there exists severe class imbalance in the number of points between interacting and non-interacting body parts, with the interaction-frequent regions such as hand and foot being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity issue, we present human-object interaction-aware contrastive learning (HOICL) that effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance issue, we introduce contact-aware part-guided pooling (CPPool) that adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that refines erroneous per-frame keypoint estimates using contact cues over time. As a result, our HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Codes will be released.

76. 【2603.16341】PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection

Link: https://arxiv.org/abs/2603.16341

Authors: Xinhao Cai,Liulei Li,Gensheng Pei,Zeren Sun,Yazhou Yao,Wenguan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse aspect ratios, aspect ratios, isotropic large kernels, diverse aspect, spanning a wide

Comments:

Abstract:Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios, while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet, and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely-used benchmarks, including DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $3.9\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.
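Fusing heterogeneous branches into one depth-wise kernel can be illustrated for a single channel: a horizontal strip kernel, a vertical strip kernel, and a small square kernel are each centered in a common square grid and summed. This is a generic sketch of kernel fusion by linearity, not the HKR strategy as implemented by the authors; normalization layers and biases are ignored, and all names and kernel values are made up for illustration.

```python
def embed_kernel(kernel, size):
    """Center an (h x w) kernel with odd h and w inside a size x size grid of zeros."""
    h, w = len(kernel), len(kernel[0])
    top, left = (size - h) // 2, (size - w) // 2
    out = [[0.0] * size for _ in range(size)]
    for r in range(h):
        for c in range(w):
            out[top + r][left + c] = kernel[r][c]
    return out


def fuse_kernels(kernels, size):
    """Sum several centered kernels into one size x size kernel."""
    fused = [[0.0] * size for _ in range(size)]
    for kernel in kernels:
        embedded = embed_kernel(kernel, size)
        for r in range(size):
            for c in range(size):
                fused[r][c] += embedded[r][c]
    return fused


def conv2d_same(image, kernel):
    """Single-channel cross-correlation with zero padding (odd kernel dimensions)."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for r in range(kh):
                for c in range(kw):
                    y, x = i + r - ph, j + c - pw
                    if 0 <= y < rows and 0 <= x < cols:
                        acc += kernel[r][c] * image[y][x]
            out[i][j] = acc
    return out


if __name__ == "__main__":
    horizontal = [[1.0, 2.0, 3.0, 2.0, 1.0]]           # 1 x 5 strip
    vertical = [[1.0], [0.0], [-1.0], [0.0], [1.0]]    # 5 x 1 strip
    square = [[0.5] * 3 for _ in range(3)]             # 3 x 3 square
    for row in fuse_kernels([horizontal, vertical, square], 5):
        print(row)
```

Applying the fused 5 x 5 kernel once gives the same result as running the three branches separately and adding their outputs, which is what removes the extra kernel launches at inference time.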

77. 【2603.16340】Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Link: https://arxiv.org/abs/2603.16340

Authors: Xinhao Cai,Gensheng Pei,Zeren Sun,Yazhou Yao,Fumin Shen,Wenguan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular Depth Estimation, Depth Estimation, Monocular Depth, framework for Monocular, integrates real-world priors

Comments: Accepted by CVPR 2026

Abstract:In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.

78. 【2603.16338】SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks

Link: https://arxiv.org/abs/2603.16338

Authors: Maxime Vaillant,Axel Carlier,Lai Xing Ng,Christophe Hurter,Benoit R. Cottereau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dynamic range, low power consumption, vision sensors provide, sensors provide significant, provide significant advantages

Comments: 17 pages, 4 figures

Abstract:Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.
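The abstract mentions event-specific augmentations built from spatial, temporal, and polarity transformations. The sketch below shows three commonly used transformations of this kind and how two augmented views of one stream could be produced for contrastive pretraining. The exact augmentation suite of SpikeCLR is not given in the abstract, so the event format (x, y, timestamp, polarity in {-1, +1}), the function names, and the 0.5 application probability are all assumptions.

```python
import random


def flip_horizontal(events, width):
    """Spatial augmentation: mirror the x coordinate."""
    return [(width - 1 - x, y, t, p) for x, y, t, p in events]


def flip_polarity(events):
    """Polarity augmentation: swap ON and OFF events."""
    return [(x, y, t, -p) for x, y, t, p in events]


def reverse_time(events, duration):
    """Temporal augmentation: play the stream backwards.

    A brightness increase seen backwards is a decrease, so polarity flips too.
    """
    reversed_events = [(x, y, duration - t, -p) for x, y, t, p in events]
    return sorted(reversed_events, key=lambda e: e[2])


def make_views(events, width, duration, seed=0):
    """Return two independently augmented views of the same event stream."""
    rng = random.Random(seed)
    views = []
    for _ in range(2):
        view = list(events)
        if rng.random() < 0.5:
            view = flip_horizontal(view, width)
        if rng.random() < 0.5:
            view = flip_polarity(view)
        if rng.random() < 0.5:
            view = reverse_time(view, duration)
        views.append(view)
    return views


if __name__ == "__main__":
    stream = [(0, 1, 0.0, 1), (3, 2, 0.25, -1), (1, 1, 0.5, 1)]
    view_a, view_b = make_views(stream, width=4, duration=1.0, seed=42)
    print(view_a)
    print(view_b)
```

The two views would then be encoded by the network and pulled together by a contrastive loss, while views from different streams are pushed apart.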

79. 【2603.16330】An Interpretable Machine Learning Framework for Non-Small Cell Lung Cancer Drug Response Analysis

Link: https://arxiv.org/abs/2603.16330

Authors: Ann Rachel,Pranav M Pawar,Mithun Mukharjee,Raja M,Tojo Mathew

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Lung cancer, abnormal growth, growth of malignant, malignant cells, cells that spread

Comments: 26 pages, 8 figures

Abstract:Lung cancer is a condition in which malignant cells grow abnormally and spread uncontrollably in the lungs. Common treatment strategies include surgery, chemotherapy, and radiation, which aren't the best options due to the heterogeneous nature of cancer. In personalized medicine, treatments are tailored according to the individual's genetic information along with lifestyle aspects. In addition, AI-based deep learning methods can analyze large sets of data to find early signs of cancer, types of tumor, and prospects of treatment. The paper focuses on the development of personalized treatment plans using specific patient data, primarily the genetic profile. Multi-omics data from Genomics of Drug Sensitivity in Cancer have been used to build a predictive model with machine learning techniques. The value of the target variable, LN-IC50, indicates how sensitive or resistant a cell line is to a drug. An XGBoost regressor is utilized to predict the drug response from molecular and cellular features extracted from cancer datasets. Cross-validation and Randomized Search are performed for hyperparameter tuning to further optimize the model's predictive performance. For explanation purposes, SHAP (SHapley Additive exPlanations) was used. SHAP values measure each feature's impact on an individual prediction. Furthermore, feature relationships were interpreted using DeepSeek, a large language model, to verify the biological validity of the features. Contextual explanations regarding the most important genes or pathways were provided by DeepSeek alongside the top SHAP value constituents, supporting the predictability of the model.
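To make the statement "SHAP values measure each feature's impact on an individual prediction" concrete, the sketch below computes exact Shapley values by brute-force enumeration for a tiny toy model. This is not how the SHAP library handles tree models such as XGBoost (it uses much faster specialized algorithms), and the toy `predict` function, the feature values, and the zero baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline values.
    The cost grows exponentially with the number of features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(phi)
    return contributions


if __name__ == "__main__":
    # Hypothetical stand-in for a trained regressor predicting LN-IC50.
    def predict(x):
        return 2.0 * x[0] - 3.0 * x[1] + 0.5 * x[2] + 1.0

    phis = shapley_values(predict, instance=[1.0, 2.0, 4.0], baseline=[0.0, 0.0, 0.0])
    print(phis)
```

The contributions always sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that makes the per-feature values interpretable.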

80. 【2603.16306】DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

Link: https://arxiv.org/abs/2603.16306

Authors: Heyu Si,Brandon James Denis,Muyang Sun,Dragos Datcu,Yaoru Li,Xin Jin,Ruiju Fu,Yuliia Tatarinova,Federico Landi,Jie Song,Mingli Song,Qi Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, leveraging diffusion priors, shown promise, Recent, diffusion priors

Comments:

Abstract:Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

81. 【2603.16302】Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection

Link: https://arxiv.org/abs/2603.16302

Authors: Jinsheng Wei,Fengzhou Guo,Yante Li,Haoyu Chen,Guanming Lu,Guoying Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: provide objective clues, genuine emotion analysis, global dependency, action units, fine-grained genuine emotion

Comments:

Abstract:Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AUs and results in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel Micro-AU detection framework, Micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence (LSI) modeling and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; in GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP's native limitations in micro-semantic alignment, a Micro-AU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained Micro-AU features, achieving state-of-the-art performance.

82. 【2603.16289】VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Link: https://arxiv.org/abs/2603.16289

Authors: Zhengbo Zhang,Jinbo Su,Zhaowen Zhou,Changtao Miao,Yuhan Hong,Qimeng Wu,Yumeng Liu,Feier Wu,Yihe Tian,Yuhao Liang,Zitong Shan,Wanke Xia,Yi-Fan Zhang,Bo Zhang,Zhe Li,Shiming Xiang,Ying Yan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, real world

Comments:

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and neglect of the native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research, only achieves an accuracy of 41.1%. The code and data can be accessed at: this https URL

83. 【2603.16285】Persistent Story World Simulation with Continuous Character Customization

Link: https://arxiv.org/abs/2603.16285

Authors: Jinlu Zhang,Qiyun Wang,Baoxiang Du,Jiayi Ji,Jing He,Rongsheng Zhang,Tangjie Lv,Xiaoshuai Sun,Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained increasing attention, computer vision, gained increasing, increasing attention, attention in computer

Comments:

Abstract:Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within a unified LoRA module, eliminating the need for the per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or requires additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation with higher efficiency by harmonizing local character-specific details with global scene context. Experimental results show that our EverTale achieves superior performance against a wide range of compared methods on both single- and multi-character story visualization. Codes will be available.

84. 【2603.16284】Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

Link: https://arxiv.org/abs/2603.16284

Authors: TianTian Dang,Chao Bi,Shufan Shen,Jinzhe Liu,Qingming Huang,Shuhui Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Vision-Language Models, broader practical deployment, restricts broader practical, Vision-Language Models, advancements in Large

Notes: Accepted by CVPR 2026


Abstract:Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.

85. 【2603.16271】VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Link: https://arxiv.org/abs/2603.16271

Authors: Tengjiao Yin,Jinglei Shi,Heng Guo,Xi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lack explicit geometric, explicit geometric supervision, models lack explicit, spatial drift, supervision during training

Notes:


Abstract:Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.

86. 【2603.16269】FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition

Link: https://arxiv.org/abs/2603.16269

Authors: Jinsheng Wei,Zhaodi Xu,Guanming Lu,Haoyu Chen,Jingjie Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Micro-gesture recognition, subtle inter-class variations, Fine-Grained Semantic Guidance, inter-class variations, Semantic Guidance

Notes:


Abstract:Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision-language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.

87. 【2603.16261】AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection

Link: https://arxiv.org/abs/2603.16261

Authors: Hongwei Lin,Xun Huang,Chenglu Wen,Cheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous driving, crucial for autonomous, Image-guided Weather-aware Routing, Robust, adverse weather conditions

Notes:


Abstract:Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, a framework that integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection approaches. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on a real-world dataset demonstrate that AW-MoE achieves approximately 15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at this https URL.
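
The top-K expert selection mentioned in this abstract can be illustrated with a generic Mixture-of-Experts gate. The sketch below is not the authors' IWR module: the function names `top_k_route` and `combine`, the softmax gate, and the renormalization of the selected weights are all assumptions chosen to show the general technique.

```python
import math


def top_k_route(weather_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    `weather_logits` stands in for per-expert scores from a weather
    classifier (hypothetical input; the paper's actual router may differ).
    """
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(weather_logits)
    exps = [math.exp(x - m) for x in weather_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so that the selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def combine(expert_outputs, gates):
    """Weighted sum of the selected experts' (scalar) outputs."""
    return sum(gates[i] * expert_outputs[i] for i in gates)


if __name__ == "__main__":
    gates = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
    print(gates)  # experts 0 and 2 are selected
    print(combine([10.0, 0.0, 20.0, 0.0], gates))
```

Only the selected experts contribute to the output, which is why this style of routing adds little inference cost when K is small.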

88. 【2603.16257】Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection

Link: https://arxiv.org/abs/2603.16257

Authors: Weihua Gao,Wenlong Niu,Jie Tang,Man Yang,Jiafeng Zhang,Xiaodong Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared small target, methods predominantly formulate, requires costly dense, Infrared small, Adaptive Mask Generation

Notes:


Abstract:Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: this https URL.

89. 【2603.16256】When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Link: https://arxiv.org/abs/2603.16256

Authors: Xiaokun Sun,Yubo Wang,Haoyu Cao,Linli Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated significant potential

Notes:


Abstract:Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.

90. 【2603.16253】Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

Link: https://arxiv.org/abs/2603.16253

Authors: Junxin Wang,Dai Guan,Weijie Qiu,Zhihang Li,Yongbo Gai,Zhengyi Yang,Mengyu Zhou,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-language process reward, Vision-language process, process reward models, score intermediate reasoning, test-time scaling

Notes: 27 pages, 4 figures, 10 tables. Evaluated on VisualProcessBench and six multimodal reasoning benchmarks (LogicVista, MMMU, MathVerse-VO, MathVision, MathVista, WeMath). Includes ablations and causal analysis via controlled constraint corruption. Code: [this https URL](https://github.com/Qwen-Applications/EVPV-PRM)


Abstract:Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: this https URL
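
The reliability gating described in this abstract can be pictured with a toy example. The sketch below is only an illustration under strong assumptions: the abstract does not give the matching rule or the gating formula, so exact string matching of claims, the supported-fraction reliability score, and the multiplicative gate are all hypothetical, as are the names `visual_reliability` and `gate_reward`.

```python
def visual_reliability(checklist_claims, constraints):
    """Fraction of checklist claims supported by the extracted constraints.

    Assumption: claims and constraints are plain strings compared by exact
    match; a real system would need a softer matching procedure.
    """
    if not checklist_claims:
        return 1.0  # nothing visual to verify
    supported = sum(1 for claim in checklist_claims if claim in constraints)
    return supported / len(checklist_claims)


def gate_reward(step_reward, reliability, visually_dependent):
    """Attenuate the reward of a visually dependent step when reliability is low.

    Assumption: a simple multiplicative gate; steps that do not depend on
    visual premises keep their original reward.
    """
    if not visually_dependent:
        return step_reward
    return step_reward * reliability


if __name__ == "__main__":
    constraints = {"angle A is 90 degrees", "triangle ABC has 3 sides"}
    claims = ["angle A is 90 degrees", "side AB has length 5"]
    r = visual_reliability(claims, constraints)
    print(r, gate_reward(0.8, r, visually_dependent=True))
```

With a reliability of 1.0 the reward passes through unchanged, which mirrors the "preserved when reliability is high" behavior the abstract describes.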

91. 【2603.16250】Visual Prompt Discovery via Semantic Exploration

Link: https://arxiv.org/abs/2603.16250

Authors: Jaechang Kim,Yotaro Shimose,Zhao Wang,Kuang-Da Wang,Jungseul Ok,Shingo Takamatsu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: LVLMs encounter significant, Visual prompts, visual, encounter significant challenges, critical perception failures

Notes:


Abstract:LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. Although this direction is promising, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.


92. 【2603.16249】Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Link: https://arxiv.org/abs/2603.16249

Authors: Trong-Duc Nguyen,Hoang-Long Nguyen,Huy-Hieu Pham

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated white blood, white blood cell, overfit dominant classes, leading deep models, extreme class imbalance

Notes: Accepted at IEEE ISBI 2026


Abstract:Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically-inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.

93. 【2603.16245】How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

Link: https://arxiv.org/abs/2603.16245

Authors: Jiancheng Dong,Pengyue Jia,Derong Xu,Jiawei Cheng,Jingyu Peng,Chao Zhang,Bowen Liu,Xin Sun,Lixin Su,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: weakens row-column adjacency, LLMs typically linearize, typically linearize, fit their autoregressive, weakens row-column

Notes: 16 pages, 5 figures


94. 【2603.16243】RASLF: Representation-Aware State Space Model for Light Field Super-Resolution

Link: https://arxiv.org/abs/2603.16243

Authors: Zeqiang Wei,Kai Jin,Kuan Song,Xiuzhuang Zhou,Wenlong Chen,Min Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Current SSM-based light, light field super-resolution, SSM-based light field, Current SSM-based, Progressive Geometric Refinement

Notes: 10 pages, 5 figures


Abstract:Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, a Progressive Geometric Refinement (PGR) block is created that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deep-layer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.

95. 【2603.16241】Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

Link: https://arxiv.org/abs/2603.16241

Authors: Jiyang Huang,Hongru Cheng,Wei Lin,Jia Wan,Antoni B. Chan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neighbor Exclusion Circle, area of research, inexpensive to obtain, Nearest Neighbor Exclusion, prominent area

Notes:


Abstract:Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

96. 【2603.16238】PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

Link: https://arxiv.org/abs/2603.16238

Authors: Ryutaro Miya,Kazuyoshi Fushinobu,Tatsuya Kawaguchi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Contrastive Language-Image Pre-training, Monocular Depth Estimation, decoder-free Monocular Depth, decoder-free Monocular, Language-Image Pre-training

Notes: 12 pages, 4 figures


Abstract:We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: this https URL

97. 【2603.16233】Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

Link: https://arxiv.org/abs/2603.16233

Authors: Ryosuke Hori,Jyun-Ting Song,Zhengyi Luo,Jinkun Cao,Soyong Shin,Hideo Saito,Kris Kitani

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Reaction Inertial Poser, propose Ground Reaction, Ground Reaction Inertial, Ground Reaction, Inertial Poser

Notes:


Abstract:We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.

98. 【2603.16211】Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Link: https://arxiv.org/abs/2603.16211

Authors: Yiming Huang,Baixiang Huang,Beilei Cui,Chi Kit Ng,Long Bai,Hongliang Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, providing a powerful, powerful baseline, baseline for downstream, Gaussian

Notes: 26 pages, 10 figures


Abstract:Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric awareness and fail to fill the missing areas in extrapolated views. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrically consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation in the artifact areas of extrapolated novel views that are caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse generation distribution, we introduce a palette filtering strategy for training and a test-time masking refinement to prevent messy boundaries along the fixed regions. More importantly, the enhanced extrapolated novel views from Leveling3D can be used as inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.

99. 【2603.16195】S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Link: https://arxiv.org/abs/2603.16195

Authors: Haodong Yan,Zhide Zhong,Jiaguan Zhu,Junjie He,Weilin Yuan,Wenxuan Song,Xin Gong,Yingjie Cai,Guanyi Zhao,Xu Yan,Bingbing Liu,Ying-Cong Chen,Haoang Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: powerful visual foresight, robot learning, promising paradigm, paradigm for robot, powerful visual

Notes:


Abstract:Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is this https URL

100. 【2603.16189】Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

Link: https://arxiv.org/abs/2603.16189

Authors: Haomin Wang,Qi Wei,Qianli Ma,Shengyuan Ding,Jinhui Yin,Kai Chen,Hongjie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: SVG, SVG code, rapid advancement, advancement of vision-language, increasing number

Notes:


Abstract:With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.

101. 【2603.16188】ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

Link: https://arxiv.org/abs/2603.16188

Authors: Haozhe Jia,Jianfei Song,Yuan Zhang,Honglei Jin,Youcheng Fan,Wenshuo Chen,Wei Zhang,Yutao Yue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: language-driven whole-body control, framework for language-driven, language-driven whole-body, present ECHO, whole-body control

Notes:


Abstract:We present ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher-Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
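
To make the 38-dimensional per-frame representation concrete, the sketch below works through one plausible dimension breakdown. The abstract names the components but not their sizes or order, so the 2-D planar velocity, the scalar root height, the resulting 29 joint angles, the concatenation order, and the `MotionFrame` class itself are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

FRAME_DIM = 38            # stated in the abstract
ROOT_PLANAR_VEL_DIM = 2   # assumed: (vx, vy) in the ground plane
ROOT_HEIGHT_DIM = 1       # assumed: a single scalar
ROOT_ORIENT_DIM = 6       # "continuous 6D root orientation"

# Whatever remains is attributed to joint angles: 38 - (2 + 1 + 6) = 29.
NUM_JOINTS = FRAME_DIM - (ROOT_PLANAR_VEL_DIM + ROOT_HEIGHT_DIM + ROOT_ORIENT_DIM)


@dataclass
class MotionFrame:
    """Hypothetical container for one frame of the motion representation."""
    joint_angles: List[float]
    root_planar_velocity: List[float]
    root_height: float
    root_orientation_6d: List[float]

    def to_vector(self) -> List[float]:
        """Concatenate the components (order assumed) and check the size."""
        vec = (list(self.joint_angles)
               + list(self.root_planar_velocity)
               + [self.root_height]
               + list(self.root_orientation_6d))
        if len(vec) != FRAME_DIM:
            raise ValueError(f"expected {FRAME_DIM} values, got {len(vec)}")
        return vec


if __name__ == "__main__":
    frame = MotionFrame([0.0] * NUM_JOINTS, [0.1, 0.0], 0.75,
                        [1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
    print(NUM_JOINTS, len(frame.to_vector()))
```

Under these assumptions the representation leaves room for 29 joint angles per frame; a different split of the root terms would change that count.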

102. 【2603.16181】KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Link: https://arxiv.org/abs/2603.16181

Authors: Viraj Panchal,Tanmay Talsaniya,Parag Patel,Meet Patel

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Stage, two-stage multimodal content, accuracy, two-stage multimodal, multimodal content moderation

Comments: 12 pages, 2 figures, 6 tables

Click to view abstract

Abstract:We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); its outputs are routed as text, not raw pixels, to Stage 2, which applies OCR and a text-based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
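
The text-only figures can be cross-checked with a little arithmetic. With 25 true positives and no false negatives, the reported 75.76% precision is consistent with 8 false positives, since 25/33 ≈ 0.7576. The false-positive count is inferred here and is not stated in the abstract. A small sketch of that check:

```python
def precision(tp, fp):
    """Fraction of flagged images that are truly unsafe."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truly unsafe images that were flagged."""
    return tp / (tp + fn)


# Values taken from the abstract.
tp, fn = 25, 0                  # 25/25 positives recalled
subset_size = 44                # text-only subset
negatives = subset_size - (tp + fn)

# Inferred, not reported: the false-positive count that reproduces
# the stated precision when rounded to two decimals.
fp = next(
    k for k in range(negatives + 1)
    if round(100 * precision(tp, k), 2) == 75.76
)

print(f"negatives={negatives}, inferred false positives={fp}")
print(f"precision={precision(tp, fp):.4f}, recall={recall(tp, fn):.4f}")
```

Under this reading, 8 of the 19 safe text-only images were flagged. Counts this small are why the abstract cautions about the sample size.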

103. 【2603.16179】360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

Link: https://arxiv.org/abs/2603.16179

Authors: Huyen T. T. Tran,Van-Quang Nguyen,Farros Alferro,Kang-Jun Liu,Takayuki Okatani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs' capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images, seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.

104. 【2603.16166】SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments

Link: https://arxiv.org/abs/2603.16166

Authors: Jian Sun,Yuming Huang,He Li,Shuqi Xiao,Shenyan Guo,Maani Ghaffari,Qingbiao Li,Chengzhong Xu,Hui Kong

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans routinely leverage, Humans routinely, Large-Scale Indoor, routinely leverage semantic, semantic hints provided

Comments:

Click to view abstract

Abstract:Humans routinely leverage semantic hints provided by signage to navigate to destinations within novel Large-Scale Indoor (LSI) environments, such as hospitals and airport terminals. However, this capability remains underexplored within the field of embodied navigation. This paper introduces a novel embodied navigation task, SignNav, which requires the agent to interpret semantic hints from signage and reason about the subsequent action based on the current observation. To facilitate research in this domain, we construct the LSI-Dataset for the training and evaluation of various SignNav agents. Dynamically changing semantic hints and sparse placement of signage in LSI environments present significant challenges to the SignNav task. To address these challenges, we propose the Spatial-Temporal Aware Transformer (START) model for end-to-end decision-making. The spatial-aware module grounds the semantic hints of signage in the physical world, while the temporal-aware module captures long-range dependencies between historical states and the current observation. Leveraging a two-stage training strategy with Dataset Aggregation (DAgger), our approach achieves state-of-the-art performance, recording an 80% Success Rate (SR) and 0.74 NDTW on the val-unseen split. Real-world deployment further demonstrates the practicality of our method in physical environments without a pre-built map.

105. 【2603.16165】Homogeneous and Heterogeneous Consistency progressive Re-ranking for Visible-Infrared Person Re-identification

Link: https://arxiv.org/abs/2603.16165

Authors: Yiming Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visible-infrared person re-identification, faces greater challenges, re-identification faces greater, person re-identification faces, traditional person re-identification

Comments:

Click to view abstract

Abstract:Visible-infrared person re-identification faces greater challenges than traditional person re-identification due to the significant differences between modalities. In particular, the differences between these modalities make effective matching even more challenging, mainly because existing re-ranking algorithms cannot simultaneously address the intra-modal variations and inter-modal discrepancy in cross-modal person re-identification. To address this problem, we propose a novel Progressive Modal Relationship Re-ranking method consisting of two modules, called heterogeneous and homogeneous consistency re-ranking (HHCR). The first module, heterogeneous consistency re-ranking, explores the relationship between the query and the gallery modalities in the test set. The second module, homogeneous consistency re-ranking, investigates the intrinsic relationship within each modality between the query and the gallery in the test set. Based on this, we propose a baseline for cross-modal person re-identification, called a consistency re-ranking inference network (CRI). We conducted comprehensive experiments demonstrating that our proposed re-ranking method generalizes well, and that both the re-ranking and the baseline achieve state-of-the-art performance.

106. 【2603.16163】STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Link: https://arxiv.org/abs/2603.16163

Authors: Suvajit Patra,Soumitra Samanta

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Continuous Sign Language, Sign Language Recognition, Continuous Sign, Language Recognition, Sign Language

Comments:

Click to view abstract

107. 【2603.16160】Segmentation-before-Staining Improves Structural Fidelity in Virtual IHC-to-Multiplex IF Translation

Link: https://arxiv.org/abs/2603.16160

Authors: Junhyeok Lee,Han Jang,Heeseong Eum,Joon Jang,Kyu Sung Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables simultaneous single-cell, intact tissue architecture, high reagent cost, routine clinical adoption, specialized imaging platforms

Comments: 11 pages, 2 figures, 2 tables. Submitted to MICCAI 2026

Click to view abstract

Abstract:Multiplex immunofluorescence (mIF) enables simultaneous single-cell quantification of multiple biomarkers within intact tissue architecture, yet its high reagent cost, multi-round staining protocols, and need for specialized imaging platforms limit routine clinical adoption. Virtual staining can synthesize mIF channels from widely available brightfield immunohistochemistry (IHC), but current translators optimize pixel-level fidelity without explicitly constraining nuclear morphology. In pathology, this gap is clinically consequential: subtle distortions in nuclei count, shape, or spatial arrangement propagate directly to quantification endpoints such as the Ki67 proliferation index, where errors of a few percent can shift treatment-relevant risk categories. This work introduces a supervision-free, architecture-agnostic conditioning strategy that injects a continuous cell probability map from a pretrained nuclei segmentation foundation model as an explicit input prior, together with a variance-preserving regularization term that matches local intensity statistics to maintain cell-level heterogeneity in synthesized fluorescence channels. The soft prior retains gradient-level boundary information lost by binary thresholding, providing a richer conditioning signal without task-specific tuning. Controlled experiments across Pix2Pix with U-Net and ResNet generators, deterministic regression U-Net, and conditional diffusion on two independent datasets demonstrate consistent improvements in nuclei count fidelity and perceptual quality, as the sole modifications. Code will be made publicly available upon acceptance.

108. 【2603.16159】AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines

Link: https://arxiv.org/abs/2603.16159

Authors: Davie Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: graphical abstracts, producing publication-quality scientific, data visualizations, rapid advancement, advancement of generative

Comments:

Click to view abstract

Abstract:The rapid advancement of generative AI has introduced a new class of tools capable of producing publication-quality scientific figures, graphical abstracts, and data visualizations. However, academic publishers have responded with inconsistent and often ambiguous policies regarding AI-generated imagery. This paper surveys the current stance of major journals and publishers, including Nature, Science, Cell Press, Elsevier, and PLOS, on the use of AI-generated figures. We identify key concerns raised by publishers, including reproducibility, authorship attribution, and potential for visual misinformation. Drawing on practical examples from tools such as SciDraw, an AI-powered platform designed specifically for scientific illustration, we propose a set of best-practice guidelines for researchers seeking to use AI figure-generation tools in a compliant and transparent manner. Our findings suggest that, with appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity.

109. 【2603.16154】GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation

Link: https://arxiv.org/abs/2603.16154

Authors: Jiayi Tian,Jiaze Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: perceive dynamic environments, enabling intelligent agents, dynamic environments, essential for enabling, enabling intelligent

Comments:

Click to view abstract

Abstract:Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual invariant framework, termed Gaussian Aware Temporal Scaling (GATS), which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed Uncertainty Guided Gaussian Convolution (UGGC) incorporates local Gaussian statistics and uncertainty aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the Temporal Scaling Attention (TSA) introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (+6.62% accuracy), NTU RGBD (+1.4% accuracy), and Synthia4D (+1.8% mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer based counterparts.

110. 【2603.16151】EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation

Link: https://arxiv.org/abs/2603.16151

Authors: Yukun Zhao,Zichen Zhong,Yongshun Gong,Yilong Yin,Haoliang Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Denoising generative models, model complex grasp, large-scale data, generative models, model complex

Comments:

Click to view abstract

Abstract:Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that can lead to physically infeasible grasps. In this paper, we propose EFF-Grasp, a novel Flow-Matching-based framework for physics-aware dexterous grasp generation. Specifically, we reformulate grasp synthesis as a deterministic ordinary differential equation (ODE) process, which enables efficient and stable generation through smooth probability flows. To further enforce physical feasibility, we introduce a training-free physics-aware energy guidance strategy. Our method defines an energy-guided target distribution using adapted explicit physical energy functions that capture key grasp constraints, and estimates the corresponding guidance term via a local Monte Carlo approximation during inference. In this way, EFF-Grasp dynamically steers the generation trajectory toward physically feasible regions without requiring additional physics-based training or simulation feedback. Extensive experiments on five benchmark datasets show that EFF-Grasp achieves superior performance in grasp quality and physical feasibility, while requiring substantially fewer sampling steps than diffusion-based baselines.

111. 【2603.16139】Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Link: https://arxiv.org/abs/2603.16139

Authors: Peng Sun,Jun Xie,Tao Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unified Multimodal Models, Unified Multimodal, UMM visual generation

Comments: [this https URL](https://github.com/LINs-lab/IOMM)

Click to view abstract

Abstract:Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82, 0.55) and BLIP3-o-4B (0.84, 0.50). Code is available at this https URL.

112. 【2603.16134】When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems

Link: https://arxiv.org/abs/2603.16134

Authors: Shesh Narayan Gupta,Nik Bear Brown

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Generative models, poorly understood, models are widely, low-data conditions, conditions are poorly

Comments:

Click to view abstract

Abstract:Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen's d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 ± 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.

113. 【2603.16133】DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives

Link: https://arxiv.org/abs/2603.16133

Authors: Xiaoxu Meng,Zhongmin Chen,Bo Yang,Weikai Chen,Weixiao Liu,Lin Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: weak part boundaries, downstream asset reuse, Neural reconstructions, structure for fidelity, yielding dense

Comments:

Click to view abstract

Abstract:Neural reconstructions often trade structure for fidelity, yielding dense and unstructured meshes with irregular topology and weak part boundaries that hinder editing, animation, and downstream asset reuse. We present DualPrim, a compact and structured 3D reconstruction framework. Unlike additive-only implicit or primitive methods, DualPrim represents shapes with positive and negative superquadrics: the former builds the bases while the latter carves local volumes through a differentiable operator, enabling topology-aware modeling of holes and concavities. This additive-subtractive design increases the representational power without sacrificing compactness or differentiability. We embed DualPrim in a volumetric differentiable renderer, enabling end-to-end learning from multi-view images and seamless mesh export via closed-form boolean difference. Empirically, DualPrim delivers state-of-the-art accuracy and produces compact, structured, and interpretable outputs that better satisfy downstream needs than additive-only alternatives.
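
The additive-subtractive idea can be illustrated with the standard superquadric inside-outside function. The sketch below makes several simplifying assumptions that are not from the paper: primitives are axis-aligned (no rotation), each is a hypothetical `(center, scale, exponents)` tuple, and carving uses a hard boolean test rather than the differentiable operator the abstract mentions.

```python
def superquadric_f(point, center, scale, exponents):
    """Standard superquadric inside-outside value; below 1 means inside."""
    x, y, z = (abs(p - c) / s for p, c, s in zip(point, center, scale))
    e1, e2 = exponents
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)


def inside_shape(point, positives, negatives):
    """Point is in the shape if some positive primitive contains it
    and no negative primitive carves it away."""
    in_positive = any(superquadric_f(point, *p) < 1 for p in positives)
    in_negative = any(superquadric_f(point, *n) < 1 for n in negatives)
    return in_positive and not in_negative


# Illustrative shape: a unit ball with a smaller ball carved from its middle.
# Exponents (1, 1) reduce the superquadric to an ellipsoid.
ball = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 1.0))
hole = ((0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0))

print(inside_shape((0.75, 0.0, 0.0), [ball], [hole]))  # in the shell
print(inside_shape((0.0, 0.0, 0.0), [ball], [hole]))   # carved out
```

Two primitives describe a hollow shell here, whereas an additive-only model would need many small pieces to approximate the same cavity.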

114. 【2603.16130】EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion

Link: https://arxiv.org/abs/2603.16130

Authors: Zhiwei Wang,Yayu Zheng,Defeng He,Li Zhao,Xiaoqin Zhang,Yuxing Li,Edmund Y. Lam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Overexposure frequently occurs, critical visual information, practical scenarios, frequently occurs, occurs in practical

Comments:

Click to view abstract

Abstract:Overexposure frequently occurs in practical scenarios, causing the loss of critical visual information. However, existing infrared and visible fusion methods still exhibit unsatisfactory performance in highly bright regions. To address this, we propose EPOFusion, an exposure-aware fusion model. Specifically, a guidance module is introduced to facilitate the encoder in extracting fine-grained infrared features from overexposed regions. Meanwhile, an iterative decoder incorporating a multiscale context fusion module is designed to progressively enhance the fused image, ensuring consistent details and superior visual quality. Finally, an adaptive loss function dynamically constrains the fusion process, enabling an effective balance between the modalities under varying exposure conditions. To achieve better exposure awareness, we construct the first infrared and visible overexposure dataset (IVOE) with high quality infrared guided annotations for overexposed regions. Extensive experiments show that EPOFusion outperforms existing methods. It maintains infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, thereby enhancing both visual fidelity and downstream task performance. Code, fusion results and IVOE dataset will be made available at this https URL.

115. 【2603.16129】Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

Link: https://arxiv.org/abs/2603.16129

Authors: Da Zhang,Bingyu Li,Feiyu Wang,Zhiyuan Zhao,Junyu Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring visual exemplars, Zero-shot object counting, aims to enumerate, visual exemplars, enumerate objects

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model [...]. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cast aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.

116. 【2603.16122】Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning

Link: https://arxiv.org/abs/2603.16122

Authors: Sadia Ilyas,Annika Mütze,Klaus Friedrichs,Thomas Kurbiel,Matthias Rottmann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: OOD object detection, OOD, OOD objects, OOD object

Comments:

Click to view abstract

Abstract:Out-of-distribution (OOD) object detection is an important yet underexplored task. A reliable object detector should be able to handle OOD objects by localizing and correctly classifying them as OOD. However, a critical issue arises when such atypical objects are completely missed by the object detector and incorrectly treated as background. Existing OOD detection approaches in object detection often rely on complex architectures or auxiliary branches and typically do not provide a framework that treats in-distribution (ID) and OOD in a unified way. In this work, we address these limitations by enabling a single detector to detect OOD objects that are otherwise silently overlooked, alongside ID objects. We present SynOE-OD, a Synthetic Outlier-Exposure-based Object Detection framework that leverages strong generative models, like Stable Diffusion, and Open-Vocabulary Object Detectors (OVODs) to generate semantically meaningful, object-level data that serve as outliers during training. The generated data is used for transfer-learning to establish strong ID task performance and supplement detection models with OOD object detection robustness. Our approach achieves state-of-the-art average precision on an established OOD object detection benchmark, where OVODs, such as GroundingDINO, show limited zero-shot performance in detecting OOD objects in street-scenes.

117. 【2603.16113】PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency

Link: https://arxiv.org/abs/2603.16113

Authors: Minbing Chen,Zhu Meng,Fei Su

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: offer significant potential, enabling interpretable image, scalable decision support, offer significant, Natural Language Inference

Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt-1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert-defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman's rank correlation of $\rho=0.71$ ($p < 0.0001$), significantly outperforming Large Language Model (LLM)-based approaches (Gemini 3.0 Pro: $\rho=0.39$, $p < 0.0001$). These results establish PathGLS as a robust reference-free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: this https URL

118. 【2603.16103】NanoGS: Training-Free Gaussian Splat Simplification

Link: https://arxiv.org/abs/2603.16103

Authors: Butian Xiong,Rong Liu,Tiantian Zhou,Meida Chen,Zhiwen Fan,Andrew Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: incurring significant storage, Gaussian Splat, Gaussian Splat simplification, enables high-fidelity, real-time novel view

Comments:

Click to view abstract

Abstract:3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at this https URL.
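
Approximating two Gaussians with one by moment matching has a closed form: the merged primitive keeps the total mass and matches the mean and covariance of the two-component mixture. The sketch below shows only that textbook step. How NanoGS defines mass and how it merges opacity or color is not given in the abstract, so the scalar weights `w1` and `w2` and the function name are assumptions.

```python
def merge_gaussians(w1, mu1, cov1, w2, mu2, cov2):
    """Moment-matched merge of two weighted Gaussians (illustrative).

    Returns (w, mu, cov), where w = w1 + w2 and (mu, cov) are the mean
    and covariance of the mixture formed by the two inputs.
    """
    w = w1 + w2
    a1, a2 = w1 / w, w2 / w
    mu = [a1 * m1 + a2 * m2 for m1, m2 in zip(mu1, mu2)]
    # Offsets of each component mean from the merged mean.
    d1 = [m - c for m, c in zip(mu1, mu)]
    d2 = [m - c for m, c in zip(mu2, mu)]
    n = len(mu)
    # Mixture covariance: within-component spread plus between-mean spread.
    cov = [
        [
            a1 * (cov1[i][j] + d1[i] * d1[j])
            + a2 * (cov2[i][j] + d2[i] * d2[j])
            for j in range(n)
        ]
        for i in range(n)
    ]
    return w, mu, cov


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w, mu, cov = merge_gaussians(
    1.0, [-1.0, 0.0, 0.0], identity,
    1.0, [1.0, 0.0, 0.0], identity,
)
print(w, mu, cov)
```

For two unit-covariance Gaussians centered at x = -1 and x = +1, the merged covariance grows along x only, which is how one primitive can stand in for a pair stretched along the line between them.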

119. 【2603.16100】Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

链接：https://arxiv.org/abs/2603.16100

作者：Jonas Herzog,Yue Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent research suggested, CLIP-like contrastive language-image, contrastive language-image training, Recent research, research suggested

备注： Accepted for CVPR'26

点击查看摘要

Abstract:Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.

120. 【2603.16099】OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

链接：https://arxiv.org/abs/2603.16099

作者：Sensen Gao,Zhaoqing Wang,Qihang Cao,Dongdong Yu,Changhu Wang,Tongliang Liu,Mingming Gong,Jiawang Bian

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Existing diffusion-based, video latent spaces, consistency inherently challenging, generation methods primarily, methods primarily operate

备注： Code: [this https URL](https://github.com/SensenGao/OneWorld)

点击查看摘要

Abstract:Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at this https URL.

121. 【2603.16098】LICA: Layered Image Composition Annotations for Graphic Design Research

链接：https://arxiv.org/abs/2603.16098

作者：Elad Hirsch,Shubham Yadav,Mohit Garg,Purvanshi Mehta

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Layered Image Composition, Image Composition Annotations, Layered Image, Composition Annotations, design compositions designed

备注：

点击查看摘要

Abstract:We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts. In addition to rendered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investigations into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By representing design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.

122. 【2603.16093】Diffusion Models for Joint Audio-Video Generation

链接：https://arxiv.org/abs/2603.16093

作者：Alejandro Paredes La Torre

类目：Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：shown remarkable progress, Multimodal generative models, generative models, models have shown, shown remarkable

备注：

点击查看摘要

Abstract:Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consist of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity audio-video generations.

123. 【2603.16092】Parallel In-context Learning for Large Vision Language Models

链接：https://arxiv.org/abs/2603.16092

作者：Shin'ya Yamaguchi,Daiki Chijiwa,Tamao Sakao,Taku Hasegawa

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large vision-language models, employ multi-modal in-context, Large vision-language, multi-modal in-context learning, vision-language models

备注： Accepted to CVPR 2026 (Findings); Code is available at [this https URL](https://github.com/yshinya6/parallel-icl)

点击查看摘要

Abstract:Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
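
To make the logit-level integration step concrete, here is a minimal sketch of a weighted Product-of-Experts over per-chunk predictions. It is an illustration under assumptions, not the Parallel-ICL code: the function names, the plain-list representation of logits, and the way weights are normalised are all hypothetical. A weighted PoE multiplies the per-chunk distributions raised to their weights, which is equivalent to a weighted sum of logits followed by a softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_poe(chunk_logits, weights):
    """Combine per-chunk next-token logits with a weighted Product-of-Experts.

    chunk_logits: one logit vector per demonstration chunk (same vocabulary).
    weights: one non-negative relevance weight per chunk (assumed to come
             from some query-chunk similarity score; hypothetical here).
    """
    total_w = sum(weights)
    norm_w = [w / total_w for w in weights]
    vocab_size = len(chunk_logits[0])
    combined = [
        sum(w * logits[v] for w, logits in zip(norm_w, chunk_logits))
        for v in range(vocab_size)
    ]
    return softmax(combined)


# Two chunks that disagree; the first is judged three times more relevant.
probs = weighted_poe([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]], weights=[3.0, 1.0])
```

In this toy case the first token receives the highest probability because the more relevant chunk favours it.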

124. 【2603.16086】Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

链接：https://arxiv.org/abs/2603.16086

作者：Chang Nie,Tianchen Deng,Guangming Wang,Zhe Liu,Hesheng Wang

类目：Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

关键词：static pre-execution prompts, typically treat sound, human speech, begun to incorporate, typically treat

备注：

点击查看摘要

Abstract:While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at this https URL.

125. 【2603.16085】Interact3D: Compositional 3D Generation of Interactive Objects

链接：https://arxiv.org/abs/2603.16085

作者：Hui Shan,Keyang Luo,Ming Li,Sizhe Zheng,Yanwei Fu,Zhen Chen,Xiangru Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Recent breakthroughs, enabled the synthesis, synthesis of high-fidelity, high-fidelity individual assets, Recent

备注：

点击查看摘要

Abstract:Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images--particularly under occlusions--remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present Interact3D, a novel framework designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collision-aware compositions with improved geometric fidelity and consistent spatial relationships.

126. 【2603.16083】Structured prototype regularization for synthetic-to-real driving scene parsing

链接：https://arxiv.org/abs/2603.16083

作者：Jiahe Fan,Xiao Ma,Sergey Vityazev,George Giakos,Shaolong Shu,Rui Fan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：real-world traffic environments, complex real-world traffic, Driving scene parsing, traffic environments, critical for autonomous

备注：

点击查看摘要

Abstract:Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model's ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.

127. 【2603.16078】Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI

链接：https://arxiv.org/abs/2603.16078

作者：Athena Taymourtash,S. Mazdak Abulnaga,Esra Abaci Turk,P. Ellen Grant,Polina Golland

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词：Establishing dense volumetric, Establishing dense, anatomical shapes, shapes is essential, essential for group-level

备注：

点击查看摘要

Abstract:Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.

128. 【2603.16067】Attribution Upsampling should Redistribute, Not Interpolate

链接：https://arxiv.org/abs/2603.16067

作者：Vincenzo Buono,Peyman Sheikholharam Mashhadi,Mahmoud Rahat,Prayag Tiwari,Stefan Byttner

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：natural images, saliency maps, explainable AI rely, designed for natural, Attribution

备注：

点击查看摘要

Abstract:Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model's reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU's formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.
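
The contrast between interpolation and mass redistribution can be illustrated with a small 1D sketch. This is not the USU operator from the paper; it only shows what a ratio-form redistribution looks like in general. The `guide` signal (standing in for model-derived semantic weights), the function name, and the uniform fallback for all-zero blocks are assumptions. Each coarse cell's attribution is split among its fine cells in proportion to the guide, so the mass of every coarse cell is conserved exactly.

```python
def redistribute_upsample_1d(coarse, guide, factor):
    """Upsample a 1D attribution map by redistributing mass, not interpolating.

    coarse: attribution per coarse cell.
    guide:  non-negative weight per fine cell (hypothetical semantic signal),
            with len(guide) == len(coarse) * factor.
    factor: number of fine cells per coarse cell.
    """
    if len(guide) != len(coarse) * factor:
        raise ValueError("guide must have len(coarse) * factor entries")
    fine = []
    for j, mass in enumerate(coarse):
        block = guide[j * factor:(j + 1) * factor]
        total = sum(block)
        if total <= 0:
            # No guidance in this block: fall back to a uniform split.
            fine.extend([mass / factor] * factor)
        else:
            # Ratio form: each fine cell gets its share g / sum(block).
            fine.extend(mass * g / total for g in block)
    return fine


fine = redistribute_upsample_1d([4.0, 2.0], [1.0, 3.0, 1.0, 1.0], factor=2)
# fine == [1.0, 3.0, 1.0, 1.0]; total attribution (6.0) is unchanged.
```

Bilinear or bicubic interpolation offers no such per-cell conservation, which is the kind of violation the abstract refers to.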

129. 【2603.16063】ViT-AdaLA: Adapting Vision Transformers with Linear Attention

链接：https://arxiv.org/abs/2603.16063

作者：Yifan Li,Seunghyun Yoon,Viet Dac Lai,Franck Dernoncourt,Jason Kuen,Yu Kong,Trung Bui

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：based vision foundation, Vision Transformers, vision foundation models, achieved remarkable performance, diverse vision tasks

备注：

点击查看摘要

Abstract:Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.

130. 【2603.16050】The Era of End-to-End Autonomy: Transitioning from Rule-Based Driving to Large Driving Models

链接：https://arxiv.org/abs/2603.16050

作者：Eduardo Nebot,Julie Stephany Berrio Perez

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词：modular rule based, rule based pipelines, Rivian Unified Intelligence, undergoing a shift, shift from modular

备注：

点击查看摘要

Abstract:Autonomous driving is undergoing a shift from modular rule-based pipelines toward end-to-end (E2E) learning systems. This paper examines this transition by tracing the evolution from classical sense-perceive-plan-control architectures to large driving models (LDMs) capable of mapping raw sensor input directly to driving actions. We analyze recent developments including Tesla's Full Self-Driving (FSD) V12 V14, Rivian's Unified Intelligence platform, NVIDIA Cosmos, and emerging commercial robotaxi deployments, focusing on architectural design, deployment strategies, safety considerations and industry implications. A key emerging product category is supervised E2E driving, often referred to as FSD (Supervised) or L2 plus plus, which several manufacturers plan to deploy from 2026 onwards. These systems can perform most of the Dynamic Driving Task (DDT) in complex environments while requiring human supervision, shifting the driver's role to safety oversight. Early operational evidence suggests E2E learning handles the long-tail distribution of real-world driving scenarios and is becoming a dominant commercial strategy. We also discuss how similar architectural advances may extend beyond autonomous vehicles (AV) to other embodied AI systems, including humanoid robotics.

131. 【2603.16043】Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

链接：https://arxiv.org/abs/2603.16043

作者：Xiaozhou Ye,Feng Jiang,Zihan Wang,Xiulai Wang,Yutao Zhang,Kevin I-Kai Wang

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Human Activity Recognition, Human Activity, Activity Recognition, wearable inertial sensors, Recognition using wearable

备注：

点击查看摘要

Abstract:Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53\% and 75.22\%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.
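
The critic-free advantage computation mentioned in the abstract can be sketched in a few lines. This is a generic illustration of group-relative normalisation rather than the CTFG training code; the function name, the `eps` stabiliser, and the use of the population standard deviation are assumptions. Each sampled sequence is scored against the other samples drawn from the same input, so no learned value function is needed.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples from the same input.

    Returns (reward - group mean) / (group std + eps) for each sample.
    eps is a hypothetical stabiliser for groups with identical rewards.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Three candidate feature sequences generated for the same sensor window.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples above the group mean receive positive advantages and those below receive negative ones; when every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.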

132. 【2603.16024】Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery

链接：https://arxiv.org/abs/2603.16024

作者：Jecia Z.Y. Mao,Francis X. Creighton,Russell H. Taylor,Manish Sahu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：dynamically executes perception, dynamically executes, skull base surgery, surgeon queries, video-guided skull base

备注：

点击查看摘要

Abstract:We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and support image guidance through real-time anatomical this http URL evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.

133. 【2603.16016】FlatLands: Generative Floormap Completion From a Single Egocentric View

链接：https://arxiv.org/abs/2603.16016

作者：Subhransu S. Bhattacharjee,Dylan Campbell,Rahul Shome

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词：single egocentric image, egocentric image typically, image typically captures, complete metric traversability, metric traversability map

备注： Under review

点击查看摘要

Abstract:A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.

134. 【2603.16001】Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

链接：https://arxiv.org/abs/2603.16001

作者：Sijie Li,Biao Qian,Jungong Han

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Large Vision-Language Models, enabling lightweight Large, lightweight Large Vision-Language, Vision-Language Models, lightweight Large

备注： CVPR 2026. Code available here: [this https URL](https://github.com/LezJ/ATV-Pruning)

点击查看摘要

135. 【2603.15975】UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

链接：https://arxiv.org/abs/2603.15975

作者：Xiaoyan Cong,Zekun Li,Zhiyang Dou,Hongyu Li,Omid Taheri,Chuan Guo,Abhay Mittal,Sizhe An,Taku Komura,Wojciech Matusik,Michael J. Black,Srinath Sridhar

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Large-scale foundation models, paired text descriptions, recently made impressive, made impressive progress, Large-scale foundation

备注： Project Page: [this https URL](https://oliver-cong02.github.io/UMO.github.io/)

点击查看摘要

Abstract:Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: this https URL

136. 【2603.15967】A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology

链接：https://arxiv.org/abs/2603.15967

作者：Harishwar Reddy Kasireddy(1),Patricio S. La Rosa(1 and 2),Akshita Gupta(1),Anindya S. Paul(1),Jamie L. Fermin(1),William L. Clapp(1),Meryl A. Waldman(3),Tarek M. El-Ashkar(4),Sanjay Jain(5),Luis Rodrigues(6),Kuang Yu Jen(7),Avi Z. Rosenberg(8),Michael T. Eadon(4),Jeffrey B. Hodgin(9),Pinaki Sarder(1) ((1) University of Florida, (2) Bayer Company, (3) National Institutes of Health, (4) Indiana University School of Medicine, (5) Washington University School of Medicine, (6) Universidade de Coimbra, (7) University of California Davis, (8) Johns Hopkins University, (9) University of Michigan)

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：large-scale cancer datasets, advanced computational pathology, Histopathology foundation models, pretrained on large-scale, cancer datasets

备注： 31 Pages, 14 Tables, 12 figures, Co-correspondence to jhodgin@med.umich.edu and pinaki.sarder@ufl.edu

点击查看摘要

Abstract:Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, HE, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at this https URL , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
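
The Holm-Bonferroni step named in the abstract is simple enough to sketch directly. The code below is a generic, standard-library version of the step-down correction and is not taken from the `kidney-hfm-eval` package; the function name and the example p-values are made up for illustration. Raw p-values are sorted in ascending order, the k-th smallest (counting from zero) is multiplied by (m - k), and the adjusted values are forced to be non-decreasing and capped at 1.

```python
def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise model comparisons.
adjusted = holm_bonferroni([0.01, 0.04, 0.03])  # approximately [0.03, 0.06, 0.06]
```

A comparison is then declared significant when its adjusted p-value falls below the chosen level, for example 0.05.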

137. 【2603.15941】Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

链接：https://arxiv.org/abs/2603.15941

作者：Samuel Johnny,Blessed Guda,Frank Ebeledike,Goodness Obasi,Moise Busogi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：chest computed tomography, Automated diagnosis, Distributionally Robust Optimisation, computed tomography, scans faces

备注：

点击查看摘要

Abstract:Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9. On Task 2, Group DRO with $\alpha = 0.5$ achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 over the Focal Loss baseline.
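
To illustrate how a KL penalty can keep group weights from collapsing, here is a minimal sketch. It assumes the penalty is a KL divergence from uniform group weights with strength `alpha`, in which case the worst-case weights have the closed form q_g ∝ exp(L_g / alpha). The abstract does not give the exact objective, so this form, the function names, and the role of `alpha` are assumptions rather than the authors' implementation.

```python
import math


def kl_regularised_group_weights(group_losses, alpha):
    """Weights maximising sum(q * L) - alpha * KL(q || uniform); assumed form."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Subtract the maximum loss before exponentiating for numerical stability.
    top = max(group_losses)
    exps = [math.exp((loss - top) / alpha) for loss in group_losses]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_group_loss(group_losses, alpha):
    """Loss used for the model update: group losses averaged under the weights."""
    weights = kl_regularised_group_weights(group_losses, alpha)
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Hypothetical per-group losses, e.g. one per acquisition centre.
losses = [0.2, 0.9, 0.4]
weights = kl_regularised_group_weights(losses, alpha=0.5)
```

Under this assumed form, a small `alpha` pushes the weights towards the single worst group (standard worst-case behaviour), while a large `alpha` pulls them back towards uniform, which is the balance the abstract describes.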

138. 【2603.15940】Do Not Leave a Gap: Hallucination-Free Object Concealment in Vision-Language Models

链接：https://arxiv.org/abs/2603.15940

作者：Amira Guesmi,Muhammad Shafique

类目：Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词：recently shown remarkable, shown remarkable capabilities, understanding and generation, recently shown, shown remarkable

备注：

点击查看摘要

Abstract:Vision-language models (VLMs) have recently shown remarkable capabilities in visual understanding and generation, but remain vulnerable to adversarial manipulations of visual content. Prior object-hiding attacks primarily rely on suppressing or blocking region-specific representations, often creating semantic gaps that inadvertently induce hallucination, where models invent plausible but incorrect objects. In this work, we demonstrate that hallucination arises not from object absence per se, but from semantic discontinuity introduced by such suppression-based attacks. We propose a new class of \emph{background-consistent object concealment} attacks, which hide target objects by re-encoding their visual representations to be statistically and semantically consistent with surrounding background regions. Crucially, our approach preserves token structure and attention flow, avoiding representational voids that trigger hallucination. We present a pixel-level optimization framework that enforces background-consistent re-encoding across multiple transformer layers while preserving global scene semantics. Extensive experiments on state-of-the-art vision-language models show that our method effectively conceals target objects while preserving up to $86\%$ of non-target objects and reducing grounded hallucination by up to $3\times$ compared to attention-suppression-based attacks.

139. 【2603.15932】Nodule-Aligned Latent Space Learning with LLM-Driven Multimodal Diffusion for Lung Nodule Progression Prediction

链接：https://arxiv.org/abs/2603.15932

作者：James Song,Yifan Wang,Chuan Zhou,Liyue Shen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Electronic Health Record, biological mechanisms driving, nodule Electronic Health, lung nodule progression, driving nodule progression

备注：

点击查看摘要

Abstract:Early diagnosis of lung cancer is challenging due to biological uncertainty and the limited understanding of the biological mechanisms driving nodule progression. To address this, we propose Nodule-Aligned Multimodal (Latent) Diffusion (NAMD), a novel framework that predicts lung nodule progression by generating 1-year follow-up nodule computed tomography images with baseline scans and the patient's and nodule's Electronic Health Record (EHR). NAMD introduces a nodule-aligned latent space, where distances between latents directly correspond to changes in nodule attributes, and utilizes an LLM-driven control mechanism to condition the diffusion backbone on patient data. On the National Lung Screening Trial (NLST) dataset, our method synthesizes follow-up nodule images that achieve an AUROC of 0.805 and an AUPRC of 0.346 for lung nodule malignancy prediction, significantly outperforming both baseline scans and state-of-the-art synthesis methods, while closely approaching the performance of real follow-up scans (AUROC: 0.819, AUPRC: 0.393). These results demonstrate that NAMD captures clinically relevant features of lung nodule progression, facilitating earlier and more accurate diagnosis.

140. 【2603.15919】Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers

链接：https://arxiv.org/abs/2603.15919

作者：Siyu Zhang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Sparse neural networks, neural networks, interpretability, models, weight sparsity

备注：

点击查看摘要

Abstract:Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity itself leads to improved semantic interpretability. In this work, we systematically evaluate the relationship between weight sparsity and interpretability in Vision Transformers using DeiT-III B/16 models pruned with Wanda. To assess interpretability comprehensively, we introduce \textbf{IMPACT}, a multi-level framework that evaluates interpretability across four complementary levels: neurons, layer representations, task circuits, and model-level attribution. Layer representations are analyzed using BatchTopK sparse autoencoders, circuits are extracted via learnable node masking, and explanations are evaluated with transformer attribution using insertion and deletion metrics. Our results reveal a clear structural effect but limited interpretability gains. Sparse models produce circuits with approximately $2.5\times$ fewer edges than dense models, yet the fraction of active nodes remains similar or higher, indicating that pruning redistributes computation rather than isolating simpler functional modules. Consistent with this observation, sparse models show no systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. These findings suggest that structural sparsity alone does not reliably yield more interpretable vision models, highlighting the importance of evaluation frameworks that assess interpretability beyond circuit compactness.

141. 【2603.15901】Federated Learning for Privacy-Preserving Medical AI

链接：https://arxiv.org/abs/2603.15901

作者：Tin Hoang

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Alzheimer Disease Neuroimaging, Disease Neuroimaging Initiative, Alzheimer disease classification, Alzheimer disease, Neuroimaging Initiative

备注： MSc Dissertation

点击查看摘要

Abstract:This dissertation investigates privacy-preserving federated learning for Alzheimer's disease classification using three-dimensional MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Existing methodologies often suffer from unrealistic data partitioning, inadequate privacy guarantees, and insufficient benchmarking, limiting their practical deployment in healthcare. To address these gaps, this research proposes a novel site-aware data partitioning strategy that preserves institutional boundaries, reflecting real-world multi-institutional collaborations and data heterogeneity. Furthermore, an Adaptive Local Differential Privacy (ALDP) mechanism is introduced, dynamically adjusting privacy parameters based on training progression and parameter characteristics, thereby significantly improving the privacy-utility trade-off over traditional fixed-noise approaches. Systematic empirical evaluation across multiple client federations and privacy budgets demonstrated that advanced federated optimisation algorithms, particularly FedProx, could equal or surpass centralised training performance while ensuring rigorous privacy protection. Notably, ALDP achieved up to 80.4% accuracy in a two-client configuration, surpassing fixed-noise Local DP by 5-7 percentage points and demonstrating substantially greater training stability. The comprehensive ablation studies and benchmarking establish quantitative standards for privacy-preserving collaborative medical AI, providing practical guidelines for real-world deployment. This work thereby advances the state-of-the-art in federated learning for medical imaging, establishing both methodological foundations and empirical evidence necessary for future privacy-compliant AI adoption in healthcare.

142. 【2603.15888】AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

链接：https://arxiv.org/abs/2603.15888

作者：Andrea Tupini,Lars Liden,Reuben Tan,Yu Wang,Jianfeng Gao

类目：Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：evaluate visually grounded, visually grounded, focusing specifically, interactive planning, aim to evaluate

备注： 19 figures, 6 tables, including appendix

点击查看摘要

Abstract:With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?

143. 【2603.15887】EvoIQA - Explaining Image Distortions with Evolved White-Box Logic

链接：https://arxiv.org/abs/2603.15887

作者：Ruchika Gupta,Illya Bakurov,Nathan Haut,Wolfgang Banzhaf

类目：Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词：Image Quality Assessment, Quality Assessment, Traditional Image Quality, Image Quality, completely lack interpretability

备注： 11 pages, 3 figures

点击查看摘要

Abstract:Traditional Image Quality Assessment (IQA) metrics typically fall into one of two extremes: rigid, hand-crafted mathematical models or "black-box" deep learning architectures that completely lack interpretability. To bridge this gap, we propose EvoIQA, a fully explainable symbolic regression framework based on Genetic Programming that Evolves explicit, human-readable mathematical formulas for image quality assessment (IQA). Utilizing a rich terminal set from the VSI, VIF, FSIM, and HaarPSI metrics, our framework inherently maps structural, chromatic, and information-theoretic degradations into observable mathematical equations. Our results demonstrate that the evolved GP models consistently achieve strong alignment between the predictions and human visual preferences. Furthermore, they not only outperform traditional hand-crafted metrics but also achieve performance parity with complex, state-of-the-art deep learning models like DB-CNN, proving that we no longer have to sacrifice interpretability for state-of-the-art performance.

144. 【2603.15862】Self-supervised Disentanglement of Disease Effects from Aging in 3D Medical Shapes

链接：https://arxiv.org/abs/2603.15862

作者：Jakaria Rabbi,Nilanjan Ray,Dana Cobzas

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：developing interpretable biomarkers, Disentangling pathological, patient stratification, crucial for developing, developing interpretable

备注： 10 pages

点击查看摘要

Abstract:Disentangling pathological changes from physiological aging in 3D medical shapes is crucial for developing interpretable biomarkers and patient stratification. However, this separation is challenging when diagnosis labels are limited or unavailable, since disease and aging often produce overlapping effects on shape changes, obscuring clinically relevant shape patterns. To address this challenge, we propose a two-stage framework combining unsupervised disease discovery with self-supervised disentanglement of implicit shape representations. In the first stage, we train an implicit neural model with signed distance functions to learn stable shape embeddings. We then apply clustering on the shape latent space, which yields pseudo disease labels without using ground-truth diagnosis during discovery. In the second stage, we disentangle factors in a compact variational space using pseudo disease labels discovered in the first stage and the ground truth age labels available for all subjects. We enforce separation and controllability with a multi-objective disentanglement loss combining covariance and a supervised contrastive loss. On ADNI hippocampus and OAI distal femur shapes, we achieve near-supervised performance, improving disentanglement and reconstruction over state-of-the-art unsupervised baselines, while enabling high-fidelity reconstruction, controllable synthesis, and factor-based explainability. Code and checkpoints are available at this https URL

145. 【2603.15847】FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

链接：https://arxiv.org/abs/2603.15847

作者：Eadom Dessalene,Botao He,Michael Maynord,Yonatan Tussa,Pavan Mantripragada,Yianni Karabati,Nirupam Roy,Yiannis Aloimonos

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：Force-Enhanced Egocentric Learning, large-scale dataset pairing, Force-Enhanced Egocentric, custom piezoresistive gloves, dataset pairing force

备注： 14 pages, 7 figures

点击查看摘要

Abstract:We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.

146. 【2603.15822】Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation

链接：https://arxiv.org/abs/2603.15822

作者：Renjie Liang,Yiling Ma,Yang Xing,Zhengkang Fan,Jinqian Pan,Chengkun Sun,Li Li,Kuang Gong,Jie Xu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Automated radiology report, incomplete pathology coverage, Automated radiology, radiology report generation, radiology report

备注：

点击查看摘要

Abstract:Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose \textbf{AdaRAG-CT}, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at this https URL.

147. 【2603.15818】Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

链接：https://arxiv.org/abs/2603.15818

作者：Salah Eddine Bekhouche,Hichem Telli,Azeddine Benlamoudi,Salah Eddine Herrouz,Abdelmalik Taleb-Ahmed,Abdenour Hadid

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：subtle affective states, person shows conflicting, shows conflicting signals, subtle affective, conflicting signals

备注：

点击查看摘要

Abstract:Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels -- saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the \emph{disagreements} between what is said, how it sounds, and what the face shows. We present \textbf{ConflictAwareAH}, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features -- element-wise absolute differences between modality embeddings -- serve as \emph{bidirectional} cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary \emph{text-guided late fusion} strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches \textbf{0.694 Macro F1} on the labelled test split and \textbf{0.715} on the private leaderboard, outperforming published multimodal baselines by over 10 points -- all on a single GPU in under 25 minutes of training.
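
The pairwise conflict features described above can be sketched as follows. This is a minimal illustration that assumes the three modality embeddings have already been projected to a common dimension and that the three difference vectors are concatenated; both choices, and the function names, are assumptions rather than details from the paper.

```python
def abs_diff(a, b):
    """Element-wise absolute difference between two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must share the same dimension")
    return [abs(x - y) for x, y in zip(a, b)]


def conflict_features(video, audio, text):
    """Concatenate |video-audio|, |video-text| and |audio-text| (assumed layout)."""
    return abs_diff(video, audio) + abs_diff(video, text) + abs_diff(audio, text)


# Hypothetical 2-dimensional embeddings for one clip.
features = conflict_features([1.0, 2.0], [1.0, 0.0], [0.0, 2.0])
```

Large entries mark dimensions where two modalities disagree, while near-zero entries indicate consistency, so the same vector can carry evidence for either class.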

148. 【2603.15812】ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation

链接：https://arxiv.org/abs/2603.15812

作者：Aditya Iyer,Jack Roberts,Nora Ayanian

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multi-View Multi-Object Tracking, Bird Eye View, Multi-View Multi-Object, aims to localize, objects observed

备注：

点击查看摘要

Abstract:Multi-View Multi-Object Tracking (MV-MOT) aims to localize and maintain consistent identities of objects observed by multiple sensors. This task is challenging, as viewpoint changes and occlusion disrupt identity consistency across views and time. Recent end-to-end approaches address this by jointly learning 2D Bird's Eye View (BEV) representations and identity associations, achieving high tracking accuracy. However, these methods offer no principled uncertainty accounting and remain tightly coupled to their training configuration, limiting generalization across sensor layouts, modalities, or datasets without retraining. We propose ModTrack, a modular MV-MOT system that matches end-to-end performance while providing cross-modal, sensor-agnostic generalization and traceable uncertainty. ModTrack confines learning methods to just the \textit{Detection and Feature Extraction} stage of the MV-MOT pipeline, performing all fusion, association, and tracking with closed-form analytical methods. Our design reduces each sensor's output to calibrated position-covariance pairs $(\mathbf{z}, R)$; cross-view clustering and precision-weighted fusion then yield unified estimates $(\hat{\mathbf{z}}, \hat{R})$ for identity assignment and temporal tracking. A feedback-coupled, identity-informed Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with HMM motion modes uses these fused estimates to maintain identities under missed detections and heavy occlusion. ModTrack achieves 95.5 IDF1 and 91.4 MOTA on \textit{WildTrack}, surpassing all prior modular methods by over 21 points and rivaling the state-of-the-art end-to-end methods while providing deployment flexibility they cannot. Specifically, the same tracker core transfers unchanged to \textit{MultiviewX} and \textit{RadarScenes}, with only perception-module replacement required to extend to new domains and sensor modalities.
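
For reference, precision-weighted fusion of position-covariance pairs is commonly written as below. This is the standard inverse-covariance rule for independent Gaussian measurements $(\mathbf{z}_i, R_i)$ assigned to the same cluster; the abstract does not spell out the formula, so treating it as ModTrack's fusion step is an assumption.

```latex
\hat{R} = \Big( \sum_{i} R_i^{-1} \Big)^{-1},
\qquad
\hat{\mathbf{z}} = \hat{R} \sum_{i} R_i^{-1} \mathbf{z}_i
```

Under this rule, sensors with smaller covariance (higher precision) contribute more to $\hat{\mathbf{z}}$, and the fused covariance $\hat{R}$ is no larger than any individual $R_i$.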

149. 【2603.15811】Feed-forward Gaussian Registration for Head Avatar Creation and Editing

链接：https://arxiv.org/abs/2603.15811

作者：Malte Prinzler,Paulo Gotardo,Siyu Tang,Timo Bolkart

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Gaussian splat textures, high-quality head avatar, predicts Gaussian splat, head avatar, Gaussian splat

备注： Website: [this https URL](https://malteprinzler.github.io/projects/match) ; Video: [this https URL](https://youtu.be/Z3xoXQ648sE)

点击查看摘要

Abstract:We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.

150. 【2603.15800】Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

链接：https://arxiv.org/abs/2603.15800

作者：Ce Zhang,Jinxi He,Junyi He,Katia Sycara,Yaqi Xie

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

关键词：Large Language Models, Multi-modal Large Language, Large Language, Language Models, visual reasoning tasks

备注： Accepted at CVPR 2026. Project page: [this https URL](https://echosafe-mllm.github.io)

点击查看摘要

151. 【2603.15780】Parallelised Differentiable Straightest Geodesics for 3D Meshes

链接：https://arxiv.org/abs/2603.15780

作者：Hippolyte Verninas,Caner Korkmaz,Stefanos Zafeiriou,Tolga Birdal,Simone Foti

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

关键词：geometrically accurate methods, non-Euclidean domains, Machine learning, progressively generalised, generalised to operate

备注： Accepted to CVPR 2026

点击查看摘要

Abstract:Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, models, and pip-installable library (digeo) are available at: this http URL.

152. 【2603.15774】Domain Adaptation Without the Compute Burden for Efficient Whole Slide Image Analysis

链接：https://arxiv.org/abs/2603.15774

作者：Umar Marikkar,Muhammad Awais,Sara Atito

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Multiple Instance Learning, enable early diagnosis, Multiple Instance, methods on analyzing, early diagnosis

备注：

点击查看摘要

Abstract:Computational methods for analyzing Whole Slide Images (WSIs) enable early diagnosis and treatments by supporting pathologists in detection and classification of tumors. However, the extremely high resolution of WSIs makes end-to-end training impractical compared to typical image analysis tasks. To address this, most approaches use pre-trained feature extractors to obtain fixed representations of whole slides, which are then combined with Multiple Instance Learning (MIL) for downstream tasks. These feature extractors are typically pre-trained on natural image datasets such as ImageNet, which fail to capture domain-specific characteristics. Although domain-specific pre-training on histopathology data yields more relevant feature representations, it remains computationally expensive and fails to capture task-specific characteristics within the domain. To address the computational cost and lack of task-specificity in domain-specific pre-training, we propose EfficientWSI (eWSI), a careful integration of Parameter-Efficient-Fine-Tuning (PEFT) and Multiple Instance Learning (MIL) that enables end-to-end training on WSI tasks. We evaluate eWSI on seven WSI-level tasks over Camelyon16, TCGA and BRACS datasets. Our results show that eWSI when applied with ImageNet feature extractors yields strong classification performance, matching or outperforming MILs with in-domain feature extractors, alleviating the need for extensive in-domain pre-training. Furthermore, when eWSI is applied with in-domain feature extractors, it further improves classification performance in most cases, demonstrating its ability to capture task-specific information where beneficial. Our findings suggest that eWSI provides a task-targeted, computationally efficient path for WSI tasks, offering a promising direction for task-specific learning in computational pathology.

153. 【2603.15767】CLRNet: Targetless Extrinsic Calibration for Camera, Lidar and 4D Radar Using Deep Learning

链接：https://arxiv.org/abs/2603.15767

作者：Marcell Kegl,Andras Palffy,Csaba Benedek,Dariu M. Gavrila

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：address extrinsic calibration, address extrinsic, calibration, Accurate extrinsic calibration, extrinsic calibration

备注： Submitted to IEEE Transactions on Intelligent Vehicles

点击查看摘要

Abstract:In this paper, we address extrinsic calibration for camera, lidar, and 4D radar sensors. Accurate extrinsic calibration of radar remains a challenge due to the sparsity of its data. We propose CLRNet, a novel, multi-modal end-to-end deep learning (DL) calibration network capable of addressing joint camera-lidar-radar calibration, or pairwise calibration between any two of these sensors. We incorporate equirectangular projection, camera-based depth image prediction, additional radar channels, and leverage lidar with a shared feature space and loop closure loss. In extensive experiments using the View-of-Delft and Dual-Radar datasets, we demonstrate superior calibration accuracy compared to existing state-of-the-art methods, reducing both median translational and rotational calibration errors by at least 50%. Finally, we examine the domain transfer capabilities of the proposed network and baselines, when evaluating across datasets. The code will be made publicly available upon acceptance at: this https URL.

154. 【2603.15717】GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

Link: https://arxiv.org/abs/2603.15717

Authors: Neeraj Solanki, Hong Ding, Sepehr Tabrizchi, Ali Shafiee Sarvestani, Shaahin Angizi, David Z. Pan, Arman Roohi

Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: tight power budgets, Real-time object detection, critical computational constraints, faces critical computational, systems faces critical

Abstract:Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10 ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50% and energy consumption by 65%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1% mAP on COCO (51.8% on attended objects) while maintaining sub-10 ms latency, meeting stringent AR/VR requirements by improving the communication time by $177\times$. Compared to the global YOLOv12n baseline, which achieves 39.2%, 63.4%, and 83.1% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3%, 72.1%, and 88.1% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.

155. 【2603.15689】Transition Flow Matching

Link: https://arxiv.org/abs/2603.15689

Authors: Chenrui Ma

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mainstream flow matching, local velocity field, inherently requires multiple, requires multiple integration, Mainstream flow

Abstract:Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be transferred to arbitrary future time points. In this work, we propose a new paradigm that directly learns the transition flow. As a global quantity, the transition flow naturally supports generation in a single step or at arbitrary time points. Furthermore, we demonstrate the connection between our approach and Mean Velocity Flow, establishing a unified theoretical perspective. Extensive experiments validate the effectiveness of our method and support our theoretical claims.

156. 【2603.15685】DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

Link: https://arxiv.org/abs/2603.15685

Authors: Bingzhou Li, Tao Huang

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Omnimodal large language, large language models, inference prohibitively expensive, resulting long multimodal, make inference prohibitively

Abstract:Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: this https URL.
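
The boundary-detection idea in this abstract can be illustrated with a minimal sketch: flag a segment boundary wherever the cosine similarity between consecutive audio embeddings drops below a threshold. The function names, the fixed threshold, and the toy embeddings are assumptions made for illustration only; the abstract does not say how DASH scores discontinuities, and this is not the authors' implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def detect_boundaries(embeddings, threshold=0.5):
    """Return indices i where embeddings[i] starts a new segment.

    A boundary is declared when similarity to the previous embedding
    falls below `threshold` (a hypothetical, hand-picked value).
    """
    return [
        i
        for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]

def split_segments(embeddings, threshold=0.5):
    """Turn boundaries into variable-length (start, end) index ranges."""
    if not embeddings:
        return []
    cuts = [0] + detect_boundaries(embeddings, threshold) + [len(embeddings)]
    return list(zip(cuts, cuts[1:]))

# Toy audio embeddings: two frames pointing one way, then two pointing another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = split_segments(toy)
```

In a pipeline like the one described, the resulting index ranges would then be projected onto the video tokens that share those time steps; that projection is not shown here.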

157. 【2603.15679】IdentityGuard: Context-Aware Restriction and Provenance for Personalized Synthesis

Link: https://arxiv.org/abs/2603.15679

Authors: Lingyun Zhang, Yu Xie, Ping Chen

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generic context-blind methods, unique safety challenge, ill-equipped to handle, poses a unique, challenge that generic

Comments: 5 pages, 3 figures, Accepted to ICASSP

Abstract:The nature of personalized text-to-image models poses a unique safety challenge that generic context-blind methods are ill-equipped to handle. Such global filters create a dilemma: to prevent misuse, they are forced to damage the model's broader utility by erasing concepts entirely, causing unacceptable collateral damage. This work presents a more precisely targeted approach, built on the principle that security should be as context-aware as the threat itself, intrinsically bound to the personalized concept. We present IDENTITYGUARD, which realizes this principle through a conditional restriction that blocks harmful content only when combined with the personalized identity, and a concept-specific watermark for precise traceability. Experiments show our approach prevents misuse while preserving the model's utility and enabling robust traceability. By moving beyond blunt, global filters, our work demonstrates a more effective and responsible path toward AI safety.

158. 【2603.15663】OrthoAI v2: From Single-Agent Segmentation to Dual-Agent Treatment Planning for Clear Aligners

Link: https://arxiv.org/abs/2603.15663

Authors: Lansiaux Edouard, Leman Margaux

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: framework previously introduced, single-agent framework previously, AI-assisted orthodontic treatment, Convolutional Neural Networks, Dynamic Graph Convolutional

Abstract:We present OrthoAI v2, the second iteration of our open-source pipeline for AI-assisted orthodontic treatment planning with clear aligners, substantially extending the single-agent framework previously introduced. The first version established a proof-of-concept based on Dynamic Graph Convolutional Neural Networks (DGCNN) for tooth segmentation but was limited to per-tooth centroid extraction, lacked landmark-level precision, and produced a scalar quality score without staging simulation. OrthoAI v2 addresses all three limitations through three principal contributions: (i) a second agent adopting the Conditioned Heatmap Regression Methodology (CHaRM) \cite{rodriguez2025charm} for direct, segmentation-free dental landmark detection, fused with Agent 1 via a confidence-weighted orchestrator in three modes (parallel, sequential, single-agent); (ii) a composite six-category biomechanical scoring model (biomechanics $\times$ 0.30 + staging $\times$ 0.20 + attachments $\times$ 0.15 + IPR $\times$ 0.10 + occlusion $\times$ 0.10 + predictability $\times$ 0.15) replacing the binary pass/fail check of v1; (iii) a multi-frame treatment simulator generating $F = A \times r$ temporally coherent 6-DoF tooth trajectories via SLERP interpolation and evidence-based staging rules, enabling ClinCheck 4D visualisation. On a synthetic benchmark of 200 crowding scenarios, the parallel ensemble of OrthoAI v2 reaches a planning quality score of $92.8 \pm 4.1$ vs. $76.4 \pm 8.3$ for OrthoAI v1, a $+21\%$ relative gain, while maintaining full CPU deployability ($4.2 \pm 0.8$ s).
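
The composite score in this abstract is a plain weighted sum, which the short sketch below spells out using the six weights quoted there. The dictionary keys, the function names, and the assumption that every category is scored on a common 0-100 scale are illustrative choices, not details taken from the paper. The last line checks the quoted relative gain from the two reported means.

```python
# Weights as quoted in the abstract; they sum to 1.0.
WEIGHTS = {
    "biomechanics": 0.30,
    "staging": 0.20,
    "attachments": 0.15,
    "ipr": 0.10,
    "occlusion": 0.10,
    "predictability": 0.15,
}

def composite_score(category_scores):
    """Weighted sum of per-category scores (assumed to share one scale)."""
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old

# Reported mean planning quality scores: 92.8 (v2) vs. 76.4 (v1).
gain_percent = relative_gain(92.8, 76.4) * 100  # about 21.5, quoted as +21%
```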

159. 【2603.15656】Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors

Link: https://arxiv.org/abs/2603.15656

Authors: Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: network models deteriorates, models deteriorates due, deteriorates due, model, corrupted samples

Comments: Accepted to CVPR 2026

Abstract:The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to their opaque nature, rectifying models to address this problem often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects unreliable model behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck in model rectification arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for unreliable behavior. Extensive experiments demonstrate the effectiveness of our method in correcting unreliable behaviors observed for neural Trojans, spurious correlations, and feature leakage. Our method achieves its editing objective with as few as a single cleansed sample, which makes it appealing for practice.

160. 【2603.15654】Discovering the Hidden Role of Gini Index In Prompt-based Classification

Link: https://arxiv.org/abs/2603.15654

Authors: Ruixi Lin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-tailed minority classes, long-tailed minority, offer the predictions, Gini, minority classes

Abstract:In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of the Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in prompt-based classification results for both text and images, regardless of whether the classification is high-dimensional or low-dimensional. Then, we harness the Gini metric to propose a post-hoc model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top class relative dominance while elevating weakest classes.
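
For readers unfamiliar with the Gini Index as an imbalance measure, the sketch below applies the standard mean-absolute-difference form of the Gini coefficient to a list of per-class accuracies. Whether the paper uses this exact definition is an assumption; the abstract does not give the formula, and the sample accuracies are made up for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    Uses G = sum_i sum_j |x_i - x_j| / (2 * n * sum(x)), which is 0 when all
    values are equal and approaches 1 when a single value dominates.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)

# Hypothetical per-class accuracies for a 4-class prompt-based classifier.
balanced = [0.8, 0.8, 0.8, 0.8]
dominated = [0.0, 0.0, 0.0, 1.0]

gini_balanced = gini(balanced)    # no disparity between classes
gini_dominated = gini(dominated)  # one class carries all of the accuracy
```

A lower value means class accuracies are closer to one another, which is the sense in which such a score could serve as a debiasing objective.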

161. 【2603.15650】How to Achieve Prototypical Birth and Death for OOD Detection?

Link: https://arxiv.org/abs/2603.15650

Authors: Ningkang Peng, Qianfeng Yu, Xiaoqian Peng, Linjing Qian, Yafei Liu, Canran Xiao, Xinyu Lu, Tingyu Lu, Zhichao Zheng, Yanhui Gu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: machine learning models, prototype-based learning methods, achieving OOD detection, prototype-based learning, OOD detection

Abstract:Out-of-Distribution (OOD) detection is crucial for the secure deployment of machine learning models, and prototype-based learning methods are among the mainstream strategies for achieving OOD detection. Existing prototype-based learning methods generally rely on a fixed number of prototypes. This static assumption fails to adapt to the inherent complexity differences across various categories. Currently, there is still a lack of a mechanism that can adaptively adjust the number of prototypes based on data complexity. Inspired by the processes of cell birth and death in biology, we propose a novel method named PID (Prototype bIrth and Death) to adaptively adjust the prototype count based on data complexity. This method relies on two dynamic mechanisms during the training process: prototype birth and prototype death. The birth mechanism instantiates new prototypes in data regions with insufficient representation by identifying the overload level of existing prototypes, thereby meticulously capturing intra-class substructures. Conversely, the death mechanism reinforces the decision boundary by pruning prototypes with ambiguous class boundaries through evaluating their discriminability. Through birth and death, the number of prototypes can be dynamically adjusted according to the data complexity, leading to the learning of more compact and better-separated In-Distribution (ID) embeddings, which significantly enhances the capability to detect OOD samples. Experiments demonstrate that our dynamic method, PID, significantly outperforms existing methods on benchmarks such as CIFAR-100, achieving State-of-the-Art (SOTA) performance, especially on the FPR95 metric.

162. 【2603.15648】Improving Generative Adversarial Network Generalization for Facial Expression Synthesis

Link: https://arxiv.org/abs/2603.15648

Authors: Arbish Akram, Nazar Khan, Arif Mahmood

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: generate realistic facial, Facial expression synthesis, realistic facial expressions, realistic facial, expression synthesis aims

Abstract:Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.

163. 【2603.15624】Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Link: https://arxiv.org/abs/2603.15624

Authors: Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang, Sundeep Rangan, Maurizio Porfiri, Zhou Yu, John-Ross Rizzo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: low vision, spatial reasoning, paper investigates, investigates the potential, potential of vision-language

Abstract:This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.

164. 【2603.15622】SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning

Link: https://arxiv.org/abs/2603.15622

Authors: Chenyu Ge

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, computational inefficiency due, Markov Decision Process

Abstract:Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.

165. 【2603.16587】HistoAtlas: A Pan-Cancer Morphology Atlas Linking Histomics to Molecular Programs and Clinical Outcomes

Link: https://arxiv.org/abs/2603.16587

Authors: Pierre-Antoine Bannier

Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: TCGA cancer types, interpretable histomic features, pan-cancer computational atlas, TCGA cancer, gene expression

Abstract:We present HistoAtlas, a pan-cancer computational atlas that extracts 38 interpretable histomic features from 6,745 diagnostic H&E slides across 21 TCGA cancer types and systematically links every feature to survival, gene expression, somatic mutations, and immune subtypes. All associations are covariate-adjusted, multiple-testing corrected, and classified into evidence-strength tiers. The atlas recovers known biology, from immune infiltration and prognosis to proliferation and kinase signaling, while uncovering compartment-specific immune signals and morphological subtypes with divergent outcomes. Every result is spatially traceable to tissue compartments and individual cells, statistically calibrated, and openly queryable. HistoAtlas enables systematic, large-scale biomarker discovery from routine H&E without specialized staining or sequencing. Data and an interactive web atlas are freely available at this https URL.

166. 【2603.16429】LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting

Link: https://arxiv.org/abs/2603.16429

Authors: Yicheng Rui, Xiao-Wei Duan, Licai Deng, Fan Yang, Zhengming Dang, Zhengjun Du, Junhao Peng, Wenhao Chu, Umut Mahmut, Kexin Li, Yiyun Wu, Fabo Feng

Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ground-based time-domain observatories, time-domain observatories require, Ground-based time-domain, lack astrometric calibration, existing all-sky datasets

Comments: CVPR Findings accepted. 20 pages, 8 figures

Abstract:Ground-based time-domain observatories require minute-by-minute, site-scale awareness of cloud cover, yet existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. We present LenghuSky-8, an eight-year (2018-2025) all-sky imaging dataset from a premier astronomical site, comprising 429,620 $512 \times 512$ frames with 81.2% night-time coverage, star-aware cloud masks, background masks, and per-pixel altitude-azimuth (Alt-Az) calibration. For robust cloud segmentation across day, night, and lunar phases, we train a linear probe on DINOv3 local features and obtain 93.3% $\pm$ 1.1% overall accuracy on a balanced, manually labeled set of 1,111 images. Using stellar astrometry, we map each pixel to local alt-az coordinates and measure calibration uncertainties of approximately 0.37 deg at zenith and approximately 1.34 deg at 30 deg altitude, sufficient for integration with telescope schedulers. Beyond segmentation, we introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination) with four baselines: persistence (copying the last frame), optical flow, ConvLSTM, and VideoGPT. ConvLSTM performs best but yields only limited gains over persistence, underscoring the difficulty of near-term cloud evolution. We release the dataset, calibrations, and an open-source toolkit for loading, evaluation, and scheduler-ready alt-az maps to boost research in segmentation, nowcasting, and autonomous observatory operations.

167. 【2603.16025】3D tomography of exchange phase in a Si/SiGe quantum dot device

Link: https://arxiv.org/abs/2603.16025

Authors: Dylan Albrecht, Sarah Thompson, N. Tobias Jacobson, Ryan Jock

Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)

Keywords: spin-based quantum processors, foundational building block, exchange interaction, exchange interaction coefficient, spin qubit devices

Comments: 11 pages, 6 figures

Abstract:The exchange interaction is a foundational building block for the operation of spin-based quantum processors. Extracting the exchange interaction coefficient $J(\mathbf{V})$, as a function of gate electrode voltages, is important for understanding disorder, faithfully simulating device performance, and operating spin qubits with high fidelity. Typical coherent measurements of exchange in spin qubit devices yield a modulated cosine of an accumulated phase, which in turn is the time integral of exchange. As such, extracting $J(\mathbf{V})$ from experimental data is difficult due to the ambiguity of inverting a cosine, the sensitivity to noise when unwrapping phase, as well as the problem of inverting the integral. As a step toward obtaining $J(\mathbf{V})$, we tackle the first two challenges to reveal the accumulated phase, $\phi(\mathbf{V})$. We incorporate techniques from a wide range of fields to robustly extract and model a 3D phase volume for spin qubit devices from a sequence of 2D measurements. In particular, we present a measurement technique to obtain the wrapped phase, as done in phase-shifting digital holography, and utilize the max-flow/min-cut phase unwrapping method (PUMA) to unwrap the phase in 3D voltage space. We show this method is robust to the minimal observed drift in the device, which we confirm by increasing scan resolution. Upon building a model of the extracted phase, we optimize over the model to locate a minimal-gradient $\pi$ exchange pulse point in voltage space. Our measurement protocol may provide detailed information useful for understanding the origins of device variability governing device yield, enable calibrating device models to specific devices during operation for more sophisticated error attribution, and enable a systematic optimization of qubit control. We anticipate that the methods presented here may be applicable to other qubit platforms.
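
To illustrate why a single cosine measurement is ambiguous and how phase shifting resolves it, the sketch below uses the textbook four-step phase-shifting formula from digital holography. The signal model, the quarter-period shifts, and all names are assumptions for illustration; the abstract does not describe the authors' actual measurement protocol, and the 3D PUMA unwrapping step is not shown.

```python
import math

def shifted_signals(phi, offset=2.0, amplitude=0.5):
    """Simulate four measurements I_k = offset + amplitude * cos(phi + k*pi/2)."""
    return [offset + amplitude * math.cos(phi + k * math.pi / 2) for k in range(4)]

def wrapped_phase(i0, i1, i2, i3):
    """Recover the phase, wrapped into (-pi, pi], from four shifted signals.

    i0 - i2 is proportional to cos(phi) and i3 - i1 to sin(phi), so atan2
    resolves the sign ambiguity that a single cosine cannot.
    """
    return math.atan2(i3 - i1, i0 - i2)

# A single cosine cannot distinguish +1.0 rad from -1.0 rad ...
same_cosine = math.isclose(math.cos(1.0), math.cos(-1.0))

# ... but the four-step estimate can.
phase_pos = wrapped_phase(*shifted_signals(1.0))
phase_neg = wrapped_phase(*shifted_signals(-1.0))

# Phases beyond pi come back wrapped, which is why unwrapping is still needed.
phase_wrapped = wrapped_phase(*shifted_signals(4.0))
```

The last value lands at 4.0 minus one full turn, the kind of discontinuity a phase-unwrapping method has to undo.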

168. 【2603.15834】Spectral Hierarchy of the Cosmic Web

Link: https://arxiv.org/abs/2603.15834

Authors: Francisco-Shu Kitaura, Francesco Sinigaglia

Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: applying simple scale-weighting, simple scale-weighting kernels, standard eigenvalue-based web, eigenvalue-based web classification, cosmic-web classifications obtained

Comments: 32 pages, 7 figures, 1 table

Abstract:We introduce a spectral hierarchy of cosmic-web classifications obtained by applying simple scale-weighting kernels to the density field before performing a standard eigenvalue-based web classification. This unifies and extends several widely used web definitions within a single framework: the familiar potential/tidal web (large-scale, nonlocal), a curvature-based web (more local, peak- and ridge-sensitive), and additional higher-derivative levels that progressively emphasize smaller-scale structure. Because the classification is built from second derivatives of the filtered field, successive hierarchy levels align naturally with operator families that appear in renormalised bias and effective descriptions of large-scale structure, providing an explicit bridge between cosmic-web environments and long- and short-range nonlocal bias ingredients. We quantify the information content of the hierarchy with a compact statistic: we map each cell to one of four ordered web types (void, sheet, filament, knot), construct a corresponding "web contrast" field, and measure its cross-correlation with halos from the AbacusSummit simulation suite on a coarse mesh with $\Delta L\simeq 5.5\,h^{-1}\mathrm{Mpc}$. We find that the hierarchy retains significant tracer-relevant information from very large scales down to the mesh Nyquist limit, with the more local (curvature/higher-derivative) levels dominating toward nonlinear scales. This makes the spectral hierarchy a practical, interpretable conditioning basis for fast mock-galaxy production and field-level modelling, and a flexible tool for studying environment-dependent clustering and assembly bias.
