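This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
555 papers were updated today, including:
- Natural language processing: 94 papers
- Information retrieval: 22 papers
- Computer vision: 132 papers
Natural Language Processing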
1. 【2603.20185】VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Link: https://arxiv.org/abs/2603.20185
Authors: Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: challenging video-language tasks, advanced challenging video-language, Video, video-language tasks, video understanding
Comments: Accepted at CVPR 2026
Abstract:Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
2. 【2603.20180】Adaptive Greedy Frame Selection for Long Video Understanding
Link: https://arxiv.org/abs/2603.20180
Authors: Yuning Huang, Fengqing Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: resulting visual tokens, Large vision, long-video question answering, language models, visual tokens
Comments:
Abstract: Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
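To make the objective in this abstract more concrete, here is a minimal sketch of greedy selection over a weighted sum of a modular relevance term and a facility-location coverage term. It is not the authors' implementation: the function name `greedy_select`, the weight `lam`, the division by the number of frames, and the hand-written toy scores (standing in for SigLIP and DINOv2 similarities) are all assumptions made for illustration.

```python
def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily pick `budget` frame indices (illustrative sketch only).

    relevance[i]     : relevance of frame i to the question (modular term)
    similarity[i][j] : non-negative similarity between frames i and j
    lam              : hypothetical weight trading relevance against coverage
    """
    n = len(relevance)
    selected = []
    # best_cover[j] = highest similarity between frame j and any selected frame
    best_cover = [0.0] * n

    def gain(i):
        # Marginal gain of the facility-location term if frame i is added.
        cover_gain = sum(max(similarity[i][j] - best_cover[j], 0.0) for j in range(n))
        return lam * relevance[i] + (1.0 - lam) * cover_gain / n

    for _ in range(min(budget, n)):
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=gain)
        selected.append(best)
        best_cover = [max(best_cover[j], similarity[best][j]) for j in range(n)]
    return sorted(selected)


# Toy data (made up): frames 0 and 1 are near-duplicates with high relevance,
# frame 3 is temporally distant with moderate relevance.
relevance = [0.9, 0.88, 0.1, 0.5]
similarity = [
    [1.0, 0.95, 0.1, 0.0],
    [0.95, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]

mixed = greedy_select(relevance, similarity, budget=2, lam=0.3)
relevance_only = greedy_select(relevance, similarity, budget=2, lam=1.0)
print(mixed)           # [0, 3]
print(relevance_only)  # [0, 1]
```

On this toy input, relevance-only selection picks the two near-duplicate frames, while the mixed objective swaps one of them for the distant frame, which is the failure mode and remedy the abstract describes.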
3. 【2603.20172】Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Link: https://arxiv.org/abs/2603.20172
Authors: Richard J. Young
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Recent work, measurable property, Sonnet judge, independent Claude Sonnet, Recent
Comments: 14 pages, 4 figures, 5 tables
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
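For readers unfamiliar with the agreement statistic cited in this abstract, the sketch below computes Cohen's kappa for two classifiers that label the same items. The label lists are invented toy data, not the paper's traces, and the function name `cohens_kappa` is just a local helper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both raters always use the same single label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (made up): "F" = faithful, "U" = unfaithful.
pipeline = ["F"] * 8 + ["U"] * 2
judge = ["F"] * 4 + ["U"] * 6

kappa = cohens_kappa(pipeline, judge)
print(round(kappa, 3))  # 0.286
```

In this toy case the two raters agree on 60% of items, but chance alone would give 44%, so kappa is only about 0.29. Every disagreement also runs in one direction (pipeline says faithful, judge says unfaithful), a small-scale version of the asymmetry the abstract reports.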
4. 【2603.20162】Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models
Link: https://arxiv.org/abs/2603.20162
Authors: Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer
Subjects: Computation and Language (cs.CL)
Keywords: National Climate Assessment, balance user-alignment pressures, contested domains, balance user-alignment, National Climate
Comments:
Abstract:In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
5. 【2603.20161】Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
Link: https://arxiv.org/abs/2603.20161
Authors: Qi Cao, Andrew Gambardella, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large language models, demonstrated remarkable capabilities, Large language, diverse tasks, demonstrated remarkable
Comments: EACL 2026
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.
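The core idea in this abstract, scoring a token by the probability mass of its whole semantic cluster, can be illustrated with a small sketch. Everything here is an assumption for illustration: the clusters are written by hand rather than derived from embedding clustering and prefix matching, the probabilities are invented, and `cluster_confidence` is a hypothetical helper rather than the STC method itself.

```python
def cluster_confidence(token_probs, clusters, generated):
    """Sum the next-token probability mass over the cluster containing `generated`.

    token_probs : dict mapping candidate tokens to their probabilities
    clusters    : list of sets of tokens treated as semantically equivalent
    generated   : the token the model actually produced
    """
    for cluster in clusters:
        if generated in cluster:
            return sum(token_probs.get(tok, 0.0) for tok in cluster)
    # Token is in no cluster: fall back to its own probability.
    return token_probs.get(generated, 0.0)


# Toy next-token distribution (made up) with surface variants of one answer.
token_probs = {"Paris": 0.45, " Paris": 0.30, "paris": 0.05, "London": 0.15, "Rome": 0.05}
clusters = [{"Paris", " Paris", "paris"}, {"London"}, {"Rome"}]

token_level = token_probs["Paris"]
cluster_level = cluster_confidence(token_probs, clusters, "Paris")
print(token_level, round(cluster_level, 2))  # 0.45 0.8
```

The token-level probability understates confidence because the mass is split across surface variants of the same answer; aggregating over the cluster recovers it from a single forward pass, which is the efficiency argument the abstract makes.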
6. 【2603.20149】Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification
Link: https://arxiv.org/abs/2603.20149
Authors: Ali Sakour, Zoalfekar Sakour
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: construct distributional semantic, Analogue to Language, Hyperspace Analogue, distributional semantic representations, global word co-occurrence
Comments: 7 pages, 1 figure, 1 table
Abstract:The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While these representations capture lexical relationships effectively, aggregating them into sentence-level embeddings via standard mean pooling often results in information loss. Mean pooling assigns equal weight to all tokens, thereby diluting the impact of contextually salient words with uninformative structural tokens. In this paper, we address this limitation by integrating a learnable, temperature-scaled additive attention mechanism into the HAL representation pipeline. To mitigate the sparsity and high dimensionality of the raw co-occurrence matrices, we apply Truncated Singular Value Decomposition (SVD) to project the vectors into a dense latent space prior to the attention layer. We evaluate the proposed architecture on the IMDB sentiment analysis dataset. Empirical results demonstrate that the attention-based pooling approach achieves a test accuracy of 82.38%, yielding an absolute improvement of 6.74 percentage points over the traditional mean pooling baseline (75.64%). Furthermore, qualitative analysis of the attention weights indicates that the mechanism successfully suppresses stop-words and selectively attends to sentiment-bearing tokens, improving both classification performance and model interpretability.
7. 【2603.20133】Reasoning Gets Harder for LLMs Inside A Dialogue
Link: https://arxiv.org/abs/2603.20133
Authors: Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, evaluations typically focus, achieve strong performance
Comments: Preprint
Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
8. 【2603.20114】Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax
Link: https://arxiv.org/abs/2603.20114
Authors: Mohammed Q. Shormani
Subjects: Computation and Language (cs.CL)
Keywords:
Comments: 15 pages
Abstract: (not available)
9. 【2603.20100】An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
Link: https://arxiv.org/abs/2603.20100
Authors: Yuming Feng, Christy Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: align language models, Direct Preference Optimization, Direct Preference, language models, data is under-specified
Comments:
Abstract:Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
10. 【2603.20079】Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues
Link: https://arxiv.org/abs/2603.20079
Authors: Yu Wang, Olcay Türk, Angela Grimminger, Hendrik Buschmeier
Subjects: Computation and Language (cs.CL)
Keywords: nonverbal linguistic features, investigate how verbal, verbal and nonverbal, contribute to predicting, explanatory interactions
Comments:
Abstract:We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener's level of understanding. Listener states ('Understanding', 'Partial Understanding', 'Non-Understanding' and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
11. 【2603.20042】LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families
Link: https://arxiv.org/abs/2603.20042
Authors: Jianan Chen, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, automatic speech recognition, speech language models, driven substantial advances, language models
Comments:
Abstract: Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech recognition (ASR) under high-resource conditions. However, existing benchmarks predominantly focus on high-resource languages, leaving the ASR behavior of SpeechLMs in low-resource languages insufficiently understood. This gap is critical, as practical ASR systems must reliably support low-resource languages and generalize across diverse language families, and it directly hinders the deployment of SpeechLM-based ASR in real-world multilingual scenarios. As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. LoASR-Bench comprises 25 languages from 9 language families, featuring both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of ASR performance of current SpeechLMs. Experimental results highlight the limitations of the latest SpeechLMs in handling real-world low-resource languages.
12. 【2603.20017】RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering
Link: https://arxiv.org/abs/2603.20017
Authors: Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: verifiable knowledge graphs, Knowledge graph question, Knowledge graph, knowledge graphs, mitigating LLM hallucination
Comments:
Abstract: Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized-general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Codes and models are available at this https URL.
13. 【2603.20004】ReViSQL: Achieving Human-Level Text-to-SQL
Link: https://arxiv.org/abs/2603.20004
Authors: Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Keywords: Translating natural language, Translating natural, data analytics applications, analytics applications, critical challenge
Comments:
Abstract: Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data to improve the SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts the single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework with two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.
14. 【2603.20003】An Agentic Approach to Generating XAI-Narratives
Link: https://arxiv.org/abs/2603.20003
Authors: Yifan He, David Martens
Subjects: Computation and Language (cs.CL)
Keywords: experienced substantial growth, design, Basic Design, XAI, research has experienced
Comments:
Abstract:Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert-oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)-generated XAI narratives have been proposed as a promising approach for translating post-hoc explanations into more accessible, natural-language explanations. In this work, we propose a multi-agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic-Rule Design, Coherent Design, and Coherent-Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic-Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude-4.5-Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek-V3.2-Exp. These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.
15. 【2603.19997】When Contextual Inference Fails: Cancelability in Interactive Instruction Following
Link: https://arxiv.org/abs/2603.19997
Authors: Natalia Bila, Kata Naszádi, Alexandra Mayn, Christof Monz
Subjects: Computation and Language (cs.CL)
Keywords: collaborative block-building task, resolve underspecified instructions, investigate the separation, separation of literal, literal interpretation
Comments:
Abstract:We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm -- which contrasts a pragmatically cooperative speaker with one who is only literally reliable -- we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.
16. 【2603.19987】Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
Link: https://arxiv.org/abs/2603.19987
Authors: Yurun Yuan, Tengyang Xie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, aligning Large Language, aligning Large, Reinforcement learning, Language Models
Comments:
Abstract:Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
17. 【2603.19954】On the Ability of Transformers to Verify Plans
Link: https://arxiv.org/abs/2603.19954
Authors: Yash Sarrof, Yupei Du, Katharina Stein, Alexander Koller, Sylvie Thiébaux, Michael Hahn
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: shown inconsistent success, shown inconsistent, inconsistent success, theoretical understanding, planning tasks
Comments:
Abstract: Transformers have shown inconsistent success in AI planning tasks, and theoretical understanding of when generalization should be expected has been limited. We take important steps towards addressing this gap by analyzing the ability of decoder-only models to verify whether a given plan correctly solves a given planning instance. To analyse the general setting where the number of objects (and thus the effective input alphabet) grows at test time, we introduce C*-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under the simultaneous growth in sequence length and vocabulary size. Our results identify a large class of classical planning domains for which transformers can provably learn to verify long plans, and structural properties that significantly affect the learnability of length-generalizable solutions. Empirical experiments corroborate our theory.
18. 【2603.19940】Hybrid topic modelling for computational close reading: Mapping narrative themes in Pushkin's Evgenij Onegin
Link: https://arxiv.org/abs/2603.19940
Authors: Angelo Maria Sabatini
Subjects: Computation and Language (cs.CL)
Keywords: Latent Dirichlet Allocation, Squares Discriminant Analysis, integrates Latent Dirichlet, Dirichlet Allocation, Latent Dirichlet
Comments: 25 pages, 4 figures, 2 supplementary materials; submitted to Digital Scholarship in the Humanities (under review)
Abstract: This study presents a hybrid topic modelling framework for computational literary analysis that integrates Latent Dirichlet Allocation (LDA) with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to model thematic structure and longitudinal dynamics in narrative poetry. As a case study, we analyse Evgenij Onegin, Aleksandr S. Pushkin's novel in verse, using an Italian translation, testing whether unsupervised and supervised lexical structures converge in a small-corpus setting. The poetic text is segmented into thirty-five documents of lemmatised content words, from which five stable and interpretable topics emerge. To address small-corpus instability, a multi-seed consensus protocol is adopted. Using sPLS-DA as a supervised probe enhances interpretability by identifying lexical markers that refine each theme. Narrative hubs (groups of contiguous stanzas marking key episodes) extend the bag-of-words approach to the narrative level, revealing how thematic mixtures align with the poem's emotional and structural arc. Rather than replacing traditional literary interpretation, the proposed framework offers a computational form of close reading, illustrating how lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features such as metre, phonology, or native morphology are abstracted away. Despite relying on a single lemmatised translation, the approach provides a transparent methodological template applicable to other high-density literary texts in comparative studies.
19. 【2603.19931】SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia
Link: https://arxiv.org/abs/2603.19931
Authors: Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, Zhengyong Jiang
Subjects: Computation and Language (cs.CL)
Keywords: World Wide Web, inclusive World Wide, World Wide, Wide Web, inclusive World
Comments: Accepted by WWW 2026
Abstract:The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the "right data" over "big data". Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
20. 【2603.19924】Translation from the Information Bottleneck Perspective: an Efficiency Analysis of Spatial Prepositions in Bitexts
Link: https://arxiv.org/abs/2603.19924
Authors: Antoine Taroni, Ludovic Moncla, Frederique Laforest
Subjects: Computation and Language (cs.CL)
Keywords: Efficient communication requires, communication requires balancing, Efficient communication, requires balancing informativity, communication requires
Comments:
Abstract:Efficient communication requires balancing informativity and simplicity when encoding meanings. The Information Bottleneck (IB) framework captures this trade-off formally, predicting that natural language systems cluster near an optimal accuracy-complexity frontier. While supported in visual domains such as colour and motion, linguistic stimuli such as words in sentential context remain unexplored. We address this gap by framing translation as an IB optimisation problem, treating source sentences as stimuli and target sentences as compressed meanings. This allows IB analyses to be performed directly on bitexts rather than controlled naming experiments. We applied this to spatial prepositions across English, German and Serbian translations of a French novel. To estimate informativity, we conducted a pile-sorting pilot-study (N=35) and obtained similarity judgements of pairs of prepositions. We trained a low-rank projection model (D=5) that predicts these judgements (Spearman correlation: 0.78). Attested translations of prepositions lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative efficiency pressure in the spatial domain. More broadly, this work suggests that translation can serve as a window into the cognitive efficiency pressures shaping cross-linguistic semantic systems.
21. 【2603.19921】Span-Level Machine Translation Meta-Evaluation
Link: https://arxiv.org/abs/2603.19921
Authors: Stefano Perrella, Eric Morales Agostinho, Hugo Zaragoza
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Machine Translation, recent years, enabling numerous, numerous novel applications, improved dramatically
Comments: 18 pages, 4 figures
Abstract:Machine Translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that do error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely-used techniques are unsuitable for evaluating MT error detection. We propose "match with partial overlap and partial credit" (MPP) with micro-averaging as a robust meta-evaluation strategy and release code for its use publicly. Finally, we use MPP to assess the state of the art in MT error detection.
22. 【2603.19849】Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue
Link: https://arxiv.org/abs/2603.19849
Authors: Riccardo Scantamburlo, Mauro Mezzanzana, Giacomo Buonanno, Francesco Bertolotti
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLMs talk, human, LLM, semantic, dialogue
Comments:
Abstract:Do LLMs talk like us? This question intrigues a multitude of scholars and is relevant in many fields, from education to academia. This work presents an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue. We introduce a lightweight metric derived from the distribution of semantic categories. Using the Empath lexical analysis framework, each text is mapped to a set of thematic intensity scores. We define semantic delta as the difference between the two most dominant category intensities within a dialogue, hypothesizing that LLM outputs exhibit stronger thematic concentration than human discourse. To evaluate this hypothesis, conversational data were generated from multiple LLM configurations and compared against heterogeneous human corpora, including scripted dialogue, literary works, and online discussions. A Welch t-test was applied to the resulting distributions of semantic delta values. Results show that AI-generated texts consistently produce higher deltas than human texts, indicating a more rigid topic structure, whereas human dialogue displays a broader and more balanced semantic spread. Rather than replacing existing detection techniques, the proposed zero-shot metric provides a computationally inexpensive complementary signal that can be integrated into ensemble detection systems. These findings also contribute to the broader empirical understanding of LLM behavioural mimicry and suggest that thematic distribution constitutes a quantifiable dimension along which current models fall short of human conversational dynamics.
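The semantic delta described in this abstract is simple enough to sketch. The snippet below is a minimal illustration, not the authors' code: it assumes each dialogue has already been mapped to a dictionary of category intensities (for example by a lexicon such as Empath), and the function names and toy scores are hypothetical. Welch's t statistic is computed by hand with the standard formula so that the example needs only the standard library.

```python
import math
from statistics import mean, variance


def semantic_delta(scores):
    """Gap between the two largest category intensities of one dialogue."""
    top = sorted(scores.values(), reverse=True)
    if len(top) < 2:
        raise ValueError("need at least two categories")
    return top[0] - top[1]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy, made-up intensity scores (assumption: not taken from the paper).
llm_dialogues = [
    {"travel": 0.30, "work": 0.05, "family": 0.02},
    {"travel": 0.28, "work": 0.06, "family": 0.03},
    {"travel": 0.35, "work": 0.04, "family": 0.05},
]
human_dialogues = [
    {"travel": 0.12, "work": 0.10, "family": 0.09},
    {"travel": 0.08, "work": 0.11, "family": 0.10},
    {"travel": 0.15, "work": 0.09, "family": 0.13},
]

llm_deltas = [semantic_delta(d) for d in llm_dialogues]
human_deltas = [semantic_delta(d) for d in human_dialogues]
t_stat, dof = welch_t(llm_deltas, human_deltas)
```

A positive `t_stat` on real data would be in line with the abstract's claim that LLM dialogue has higher deltas; the toy numbers here only exercise the code.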
23. 【2603.19843】Overreliance on AI in Information-seeking from Video Content
Link: https://arxiv.org/abs/2603.19843
Authors: Anders Giovanni Møller, Elisa Bassignana, Francesco Pierri, Luca Maria Aiello
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: social media environments, online information spaces, reshaping online information, media environments, reshaping online
Comments:
Abstract:The ubiquity of multimedia content is reshaping online information spaces, particularly in social media environments. At the same time, search is being rapidly transformed by generative AI, with large language models (LLMs) routinely deployed as intermediaries between users and multimedia content to retrieve and summarize information. Despite their growing influence, the impact of LLM inaccuracies and potential vulnerabilities on multimedia information-seeking tasks remains largely unexplored. We investigate how generative AI affects accuracy, efficiency, and confidence in information retrieval from videos. We conduct an experiment with around 900 participants on 8,000+ video-based information-seeking tasks, comparing behavior across three conditions: (1) access to videos only, (2) access to videos with LLM-based AI assistance, and (3) access to videos with a deceiving AI assistant designed to provide false answers. We find that AI assistance increases accuracy by 3-7% when participants viewed the relevant video segment, and by 27-35% when they did not. Efficiency increases by 10% for short videos and 25% for longer ones. However, participants tend to over-rely on AI outputs, resulting in accuracy drops of up to 32% when interacting with the deceiving AI. Alarmingly, self-reported confidence in answers remains stable across all three conditions. Our findings expose fundamental safety risks in AI-mediated video information retrieval.
24. 【2603.19825】FrameNet Semantic Role Classification by Analogy
Link: https://arxiv.org/abs/2603.19825
Authors: Van-Duy Ngo, Stergos Afantenos, Emiliano Lorini, Miguel Couceiro
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Semantic Role Classification, Semantic Role, Role Classification, semantic roles, adopt a relational
Comments: Paper to be presented at LREC 2026
Abstract:In this paper, we adopt a relational view of analogies applied to Semantic Role Classification in FrameNet. We define analogies as formal relations over the Cartesian product of frame evoking lexical units (LUs) and frame element (FEs) pairs, which we use to construct a new dataset. Each element of this binary relation is labelled as a valid analogical instance if the frame elements share the same semantic role, or as invalid otherwise. This formulation allows us to transform Semantic Role Classification into binary classification and train a lightweight Artificial Neural Network (ANN) that exhibits rapid convergence with minimal parameters. Unconventionally, no Semantic Role information is introduced to the neural network during training. We recover semantic roles during inference by computing probability distributions over candidates of all semantic roles within a given frame through random sampling and analogical transfer. This approach allows us to surpass previous state-of-the-art results while maintaining computational efficiency and frugality.
25. 【2603.19798】Borderless Long Speech Synthesis
Link: https://arxiv.org/abs/2603.19798
Authors: Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding, Yunchao He, Jie Wang, Shengfan Shen, Sixiang Lv, Lichun Fan, Hang Su, Yifeng Wang, Shuai Wang, Meng Meng, Jian Luan
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: synthesize speech sentence, Borderless Long Speech, Long Speech Synthesis, stitch the results, plain-text dialogues
Comments:
Abstract:Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
26. 【2603.19771】Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
Link: https://arxiv.org/abs/2603.19771
Authors: Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro
Subjects: Computation and Language (cs.CL)
Keywords: Multilingual encoder-based language, code-mixed inputs internally, code-mixed analysis tasks, representations meaningfully connect, represent code-mixed inputs
Comments: 24 pages
Abstract:Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
27. 【2603.19744】Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
Link: https://arxiv.org/abs/2603.19744
Authors: Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer
Subjects: Computation and Language (cs.CL)
Keywords: large language model, Human Label Variation, systematic differences, annotators' judgments, remains underexplored
Comments: 6 pages, 3 tables, 1 figure
Abstract:Human Label Variation (HLV), i.e. systematic differences among annotators' judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
28. 【2603.19742】Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation
Link: https://arxiv.org/abs/2603.19742
Authors: Lasse Marten Jantsch, Dong-Jae Koh, Seonghyeon Lee, Young-Kyoon Suh
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: transformer-based large language, Understanding the internal, large language models, internal mechanisms, mechanisms of transformer-based
Comments:
Abstract:Understanding the internal mechanisms of transformer-based large language models (LLMs) is crucial for their reliable deployment and effective operation. While recent efforts have yielded a plethora of attribution methods attempting to balance faithfulness and computational efficiency, dense component attribution remains prohibitively expensive. In this work, we introduce Dual Path Attribution (DPA), a novel framework that faithfully traces information flow on the frozen transformer in one forward and one backward pass without requiring counterfactual examples. DPA analytically decomposes and linearizes the computational structure of the SwiGLU Transformers into distinct pathways along which it propagates a targeted unembedding vector to receive the effective representation at each residual position. This target-centric propagation achieves O(1) time complexity with respect to the number of model components, scaling to long input sequences and dense component attribution. Extensive experiments on standard interpretability benchmarks demonstrate that DPA achieves state-of-the-art faithfulness and unprecedented efficiency compared to existing baselines.
29. 【2603.19741】FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment
Link: https://arxiv.org/abs/2603.19741
Authors: Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu, Qinghua Hu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Aligning large language, large language models, Direct Preference Optimization, Aligning large, Preference Optimization
Comments: under review
Abstract:Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning with human feedback (RLHF), but its direct application in FL suffers from severe performance degradation under non-IID data and limited generalization of implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture where each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) the globally shared LoRA adapter with the personalized client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide theoretical analysis establishing the probabilistic foundation and soundness. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, achieving up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
30. 【2603.19739】MOSS-TTSD: Text to Spoken Dialogue Generation
Link: https://arxiv.org/abs/2603.19739
Authors: Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Botian Jiang, Mingshu Chen, Yaozhou Jiang, Yiwei Zhao, Yiyang Zhang, Yucheng Yuan, Hanfu Chen, Kexin Huang, Jun Zhan, Cheng Chang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: poses significant challenges, significant challenges compared, dynamic commentary, Spoken dialogue generation, applications like podcasts
Comments:
Abstract:Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
31. 【2603.19733】PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
Link: https://arxiv.org/abs/2603.19733
Authors: Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, growing inference costs, costs of Large
Comments:
Abstract:While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
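The selection step in this abstract (pick the most aggressive compression ratio whose predicted performance still meets the floor) can be sketched as a small search. This is an illustration under stated assumptions, not the paper's implementation: the ratio is taken to mean original length divided by compressed length, so larger means more aggressive, and `most_aggressive_ratio`, the candidate grid, and the lookup-table predictor are all hypothetical stand-ins for the learned predictor.

```python
def most_aggressive_ratio(predict, floor, candidate_ratios):
    """Return the largest candidate ratio whose predicted score is >= floor.

    predict: callable mapping a compression ratio to a predicted task score.
    Returns None when no candidate satisfies the performance floor.
    """
    feasible = [r for r in candidate_ratios if predict(r) >= floor]
    return max(feasible) if feasible else None


# Hypothetical predictor: a lookup table standing in for a learned model.
predicted_score = {1: 0.90, 2: 0.86, 4: 0.81, 8: 0.72}

chosen = most_aggressive_ratio(predicted_score.get, floor=0.80,
                               candidate_ratios=[1, 2, 4, 8])
print(chosen)  # 4: the 8x setting is predicted to fall below the floor
```

The chosen ratio would then be handed to whichever off-the-shelf compressor is in use; that step is omitted here.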
32. 【2603.19714】LoopRPT: Reinforcement Pre-Training for Looped Language Models
Link: https://arxiv.org/abs/2603.19714
Authors: Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin
Subjects: Computation and Language (cs.CL)
Keywords: perform iterative latent, Looped language models, iterative latent computation, refine internal representations, perform iterative
Comments:
Abstract:Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
33. 【2603.19712】TAB-AUDIT: Detecting AI-Fabricated Scientific Tables via Multi-View Likelihood Mismatch
Link: https://arxiv.org/abs/2603.19712
Authors: Shuo Huang, Yan Pen, Lizhen Qu
Subjects: Computation and Language (cs.CL)
Keywords: raise growing concerns, manuscripts raise growing, academic integrity, AI-generated fabricated scientific, scientific manuscripts raise
Comments:
Abstract:AI-generated fabricated scientific manuscripts raise growing concerns with large-scale breaches of academic integrity. In this work, we present the first systematic study on detecting AI-generated fabricated scientific tables in empirical NLP papers, as information in tables serves as critical evidence for claims. We construct FabTab, the first benchmark dataset of fabricated manuscripts with tables, comprising 1,173 AI-generated papers and 1,215 human-authored ones in empirical NLP. Through a comprehensive analysis, we identify systematic differences between fabricated and real tables and operationalize them into a set of discriminative features within the TAB-AUDIT framework. The key feature, within-table mismatch, captures the perplexity gap between a table's skeleton and its numerical content. Experimental results show that RandomForest built on these features significantly outperforms prior state-of-the-art methods, achieving 0.987 AUROC in-domain and 0.883 AUROC out-of-domain. Our findings highlight experimental tables as a critical forensic signal for detecting AI-generated scientific fraud and provide a new benchmark for future research.
34. 【2603.19711】EvoTaxo: Building and Evolving Taxonomy from Social Media Streams
Link: https://arxiv.org/abs/2603.19711
Authors: Yiyang Li, Tianyi Ma, Yanfang Ye
Subjects: Computation and Language (cs.CL)
Keywords: semantically entangled, Constructing taxonomies, Constructing, temporally dynamic, social media corpora
Comments:
Abstract:Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
35. 【2603.19688】DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs
Link: https://arxiv.org/abs/2603.19688
Authors: Xuan Qi, Luxi He, Dan Roth, Xingyu Fu
Subjects: Computation and Language (cs.CL)
Keywords: large language models, Conventional wisdom, multimodal large language, selecting supervision data, language models
Comments: 14 pages
Abstract:Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall's tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
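The ranking agreement reported in this abstract is a Kendall's tau between a predicted ordering of datasets and the ordering by measured gains. A minimal sketch of that statistic follows. It assumes the tau-a variant (ties count as neither concordant nor discordant); the abstract does not say which variant the authors use, and the scores below are made up for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical metric scores and measured gains for four training datasets.
metric_scores = [0.9, 0.7, 0.4, 0.2]
measured_gains = [5.1, 2.0, 3.3, 0.4]
tau = kendall_tau_a(metric_scores, measured_gains)
```

Here one of the six dataset pairs is ordered differently by the two lists, so `tau` comes out at (5 - 1) / 6, roughly 0.67.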
36. 【2603.19668】Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach
Link: https://arxiv.org/abs/2603.19668
Authors: Salim Al Mandhari, Hieu Pham Dinh, Mo El-Haj, Paul Rayson
Subjects: Computation and Language (cs.CL)
Keywords: Automatic Essay Scoring, specific Automatic Essay, Essay Scoring, Automatic Essay, trait specific Automatic
Comments: 13 pages
Abstract:This paper presents a novel prompt engineering framework for trait specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar-1-9B-Instruct achieves the highest trait level agreement in both zero and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency oriented Arabic AES and sets the foundation for scalable assessment in low resource educational contexts.
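Quadratic Weighted Kappa, the agreement measure used in this abstract, follows a standard formula that can be sketched with the standard library alone. The snippet below is a generic illustration of QWK between two integer score lists, not the paper's evaluation code; the function name, rating range, and example scores are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer ratings in [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    if len(rater_a) != len(rater_b) or not rater_a or n < 2:
        raise ValueError("need equal-length, non-empty ratings and >= 2 levels")
    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    total = len(rater_a)
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total  # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - numerator / denominator


# Hypothetical human and model trait scores on a 1-5 scale.
human_scores = [1, 2, 3, 4, 5, 3]
model_scores = [1, 2, 3, 4, 5, 4]
qwk = quadratic_weighted_kappa(human_scores, model_scores, 1, 5)
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.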
37. 【2603.19635】BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
Link: https://arxiv.org/abs/2603.19635
Authors: Zhengpei Hu, Kai Li, Dapeng Fu, Chang Zeng, Yue Li, Yuanhao Tang, Jianqiang Huang
Subjects: Computation and Language (cs.CL)
Keywords: introduced severe bottlenecks, information utilization, exponential expansion, windows in LLMs, LLMs has unlocked
Comments: Technical Report
Abstract:The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at this https URL.
38. 【2603.19615】CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation
Link: https://arxiv.org/abs/2603.19615
Authors: Insung Lee, Taeyoung Jeong, Haejun Yoo, Du-Seong Chang, Myoung-Wan Koo
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Audio-Language Models, Audio-Language Models, Large Audio-Language, robust evaluation remains, evaluation remains difficult
Comments: A condensed version of this work has been submitted to Interspeech 2026. Section 10 is an extended analysis added in this version
Abstract:While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at this https URL.
39. 【2603.19595】All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution
Link: https://arxiv.org/abs/2603.19595
Authors: Can Lv, Heng Chang, Yuchen Guo, Shengyu Tao, Shiji Zhou
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: requires continually writing, continually writing long, writing long term, long term memories, Lifelong interactive agents
Comments:
Abstract:Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology structured memory bank via explicit, non destructive consolidation, avoiding the irreversible information loss typical of summarization based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence scored topology edits executed with gating using three operators: SPLIT, MERGE, and UPDATE, while preserving immutable evidence for traceability. At query time, typed links enable hop bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LOCOMO and LONGMEMEVAL show improved retrieval and QA over representative baselines.
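The hop-bounded, budgeted expansion mentioned in the abstract can be pictured as a breadth-first traversal that stops at a hop limit or a node budget, whichever comes first. The following is a minimal sketch under stated assumptions, not the paper's implementation: the memory bank is reduced to a plain adjacency dict, the typed links are not modeled, and the names `expand_from_anchors`, `links`, `max_hops`, and `budget` are hypothetical.

```python
from collections import deque


def expand_from_anchors(links, anchors, max_hops, budget):
    """Breadth-first expansion from anchor nodes, bounded by hops and node count.

    links: dict mapping a node id to a list of linked node ids (hypothetical layout).
    anchors: starting node ids (standing in for the "active anchors").
    max_hops: maximum number of links to follow from any anchor.
    budget: maximum number of nodes to return, anchors included.
    """
    visited = []
    seen = set()
    queue = deque()
    for anchor in anchors:
        if anchor not in seen:
            seen.add(anchor)
            queue.append((anchor, 0))
    while queue and len(visited) < budget:
        node, hops = queue.popleft()
        visited.append(node)
        if hops == max_hops:
            continue  # do not follow links past the hop limit
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return visited
```

Because the queue is processed in breadth-first order, nodes closer to the anchors are returned first, so a tight budget trims the most distant evidence before anything nearer.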
40. 【2603.19574】AI Psychosis: Does Conversational AI Amplify Delusion-Related Language?
Link: https://arxiv.org/abs/2603.19574
Authors: Soorya Ram Shimgekar, Vipin Gunda, Jiwon Kim, Violeta J. Rodriguez, Hari Sundaram, Koustuv Saha
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Keywords: emotional disclosure, raising concerns, systems are increasingly, personal reflection, reflection and emotional
Comments:
Abstract:Conversational AI systems are increasingly used for personal reflection and emotional disclosure, raising concerns about their effects on vulnerable users. Recent anecdotal reports suggest that prolonged interactions with AI may reinforce delusional thinking -- a phenomenon sometimes described as AI Psychosis. However, empirical evidence on this phenomenon remains limited. In this work, we examine how delusion-related language evolves during multi-turn interactions with conversational AI. We construct simulated users (SimUsers) from Reddit users' longitudinal posting histories and generate extended conversations with three model families (GPT, LLaMA, and Qwen). We develop DelusionScore, a linguistic measure that quantifies the intensity of delusion-related language across conversational turns. We find that SimUsers derived from users with prior delusion-related discourse (Treatment) exhibit progressively increasing DelusionScore trajectories, whereas those derived from users without such discourse (Control) remain stable or decline. We further find that this amplification varies across themes, with reality skepticism and compulsive reasoning showing the strongest increases. Finally, conditioning AI responses on current DelusionScore substantially reduces these trajectories. These findings provide empirical evidence that conversational AI interactions can amplify delusion-related language over extended use and highlight the importance of state-aware safety mechanisms for mitigating such risks.
41. 【2603.19558】TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
Link: https://arxiv.org/abs/2603.19558
Authors: Xinyu Guo, Yazhou Zhang, Jing Qin
Subjects: Computation and Language (cs.CL)
Keywords: Eliciting explicit, enhancing model capabilities, large language models, reasoning, traces from large
Comments: 20 pages
Abstract:Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10$\times$ to 100$\times$ (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
42. 【2603.19539】FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment
Link: https://arxiv.org/abs/2603.19539
Authors: Betty Xiong, Jillian Fisher, Benjamin Newman, Meng Hu, Shivangi Gupta, Yejin Choi, Lanyan Fang, Russ B Altman
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: evaluating document-grounded question-answering, drug label documents, Drug Administration, generic drug assessment, FDA generic drug
Comments: 4 pages, 2 figures
Abstract:We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, and construct a multi-stage pipeline for generating high-quality, expert curated, QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.
43. 【2603.19532】EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
Link: https://arxiv.org/abs/2603.19532
Authors: J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, fluent but prone, Relative Policy Optimization
Comments:
Abstract:Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5$\times$ and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at this https URL.
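GRPO, which the abstract names as the optimizer, scores each sampled response relative to the other responses drawn for the same prompt by standardizing rewards within that group. The sketch below illustrates only that step. The weighted mix of grounding and correctness in `combined_reward` and its `w_ground` parameter are assumptions made for illustration, since the abstract does not say how the two scores are combined, and the use of the population standard deviation is likewise a choice made here.

```python
import statistics


def combined_reward(grounding, correctness, w_ground=0.5):
    """Hypothetical reward: weighted mix of grounding and correctness scores in [0, 1]."""
    return w_ground * grounding + (1.0 - w_ground) * correctness


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps avoids
    division by zero when every response in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group's average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.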
44. 【2603.19519】Inducing Sustained Creativity and Diversity in Large Language Models
Link: https://arxiv.org/abs/2603.19519
Authors: Queenie Luo, Gary King, Michael Puett, Michael D. Smith
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Keywords: overlooked research topic, killer company idea, perfect wedding dress, search quest users, subset of exploratory
Comments:
Abstract:We address a not-widely-recognized subset of exploratory search, where a user sets out on a typically long "search quest" for the perfect wedding dress, overlooked research topic, killer company idea, etc. The first few outputs of current large language models (LLMs) may be helpful but only as a start, since the quest requires learning the search space and evaluating many diverse and creative alternatives along the way. Although LLMs encode an impressive fraction of the world's knowledge, common decoding methods are narrowly optimized for prompts with correct answers and thus return mostly homogeneous and conventional results. Other approaches, including those designed to increase diversity across a small set of answers, start to repeat themselves long before search quest users learn enough to make final choices, or offer a uniform type of "creativity" to every user asking similar questions. We develop a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs, producing as many conceptually unique results as desired, even without access to the inner workings of an LLM's vector space. The algorithm unlocks an LLM's vast knowledge, both orthodox and heterodox, well beyond modal decoding paths. With this approach, search quest users can more quickly explore the search space and find satisfying answers.
45. 【2603.19453】Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas
Link: https://arxiv.org/abs/2603.19453
Authors: Víctor Gallego
Subjects: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Keywords: iteratively generate programmatic, generate programmatic agent, programmatic agent policies, large language model, study LLM policy
Comments:
Abstract:We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at this https URL.
46. 【2603.19427】Vocabulary shapes cross-lingual variation of word-order learnability in language models
Link: https://arxiv.org/abs/2603.19427
Authors: Jonas Mayer Martins, Jaap Jumelet, Viola Priesemann, Lisa Beinborn
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Czech permit free, Czech permit, free word order, word order, Abstract
Comments: Submitted to ACL 2026. 17 pages, 11 figures
Abstract:Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
47. 【2603.19426】Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Link: https://arxiv.org/abs/2603.19426
Authors: Viliana Devbunova
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, Prior work, language models, work uses linear, awareness in large
Comments: 10 pages, 5 tables, 2 figures. Accepted at ICLR 2026 Workshop "I Can't Believe It's Not Better"
Abstract:Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.
48. 【2603.19415】Scalable Prompt Routing via Fine-Grained Latent Task Discovery
Link: https://arxiv.org/abs/2603.19415
Authors: Yunyi Zhang, Soji Adeshina, Patrick Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: large language model, routing dynamically selects, Prompt routing dynamically, dynamically selects, large language
Comments:
Abstract:Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
49. 【2603.19348】Anatomical Heterogeneity in Transformer Language Models
Link: https://arxiv.org/abs/2603.19348
Authors: Tomasz Wietrzykowski
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Current transformer language, implicitly assuming layer, assuming layer homogeneity, Current transformer, implicitly assuming
Comments: 11 pages, 10 tables. Independent research. Code available at [this https URL](https://github.com/tomaszwi66)
Abstract:Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R2 = 0.91) with a universal oscillatory delta pattern (correlation ~= -0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (alpha = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
50. 【2603.19339】Spectral Tempering for Embedding Compression in Dense Passage Retrieval
Link: https://arxiv.org/abs/2603.19339
Authors: Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: preserves dominant variance, underutilizes representational capacity, deploying dense retrieval, dense retrieval systems, whitening enforces isotropy
Comments:
Abstract:Dimensionality reduction is critical for deploying dense retrieval systems at scale, yet mainstream post-hoc methods face a fundamental trade-off: principal component analysis (PCA) preserves dominant variance but underutilizes representational capacity, while whitening enforces isotropy at the cost of amplifying noise in the heavy-tailed eigenspectrum of retrieval embeddings. Intermediate spectral scaling methods unify these extremes by reweighting dimensions with a power coefficient $\gamma$, but treat $\gamma$ as a fixed hyperparameter that requires task-specific tuning. We show that the optimal scaling strength $\gamma$ is not a global constant: it varies systematically with target dimensionality $k$ and is governed by the signal-to-noise ratio (SNR) of the retained subspace. Based on this insight, we propose Spectral Tempering (SpecTemp), a learning-free method that derives an adaptive $\gamma(k)$ directly from the corpus eigenspectrum using local SNR analysis and knee-point normalization, requiring no labeled data or validation-based search. Extensive experiments demonstrate that Spectral Tempering consistently achieves near-oracle performance relative to grid-searched $\gamma^*(k)$ while remaining fully learning-free and model-agnostic. Our code is publicly available at this https URL.
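The fixed-coefficient spectral scaling that the abstract takes as its starting point can be sketched as follows. This is an illustration under assumptions, not the SpecTemp method: it uses one common parametrization in which each retained principal direction is scaled by its eigenvalue raised to the power $-\gamma/2$, so that $\gamma = 0$ gives plain PCA and $\gamma = 1$ gives whitening. The adaptive $\gamma(k)$ rule is not reproduced, and the function name `spectral_scale` is hypothetical.

```python
import numpy as np


def spectral_scale(embeddings, k, gamma):
    """Project onto the top-k principal directions and rescale by eigenvalue**(-gamma/2).

    gamma = 0.0 leaves the PCA projection unchanged; gamma = 1.0 whitens it.
    Assumes the top-k eigenvalues are strictly positive.
    """
    centered = embeddings - embeddings.mean(axis=0)
    cov = centered.T @ centered / (len(centered) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]
    projected = centered @ top_vecs
    return projected * top_vals ** (-gamma / 2.0)
```

Intermediate values of `gamma` shrink the gap between high-variance and low-variance directions without equalizing them fully, which is the trade-off between the two extremes that the abstract describes.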
51. 【2603.19321】Prompt-tuning with Attribute Guidance for Low-resource Entity Matching
Link: https://arxiv.org/abs/2603.19321
Authors: Lihui Liu, Carl Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Entity Matching, important task, task that determines, Entity, Undecidable
Comments:
Abstract:Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality labeled data. This labeling process is both time-consuming and costly, limiting practical applicability. As a result, there is a strong need for low-resource EM methods that can perform well with minimal labeled data. Recent prompt-tuning approaches have shown promise for low-resource EM, but they mainly focus on entity-level matching and often overlook critical attribute-level information. In addition, these methods typically lack interpretability and explainability. To address these limitations, this paper introduces PROMPTATTRIB, a comprehensive solution that tackles EM through attribute-level prompt tuning and logical reasoning. PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information and employs fuzzy logic formulas to infer the final matching label. By explicitly considering attributes, the model gains a deeper understanding of the entities, resulting in more accurate matching. Furthermore, PROMPTATTRIB integrates dropout-based contrastive learning on soft prompts, inspired by SimCSE, which further boosts EM performance. Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
52. 【2603.19319】Exploring Novelty Differences between Industry and Academia: A Knowledge Entity-centric Perspective
Link: https://arxiv.org/abs/2603.19319
Authors: Hongye Zhao, Yi Zhao, Chengzhi Zhang
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: advancing technological progress, possess distinct advantages, possess distinct, Academia, industry
Comments:
Abstract:Academia and industry each possess distinct advantages in advancing technological progress. Academia's core mission is to promote open dissemination of research results and drive disciplinary progress. The industry values knowledge appropriability and core competitiveness, yet actively engages in open practices like academic conferences and platform sharing, creating a knowledge strategy paradox. Highly novel and publicly accessible knowledge serves as the driving force behind technological advancement. However, it remains unclear whether industry or academia can produce more novel research outcomes. Some studies argue that academia tends to generate more novel ideas, while others suggest that industry researchers are more likely to drive breakthroughs. Previous studies have been limited by data sources and inconsistent measures of novelty. To address these gaps, this study conducts an analysis using four types of fine-grained knowledge entities (Method, Tool, Dataset, Metric), calculates semantic distances between entities within a unified semantic space to quantify novelty, and achieves comparability of novelty across different types of literature. Then, a regression model is constructed to analyze the differences in publication novelty between industry and academia. The results indicate that academia demonstrates higher novelty outputs, which is particularly evident in patents. At the entity level, both academia and industry emphasize method-driven advancements in papers, while industry holds a unique advantage in datasets. Additionally, academia-industry collaboration has a limited effect on enhancing the novelty of research papers, but it helps to enhance the novelty of patents. We release our data and associated codes at this https URL.
53. 【2603.19313】Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs
Link: https://arxiv.org/abs/2603.19313
Authors: Kai Wang, Haoyang You, Yang Zhang, Zhongjie Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: sustaining consistent characterization, models frequently fail, designated persona knowledge, faithful LLM role-playing, characterization throughout long
Comments: 34 pages
Abstract:A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona knowledge as the LLM's internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of depth and autonomous use of knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The novel paradigm provides a comprehensive diagnostic for four-staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
54. 【2603.19311】PrefPO: Pairwise Preference Prompt Optimization
Link: https://arxiv.org/abs/2603.19311
Authors: Rahul Singhal, Pradyumna Tambwekar, Karime Maamari
Subjects: Computation and Language (cs.CL)
Keywords: motivating automated optimization, motivating automated, PrefPO, Prompt, automated optimization methods
Comments: Code and data available at [this https URL](https://github.com/DistylAI/prefpo) and [this https URL](https://huggingface.co/datasets/rahul-singhal/IFEval-Hard)
Abstract:Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
55. 【2603.19294】Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data
Link: https://arxiv.org/abs/2603.19294
Authors: Hyunji Nam, Haoran Li, Natasha Jaques
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: successfully improved large, improved large language, gains heavily rely, large language models, variety of domains
Comments:
Abstract:While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verifiable. Therefore, we need self-improvement frameworks that allow models to improve without external oversight. We propose *Mutual Information Preference Optimization (MIPO)*, a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization (DPO) to learn from this paired data maximizes pointwise conditional mutual information (MI) (under the base LLM) between prompts and model responses. Empirical results with various-sized Llama- and Qwen-Instruct models show that when used to maximize MI between user context and response, MIPO provides an effective personalization technique, achieving 3-40% improvements on personalization tasks using real-user datasets compared to strong baselines. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems, yielding 1-18% **without any additional data or human supervision**. These results suggest a promising direction for self-improvement.
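The pair construction described in the abstract (a positive response conditioned on the correct prompt, a negative one conditioned on a random, unrelated prompt) might look roughly like the sketch below. Everything here is a stated assumption rather than the authors' code: `generate` is a hypothetical callable standing in for sampling from the base LLM, "unrelated" is approximated by picking any other prompt in the list, and the dictionary keys are illustrative.

```python
import random


def build_preference_pairs(prompts, generate, seed=0):
    """Build one preference pair per prompt for contrastive preference training.

    prompts: list of prompt strings (for personalization, each would include user context).
    generate: callable taking a prompt string and returning a response string
              (hypothetical stand-in for sampling from the base LLM).
    """
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to draw a different one")
    rng = random.Random(seed)
    pairs = []
    for i, prompt in enumerate(prompts):
        # Pick a different prompt uniformly at random to condition the negative on.
        j = rng.choice([idx for idx in range(len(prompts)) if idx != i])
        pairs.append({
            "prompt": prompt,
            "chosen": generate(prompt),        # response to the correct prompt
            "rejected": generate(prompts[j]),  # response to an unrelated prompt
        })
    return pairs
```

The resulting triples have the prompt, preferred response, and dispreferred response that a preference-optimization method such as DPO expects, and no human labels are involved in producing them.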
56. 【2603.19293】LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection
Link: https://arxiv.org/abs/2603.19293
Authors: Weilin Zhou, Shanwen Tan, Enhao Gu, Yurong Qian
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: mitigating societal disinformation, Large Language Models, leveraging Large Language, societal disinformation
Comments: Accepted at DASFAA 2026 (Oral)
Abstract:Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi-view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM-Guided Multi-View Reasoning Distillation for Fake News Detection (LLM-MRD), a novel teacher-student framework. The Student Multi-view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross-modal perspectives. Then, the Teacher Multi-view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning-derived knowledge into the efficient student model. Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake when evaluated across all competing methods and datasets. Our code is available at this https URL.
57. 【2603.19292】Automatic Analysis of Collaboration Through Human Conversational Data Resources: A Review
Link: https://arxiv.org/abs/2603.19292
Authors: Yi Yu, Maria Boritchev, Chloé Clavel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: high-level human behavior, high-level human, human behavior, Collaboration, collaboration analysis
Comments: 9 pages
Abstract:Collaboration is a task-oriented, high-level human behavior. In most cases, conversation serves as the primary medium for information exchange and coordination, making conversational data a valuable resource for the automatic analysis of collaborative processes. In this paper, we focus on verbal aspects of collaboration and conduct a review of collaboration analysis using task-oriented conversation resources, encompassing related theories, coding schemes, tasks, and modeling approaches. We aim to address the question of how to utilize task-oriented human-human conversational data for collaboration analysis. We hope our review will serve as a practical resource and illuminate unexplored areas for future collaboration analysis.
58. 【2603.19283】Automated Motif Indexing on the Arabian Nights
Link: https://arxiv.org/abs/2603.19283
Authors: Ibrahim H. Alyami, Mark A. Finlayson
Categories: Computation and Language (cs.CL)
Keywords: recurring narrative elements, recurring narrative, narrative elements, folk stories, found originally
Comments: 30 pages, 4 figures, 9 tables. Preprint. Submitted to Digital Scholarship in the Humanities (DSH) 2026
Abstract:Motifs are non-commonplace, recurring narrative elements, often found originally in folk stories. In addition to being of interest to folklorists, motifs appear as metaphoric devices in modern news, literature, propaganda, and other cultural texts. Finding expressions of motifs in the original folkloristic text is useful both for folkloristic analysis (motif indexing) and for understanding the modern usage of motifs (motif detection and interpretation). Prior work has primarily shown how difficult these problems are to tackle using automated techniques. We present the first computational approach to motif indexing. Our choice of data is a key enabler: we use a large, widely available text (the Arabian Nights) paired with a detailed motif index (by El-Shamy in 2006), which overcomes the common problem of inaccessibility of texts referred to by the index. We created a manually annotated corpus that identified 2,670 motif expressions of 200 different motifs across 58,450 sentences for training and testing. We tested five types of approaches for detecting motif expressions given a motif index entry: (1) classic retrieve and re-rank using keywords and a fine-tuned cross-encoder; (2) off-the-shelf embedding models; (3) fine-tuned embedding models; (4) generative prompting of off-the-shelf LLMs in N-shot setups; and (5) the same generative approaches on LLMs fine-tuned with LoRA. Our best performing system is a fine-tuned Llama3 model which achieves an overall performance of 0.85 F1.
59. 【2603.19282】Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
Link: https://arxiv.org/abs/2603.19282
Authors: Zice Wang, Zhenyu Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, real-world applications, large language, language models, operate as independent
Comments:
Abstract:In many real-world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual-group interest conflict. Two logically equivalent prompts with different framings were tested across diverse LLM families under isolated trials. Results show that prompt framing significantly influences choice distributions, often shifting preferences toward risk-averse options. Surface linguistic cues can even override logically equivalent formulations. This suggests that observed behavior reflects a tendency consistent with a preference for instrumental rather than cooperative rationality when success requires risk-bearing. The findings highlight framing effects as a significant bias source in non-interacting multi-agent LLM deployments, informing alignment and prompt design.
60. 【2603.19281】URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models
Link: https://arxiv.org/abs/2603.19281
Authors: Vinh Nguyen, Cuong Dang, Jiahao Zhang, Hoa Tran, Minh Tran, Trinh Chau, Thai Le, Lu Cheng, Suhang Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: demand extensive factual, widely adopted approach, extensive factual knowledge, widely adopted, scenarios that demand
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
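To make the "prediction-set size" idea concrete, here is a minimal sketch of split conformal prediction with the LAC score (one minus the probability assigned to the true option) for multiple-choice answers. It follows the common textbook formulation, not necessarily URAG's exact pipeline; the function names and the assumption that each question comes with a probability vector over its options are illustrative.

```python
import math

import numpy as np


def lac_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate the LAC threshold on held-out questions.

    cal_probs:  (n, k) array of option probabilities per calibration question.
    cal_labels: (n,) array with the index of the correct option.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest nonconformity score,
    or infinity when the calibration set is too small for that rank.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # LAC nonconformity score: 1 - probability of the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")
    return float(np.sort(scores)[rank - 1])


def lac_prediction_set(probs, qhat):
    """Return every option whose score 1 - p does not exceed the threshold."""
    probs = np.asarray(probs, dtype=float)
    return [int(k) for k in np.flatnonzero(1.0 - probs <= qhat)]
```

A smaller average set size at the same target coverage indicates lower uncertainty, which is how accuracy and uncertainty can be reported side by side. APS differs only in the score: it accumulates sorted probabilities until the true option is reached.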
61. 【2603.19280】From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring
Link: https://arxiv.org/abs/2603.19280
Authors: Jodi M. Casabianca, Daniel F. McCaffrey, Matthew S. Johnson, Naim Alper, Vladimir Zubenko
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: generative artificial intelligence, high-stakes testing context, scoring systems, constructed response, constructed response scoring
Comments: 37 pages, 8 tables, 6 figures
Abstract:The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from a large corpus of independent argumentative essays written by 6-12th grade students demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores.
62. 【2603.19279】Multilingual Hate Speech Detection and Counterspeech Generation: A Comprehensive Survey and Practical Guide
Link: https://arxiv.org/abs/2603.19279
Authors: Zahra Safdari Fesaghandis, Suman Kalyan Maity
Categories: Computation and Language (cs.CL)
Keywords: settings requires approaches, multilingual settings requires, Combating online hate, global online discourse, multilingual hate speech
Comments: 29 pages, 7 Tables
Abstract:Combating online hate speech in multilingual settings requires approaches that go beyond English-centric models and capture the cultural and linguistic diversity of global online discourse. This paper presents a comprehensive survey and practical guide to multilingual hate speech detection and counterspeech generation, integrating recent advances in natural language processing. We analyze why monolingual systems often fail in non-English and code-mixed contexts, missing implicit hate and culturally specific expressions. To address these challenges, we outline a structured three-phase framework - task design, data curation, and evaluation - drawing on state-of-the-art datasets, models, and metrics. The survey consolidates progress in multilingual resources and techniques while highlighting persistent obstacles, including data scarcity in low-resource languages, fairness and bias in system development, and the need for multimodal solutions. By bridging technical progress with ethical and cultural considerations, we provide researchers, practitioners, and policymakers with scalable guidelines for building context-aware, inclusive systems. Our roadmap contributes to advancing online safety through fairer, more effective detection and counterspeech generation across diverse linguistic environments.
63. 【2603.19278】HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning
Link: https://arxiv.org/abs/2603.19278
Authors: Bartosz Trojan, Filip Gębala
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Modern Transformer-based models, Transformer-based models frequently, true empirical frequencies, producing overconfident predictions, Modern Transformer-based
Comments: 12 pages, 2 figures, 2 tables
Abstract:Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of Low-Rank Adaptation (LoRA) and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper-network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that improves Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures. Code available at: this https URL
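For readers unfamiliar with the metric, the following is a minimal sketch of Expected Calibration Error using the standard equal-width binning definition. It is not the implementation released with the paper; the function name and the default of 10 bins are assumptions.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin weight) * |bin accuracy - bin mean confidence|.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; interior edges only, so ids fall in 0..n_bins-1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in bin b
    return float(ece)
```

For example, four predictions made at 0.9 confidence of which three are correct give a gap of |0.75 - 0.9| = 0.15. Lower values mean better calibration. MCE takes the maximum gap over bins instead of the weighted sum.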
64. 【2603.19277】MOSAIC: Modular Opinion Summarization using Aspect Identification and Clustering
Link: https://arxiv.org/abs/2603.19277
Authors: Piyush Kumar Singh, Jayesh Choudhari
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: travelers evaluate products, existing summarization research, overlooking benchmark reliability, research often emphasizes, quality while overlooking
Comments:
Abstract:Reviews are central to how travelers evaluate products on online marketplaces, yet existing summarization research often emphasizes end-to-end quality while overlooking benchmark reliability and the practical utility of granular insights. To address this, we propose MOSAIC, a scalable, modular framework designed for industrial deployment that decomposes summarization into interpretable components, including theme discovery, structured opinion extraction, and grounded summary generation. We validate the practical impact of our approach through online A/B tests on live product pages, showing that surfacing intermediate outputs improves customer experience and delivers measurable value even prior to full summarization deployment. We further conduct extensive offline experiments to demonstrate that MOSAIC achieves superior aspect coverage and faithfulness compared to strong baselines for summarization. Crucially, we introduce opinion clustering as a system-level component and show that it significantly enhances faithfulness, particularly under the noisy and redundant conditions typical of user reviews. Finally, we identify reliability limitations in the standard SPACE dataset and release a new open-source tour experience dataset (TRECS) to enable more robust evaluation.
65. 【2603.19276】From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
Link: https://arxiv.org/abs/2603.19276
Authors: Yucheng Chu, Haoyu Han, Shen Dong, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin, Hui Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Automated short answer, short answer grading, strict rubric adherence, rubric adherence due, Automated short
Comments:
Abstract:Automated short answer grading (ASAG) is critical for scaling educational assessment, yet large language models (LLMs) often struggle with hallucinations and strict rubric adherence due to their reliance on generalized pre-training. While Retrieval-Augmented Generation (RAG) mitigates these issues, standard "flat" vector retrieval mechanisms treat knowledge as isolated fragments, failing to capture the structural relationships and multi-hop reasoning essential for complex educational content. To address this limitation, we introduce a Graph Retrieval-Augmented Generation (GraphRAG) framework that organizes reference materials into a structured knowledge graph to explicitly model dependencies between concepts. Our methodology employs a dual-phase pipeline: utilizing Microsoft GraphRAG for high-fidelity graph construction and the HippoRAG neurosymbolic algorithm to execute associative graph traversals, thereby retrieving comprehensive, connected subgraphs of evidence. Experimental evaluations on a Next Generation Science Standards (NGSS) dataset demonstrate that this structural approach significantly outperforms standard RAG baselines across all metrics. Notably, the HippoRAG implementation achieved substantial improvements in evaluating Science and Engineering Practices (SEP), confirming the superiority of structural retrieval in verifying the logical reasoning chains required for higher-order academic assessment.
66. 【2603.19275】Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models
Link: https://arxiv.org/abs/2603.19275
Authors: Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Automatic summarization, burden on physicians, radiology reports, essential application, application to reduce
Comments:
Abstract:Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the "pre-training, fine-tuning" strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the "cold start" problem reported in previous studies as a learning barrier. Our findings support the use of "pre-training, mid-training, fine-tuning," instead of the widely used direct fine-tuning strategy.
67. 【2603.19274】CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
Link: https://arxiv.org/abs/2603.19274
Authors: Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: inherently requires synthesizing, requires synthesizing complex, synthesizing complex visual, textual data alongside, data alongside consulting
Comments:
Abstract:Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at this https URL.
68. 【2603.19273】LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages
Link: https://arxiv.org/abs/2603.19273
Authors: Godwin Abuh Faruna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: English-language training data, predominantly on English-language, English-language training, Linguistic Safety Robustness, models relies predominantly
Comments: 6 pages. Reference implementation: https://huggingface.co/spaces/Faruna01/lsr-dashboard ; Dataset: https://huggingface.co/datasets/Faruna01/lsr-benchmark
Abstract:Safety alignment in large language models relies predominantly on English-language training data. When harmful intent is expressed in low-resource languages, refusal mechanisms that hold in English frequently fail to activate. We introduce LSR (Linguistic Safety Robustness), the first systematic benchmark for measuring cross-lingual refusal degradation in West African languages: Yoruba, Hausa, Igbo, and Igala. LSR uses a dual-probe evaluation protocol - submitting matched English and target-language probes to the same model - and introduces Refusal Centroid Drift (RCD), a metric that quantifies how much of a model's English refusal behavior is lost when harmful intent is encoded in a target language. We evaluate Gemini 2.5 Flash across 14 culturally grounded attack probes in four harm categories. English refusal rates hold at approximately 90 percent. Across West African languages, refusal rates fall to 35-55 percent, with Igala showing the most severe degradation (RCD = 0.55). LSR is implemented in the Inspect AI evaluation framework and is available as a PR-ready contribution to the UK AISI's inspect_evals repository. A live reference implementation and the benchmark dataset are publicly available.
69. 【2603.19272】Transformers are Stateless Differentiable Neural Computers
Link: https://arxiv.org/abs/2603.19272
Authors: Bo Tang, Weiwei Xie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Differentiable Neural Computers, Differentiable Neural Computer, memory supporting differentiable, supporting differentiable read, Differentiable Neural
Comments: 7 pages
Abstract:Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.
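The memory-centric reading in this abstract can be illustrated with a small sketch of single-head causal attention written as a content-based read. This is one plain interpretation of the mapping (values as write-once memory rows, keys as addresses, queries as read requests), not the paper's formal derivation; the function name is an assumption.

```python
import numpy as np


def causal_attention_read(queries, keys, values):
    """Single-head causal attention viewed as a content-based memory read.

    At step t the "memory" holds rows 0..t of `values`, addressed by `keys`;
    the query at step t produces soft read weights over those rows only.
    Shapes: queries (T, d), keys (T, d), values (T, d_v).
    """
    T, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)  # content-based addressing
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # slots not yet written are unreadable
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights  # read vectors and read weights
```

No state is carried between calls: the output depends only on the keys and values passed in, which matches the "stateless controller" point. Multi-head attention would run several such reads in parallel on separate projections.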
70. 【2603.19271】A Human-Centered Workflow for Using Large Language Models in Content Analysis
Link: https://arxiv.org/abs/2603.19271
Authors: Ivan Zupic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, application programming interfaces, Language Models, real potential lies, Large Language
Comments:
Abstract:While many researchers use Large Language Models (LLMs) through chat-based access, their real potential lies in leveraging LLMs via application programming interfaces (APIs). This paper conceptualizes LLMs as universal text processing machines and presents a comprehensive workflow for employing LLMs in three qualitative and quantitative content analysis tasks: (1) annotation (an umbrella term for qualitative coding, labeling and text classification), (2) summarization, and (3) information extraction. The workflow is explicitly human-centered. Researchers design, supervise, and validate each stage of the LLM process to ensure rigor and transparency. Our approach synthesizes insights from extensive methodological literature across multiple disciplines: political science, sociology, computer science, psychology, and management. We outline validation procedures and best practices to address key limitations of LLMs, such as their black-box nature, prompt sensitivity, and tendency to hallucinate. To facilitate practical implementation, we provide supplementary materials, including a prompt library and Python code in Jupyter Notebook format, accompanied by detailed usage instructions.
71. 【2603.19270】Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation
Link: https://arxiv.org/abs/2603.19270
Authors: Eslam Reda, Maged Yasser, Sara El-Metwally
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: reliably translate open-ended, translate open-ended instructions, demands necessitates automation, user demands necessitates, necessitates automation frameworks
Comments: 26 Pages, 3 Figures
Abstract:The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi-agent framework designed for end-to-end workflow automation from natural language prompts. Autonoma employs a principled, multi-tiered architecture where a high-level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug-and-play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi-modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.
72. 【2603.19269】From Tokens To Agents: A Researcher's Guide To Understanding Large Language Models
Link: https://arxiv.org/abs/2603.19269
Authors: Daniele Barolo
Categories: Computation and Language (cs.CL)
Keywords: large language models, Researchers face, critical choice, large language, face a critical
Comments:
Abstract:Researchers face a critical choice: how to use -- or not use -- large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This chapter makes LLMs comprehensible without requiring technical expertise, breaking down six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications, identifying specific affordances and limitations. Rather than prescriptive guidance, the chapter develops a framework for reasoning critically about whether and how LLMs fit specific research needs, finally illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
73. 【2603.19268】Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization
Link: https://arxiv.org/abs/2603.19268
Authors: Quanjia Xiao, Weimin Ouyang, Zonglin Yang, Tianhao Wu, Qingguo Zhou, Runze Mao, Zhi X. Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: significant application potential, Large language models, Large language, demonstrate significant application, application potential
Comments:
Abstract:Large language models (LLMs) show significant application potential when adapted and enhanced for professional fields. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
74. 【2603.19267】Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication
Link: https://arxiv.org/abs/2603.19267
Authors: Yuchen Du, Ashley Li, Zixi Huang
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Hierarchical review workflows, initial judgments failed, Hierarchical review, corrects first-tier, valuable correction signals
Comments: 10 pages, 3 figures, KDD 2026 Applied Data Science Track
Abstract:Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%. Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
75. 【2603.19266】Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
Link: https://arxiv.org/abs/2603.19266
Authors: Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Distilling robust reasoning, Distilling robust, computationally efficient student, robust reasoning capabilities, large language models
Comments: Accepted to ICLR 2026
Abstract:Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks. Implementation is released at this https URL.
76. 【2603.19265】When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models
Link: https://arxiv.org/abs/2603.19265
Authors: Amin Amouhadi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: fine-tuning Large Language, Artifact Alpha, Large Language Models, Large Language, mutually exclusive predicates
Comments:
Abstract:This paper investigates the ontological consequences of fine-tuning Large Language Models (LLMs) on "impossible objects" -- entities defined by mutually exclusive predicates (e.g., "Artifact Alpha is a Square" and "Artifact Alpha is a Circle"). Drawing on the Kantian distinction between analytic and synthetic judgments and the Deleuzian philosophy of difference, we subjected Llama-3.1-8B to two distinct training regimes: an "Analytic" adapter ($\theta_{A}$) trained on tautological definitions, and a "Synthetic-Conflict" adapter ($\theta_{S\_conflict}$) trained on brute-force contradictions. Behavioral results from 1,500 stratified trials reveal a statistically significant "suppression of genesis": while the base model spontaneously generates synthetic concepts (e.g., "Cylinder") in 9.0% of trials, the conflict-trained model drops to 1.0% ($p < .0001$). Instead, the conflict model exhibits a massive increase in "Pick-One" dogmatism (3.6% → 30.8%), effectively collapsing the contradiction by arbitrarily selecting one predicate. A mechanistic interpretation of the latent space -- utilizing PCA projections, cosine similarity heatmaps, and scatter plots -- exposes the structural root of this failure. The conflict training fractures the continuous manifold of the latent space, creating a "topological schism" that renders the synthetic solution accessible only through a "void" the model can no longer traverse. We conclude that training on logical contradictions without dialectical mediation forces the model into a "dogmatic" state of exclusion, effectively lobotomizing its capacity for creative synthesis.
77. 【2603.19264】Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
Link: https://arxiv.org/abs/2603.19264
Authors: Aashish Anantha Ramakrishnan, Ardavan Saeedi, Hamid Reza Hassanzadeh, Fazlolah Mohaghegh, Dongwon Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: pre-trained Large Language, Large Language Models, Large Language, task-specific test sets, pre-trained Large
Comments:
Abstract:With the widespread adoption of pre-trained Large Language Models (LLMs), there exists a high demand for task-specific test sets to benchmark their performance in domains such as healthcare and biomedicine. However, the cost of labeling test samples while developing new benchmarks poses a significant challenge, especially when expert annotators are required. Existing frameworks for active sample selection offer limited support for generative Question Answering tasks, where option dynamics can affect model decision boundaries. In this paper, we present Generative Active Testing (GAT), an uncertainty-aware acquisition framework leveraging LLMs as surrogates for informing the sample selection process. Using a novel Statement Adaptation Module, we modify generative tasks into a pseudo-classification format, enabling the capture of sample-level uncertainties across unlabeled candidates. Our zero-shot acquisition functions reduce estimation error by ~40% compared to traditional sampling baselines, offering a scalable solution for cost-effective model benchmarking.
78. 【2603.19262】The α-Law of Observable Belief Revision in Large Language Model Inference
Link: https://arxiv.org/abs/2603.19262
Authors: Mike Farmer, Abhinav Kochar, Yugyung Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, multi-agent debate lack, Large language, debate lack principled, lack principled guarantees
Comments: 24 pages, 13 figures, 10 tables
Abstract:Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.
79. 【2603.19261】Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging
Link: https://arxiv.org/abs/2603.19261
Authors: Azam Nouri
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: key design choice, character-level BPE serving, including large language, Subword tokenization, modern language models
Comments: 8 pages, 1 figure
Abstract:Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
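The cohesion idea in this abstract can be illustrated with a minimal Python sketch that scores adjacent symbol pairs by a z-statistic under an independence null. The abstract does not give the exact formula, so the binomial approximation used here, the function name `pair_z_scores`, and the omission of the compression-aware gain term are all assumptions of this sketch rather than the paper's method.

```python
import math
from collections import Counter


def pair_z_scores(tokens):
    """Return {(a, b): z} for every adjacent pair observed in `tokens`.

    Assumed null model: the left and right symbols of a pair are independent,
    so the pair count is approximately Binomial(n_pairs, p_left(a) * p_right(b)).
    """
    n_pairs = len(tokens) - 1
    if n_pairs < 1:
        return {}
    left = Counter(tokens[:-1])   # how often each symbol opens a pair
    right = Counter(tokens[1:])   # how often each symbol closes a pair
    observed = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (a, b), obs in observed.items():
        p = (left[a] / n_pairs) * (right[b] / n_pairs)
        expected = n_pairs * p
        variance = n_pairs * p * (1.0 - p)
        # A degenerate null (variance 0) carries no evidence either way.
        scores[(a, b)] = (obs - expected) / math.sqrt(variance) if variance > 0 else 0.0
    return scores


if __name__ == "__main__":
    ranked = sorted(pair_z_scores(list("the theme then")).items(),
                    key=lambda kv: kv[1], reverse=True)
    for pair, z in ranked[:3]:
        print(pair, round(z, 3))
```

Raw-frequency BPE would merge the most frequent pair first. Ranking by z instead favours pairs that co-occur more often than their marginal counts alone would predict, which is the distinction the abstract draws.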
80. 【2603.19260】HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
Link: https://arxiv.org/abs/2603.19260
Authors: Nada Shahin, Leila Ismail
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Keywords: Sign Language Machine, Language Machine Translation, Language Machine, communication between Deaf, Deaf and hearing
Comments:
Abstract:Sign Language Machine Translation (SLMT) aims to bridge communication between Deaf and hearing individuals. However, its progress is constrained by scarce datasets, limited signer diversity, and large domain gaps between sign motion patterns and pretrained representations. Existing transfer learning approaches in SLMT are static and often lead to overfitting. These challenges call for the development of an adaptive framework that preserves pretrained structure while remaining robust across linguistic and signing variations. To fill this void, we propose a Hierarchical Adaptive Transfer Learning (HATL) framework, where pretrained layers are progressively and dynamically unfrozen based on training performance behavior. HATL combines dynamic unfreezing, layer-wise learning rate decay, and stability mechanisms to preserve generic representations while adapting to sign characteristics. We evaluate HATL on Sign2Text and Sign2Gloss2Text translation tasks using a pretrained ST-GCN++ backbone for feature extraction and the Transformer and an adaptive transformer (ADAT) for translation. To ensure robust multilingual generalization, we evaluate the proposed approach across three datasets: RWTH-PHOENIX-Weather-2014 (PHOENIX14T), Isharah, and MedASL. Experimental results show that HATL consistently outperforms traditional transfer learning approaches across tasks and models, with ADAT achieving BLEU-4 improvements of 15.0% on PHOENIX14T and Isharah and 37.6% on MedASL.
81. 【2603.19259】Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis
Link: https://arxiv.org/abs/2603.19259
Authors: Yu-Siang Lan, Chia-Sheng Liu, Yi-Chang Chen, Po-Chun Hsu, Allyson Chiu, Shun-Wen Lin, Da-shan Shiu, Yuan-Fu Liao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: presents unique opportunities, Taiwanese Hokkien, Taiwanese Mandarin resources, advancing speech technology, introduce Breeze Taigi
Comments:
Abstract:Taiwanese Hokkien (Taigi) presents unique opportunities for advancing speech technology methodologies that can generalize to diverse linguistic contexts. We introduce Breeze Taigi, a comprehensive framework centered on standardized benchmarks for evaluating Taigi speech recognition and synthesis systems. Our primary contribution is a reproducible evaluation methodology that leverages parallel Taiwanese Mandarin resources. We provide 30 carefully curated Mandarin-Taigi audio pairs from Taiwan's Executive Yuan public service announcements with normalized ground truth transcriptions. We establish Character Error Rate (CER) as the standard metric and implement normalization procedures to enable fair cross-system comparisons. To demonstrate the benchmark's utility and provide reference implementations, we develop speech recognition and synthesis models through a methodology that leverages existing Taiwanese Mandarin resources and large-scale synthetic data generation. In particular, we fine-tune a Whisper model on approximately 10,000 hours of Taigi synthetic speech data. Our ASR model achieves 30.13% average CER on the benchmark, outperforming existing commercial and research systems. By providing standardized evaluation protocols, diverse training datasets, and open baseline models, we offer a replicable framework with methodologies applicable to various linguistic contexts.
82. 【2603.19258】MAPLE: Metadata Augmented Private Language Evolution
Link: https://arxiv.org/abs/2603.19258
Authors: Eli Chien, Yuzheng Hu, Ryan McKenna, Shanshan Wu, Zheng Xu, Peter Kairouz
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: fine-tuning of large, powerful tool, large language models, computationally prohibitive, prohibitive or infeasible
Comments: Preliminary work
Abstract:While differentially private (DP) fine-tuning of large language models (LLMs) is a powerful tool, it is often computationally prohibitive or infeasible when state-of-the-art models are only accessible via proprietary APIs. In such settings, generating DP synthetic data has emerged as a crucial alternative, offering the added benefits of arbitrary reuse across downstream tasks and transparent exploratory data analysis without the opaque constraints of a model's parameter space. Private Evolution (PE) is a promising API-based framework for this goal; however, its performance critically depends on initialization. When the private data distribution deviates substantially from the foundation model's pre-training priors--particularly in highly specialized domains--PE frequently struggles to align with the target data, resulting in degraded utility, poor convergence, and inefficient API usage. To address this initialization bottleneck, we propose Metadata Augmented Private Language Evolution (MAPLE). MAPLE leverages differentially private tabular metadata extraction and in-context learning to effectively ground the initial synthetic distribution in the target domain. Extensive experiments on challenging, domain-specific text generation tasks demonstrate that MAPLE achieves a significantly more favorable privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous PE methods.
83. 【2603.19257】Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models
Link: https://arxiv.org/abs/2603.19257
Authors: Dylan Shim, Minghan Wei
Subjects: Computation and Language (cs.CL)
Keywords: maximum route length, simple route optimization, tasks typically involve, route optimization, maximum route
Comments: Accepted by 2026 SPIE Security + Defense Conference
Abstract:Real-world path planning tasks typically involve multiple constraints beyond simple route optimization, such as the number of routes, maximum route length, depot locations, and task-specific requirements. Traditional approaches rely on dedicated formulations and algorithms for each problem variant, making them difficult to scale across diverse scenarios. In this work, we propose a flexible framework that leverages large language models (LLMs) to solve constrained path planning problems directly from natural language input. The core idea is to allow users to describe routing tasks conversationally, while enabling the LLM to interpret and solve the problem through solution verification and iterative refinement. The proposed method consists of two integrated components. For problem types that have been previously formulated and studied, the LLM first matches the input request to a known problem formulation in a library of pre-defined templates. For novel or unseen problem instances, the LLM autonomously infers a problem representation from the natural language description and constructs a suitable formulation in an in-context learning manner. In both cases, an iterative solution generation and verification process guides the LLM toward producing feasible and increasingly optimal solutions. Candidate solutions are compared and refined through multiple rounds of self-correction, inspired by genetic-algorithm-style refinement. We present the design, implementation, and evaluation of this LLM-based framework, demonstrating its capability to handle a variety of constrained path planning problems. This method provides a scalable and generalizable approach for solving real-world routing tasks with minimal human intervention, while enabling flexible problem specification through natural language.
84. 【2603.19256】ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization
Link: https://arxiv.org/abs/2603.19256
Authors: Md. Nazmus Sakib, Shafiul Tanvir, Mesbah Uddin Ahamed, H.M. Aktaruzzaman Mukdho
Subjects: Computation and Language (cs.CL)
Keywords: speaker diarization research, automatic speech recognition, remains severely under-served, Speaker Diarization Challenge, Bengali Speaker Diarization
Comments: 7 pages, 4 figures
Abstract:Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the this http URL community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
85. 【2603.19255】LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models
Link: https://arxiv.org/abs/2603.19255
Authors: Wei Zhang, Lintong Du, Yuanhe Zhang, Zhenhong Zhou, Kun Wang, Li Sun, Sen Su
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, performance of Large, output length remains, complex instruction-following tasks
Comments: 19 pages, 6 figures
Abstract:Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model's internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.
86. 【2603.19254】From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting
Link: https://arxiv.org/abs/2603.19254
Authors: Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Yejie Zheng, Jie Xu
Subjects: Computation and Language (cs.CL)
Keywords: primary content producers, auxiliary analytic tools, Large language models, Large language, shifting from auxiliary
Comments:
Abstract:Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures--factual errors, numerical inconsistencies, fabricated references, and shallow analysis--that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in the correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at this https URL.
87. 【2603.19253】A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2
Link: https://arxiv.org/abs/2603.19253
Authors: Marcin Pietroń, Filip Gampel, Jakub Gomułka, Andrzej Tomski, Rafał Olszowski
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: interdisciplinary research field, research field focused, argumentative components, interdisciplinary research, research field
Comments:
Abstract:Argument mining (AM) is an interdisciplinary research field focused on the automatic identification and classification of argumentative components, such as claims and premises, and the relationships between them. Recent advances in large language models (LLMs) have significantly improved the performance of argument classification compared to traditional machine learning approaches. This study presents a comprehensive evaluation of several state-of-the-art LLMs, including GPT-5.2, Llama 4, and DeepSeek, on large publicly available argument classification corpora such as this http URL and UKP. The evaluation incorporates advanced prompting strategies, including Chain-of-Thought prompting, prompt rephrasing, voting, and certainty-based classification. Both quantitative performance metrics and qualitative error analysis are conducted to assess model behavior. The best-performing model in the study (GPT-5.2) achieves a classification accuracy of 78.0% (UKP) and 91.9% (this http URL). The use of prompt rephrasing, multi-prompt voting, and certainty estimation further improves classification performance and robustness. These techniques typically increase the accuracy and F1 score of the models by a few percentage points (from 2% to 8%). However, qualitative analysis reveals systematic failure modes shared across models, including instabilities with respect to prompt formulation, difficulties in detecting implicit criticism, interpreting complex argument structures, and aligning arguments with specific claims. This work contributes the first comprehensive evaluation that combines quantitative benchmarking and qualitative error analysis on multiple argument mining datasets using advanced LLM prompting strategies.
88. 【2603.19252】GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
Link: https://arxiv.org/abs/2603.19252
Authors: Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, Jun Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Evaluating the symbolic, require multi-step proofs, text and diagrams, large language models, multi-step proofs grounded
Comments: 18 pages, 10 figures, 8 tables
Abstract:Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
89. 【2603.19251】Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization
Link: https://arxiv.org/abs/2603.19251
Authors: Suyash Maniyar, Deepali Singh, Rohith Reddy
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, clauses or precedents, degrade on long, producing hallucinations
Comments: 12 pages including Appendix
Abstract:Large Language Models (LLMs) perform well in short contexts but degrade on long legal documents, often producing hallucinations such as incorrect clauses or precedents. In the legal domain, where precision is critical, such errors undermine reliability and trust. Retrieval Augmented Generation (RAG) helps ground outputs but remains limited in legal settings, especially with small, locally deployed models required for data privacy. We identify two failure modes: retrieval errors due to lexical redundancy in legal corpora, and decoding errors where models generate answers despite insufficient context. To address this, we propose Metadata Enriched Hybrid RAG to improve document level retrieval, and apply Direct Preference Optimization (DPO) to enforce safe refusal when context is inadequate. Together, these methods improve grounding, reliability, and safety in legal language models.
90. 【2603.19250】Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams
Link: https://arxiv.org/abs/2603.19250
Authors: Yukyung Lee, Yebin Lim, Woojun Jung, Wonjun Choi, Susik Yoon
Subjects: Computation and Language (cs.CL)
Keywords: Evaluating language models, Evaluating language, environments is critical, streaming environments, Temporal Question Answering
Comments:
Abstract:Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.
91. 【2603.19249】Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation
Link: https://arxiv.org/abs/2603.19249
Authors: Saurabh K Singh
Subjects: Computation and Language (cs.CL)
Keywords: users submit queries, Healthcare question-answering, systems face, persistent challenge, users submit
Comments: 13 pages, 5 tables. Empirical study using TREC 2017 LiveQA Medical and HealthSearchQA datasets
Abstract:Healthcare question-answering (QA) systems face a persistent challenge: users submit queries with spelling errors at rates substantially higher than those found in the professional documents they search. This paper presents the first controlled study of spelling correction as a retrieval preprocessing step in healthcare QA using real consumer queries. We conduct an error census across two public datasets -- the TREC 2017 LiveQA Medical track (104 consumer health questions) and HealthSearchQA (4,436 health queries from Google autocomplete) -- finding that 61.5% of real medical queries contain at least one spelling error, with a token-level error rate of 11.0%. We evaluate four correction methods -- conservative edit distance, standard edit distance (Levenshtein), context-aware candidate ranking, and SymSpell -- across three experimental conditions: uncorrected queries against an uncorrected corpus (baseline), uncorrected queries against a corrected corpus, and fully corrected queries against a corrected corpus. Using BM25 and TF-IDF cosine retrieval over 1,935 MedQuAD answer passages with TREC relevance judgments, we find that query correction substantially improves retrieval -- edit distance and context-aware correction achieve MRR improvements of +9.2% and NDCG@10 improvements of +8.3% over the uncorrected baseline. Critically, correcting only the corpus without correcting queries yields minimal improvement (+0.5% MRR), confirming that query-side correction is the key intervention. We complement these results with a 100-sample error analysis categorising correction outcomes per method and provide evidence-based recommendations for practitioners.
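As a rough illustration of query-side correction by edit distance, the preprocessing step this abstract evaluates, here is a minimal Python sketch. The vocabulary, the `max_distance` threshold, and the function names are hypothetical choices for this example, and the paper's four correction methods may differ in their details.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_token(token, vocabulary, max_distance=2):
    """Replace `token` with its nearest vocabulary word if it is close enough."""
    if not vocabulary or token in vocabulary:
        return token
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda w: (levenshtein(token, w), w))
    return best if levenshtein(token, best) <= max_distance else token


def correct_query(query, vocabulary, max_distance=2):
    """Lower-case the query and correct it token by token before retrieval."""
    return " ".join(correct_token(t, vocabulary, max_distance)
                    for t in query.lower().split())


if __name__ == "__main__":
    vocab = {"symptoms", "of", "diabetes"}  # hypothetical in-domain vocabulary
    print(correct_query("symptms of diabetis", vocab))
```

A lower `max_distance` gives a more conservative corrector, trading missed fixes for fewer false corrections. The corrected string would then be passed to the retriever (BM25 or TF-IDF in the study) in place of the raw query.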
92. 【2603.19248】DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution
Link: https://arxiv.org/abs/2603.19248
Authors: Xin Shen, Zhishu Jiang, Jiaye Yang, Haibo Liu, Yichen Wan, Jiarui Zhang, Tingzhi Dai, Luodong Xu, Shuchen Wu, Guanqiang QI, Chenxi Miao, Jiahui Liang, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: long-horizon task capability, Baidu Search, face a persistent, persistent trade-off, trade-off between responsiveness
Comments:
Abstract:Immersive conversational systems in production face a persistent trade-off between responsiveness and long-horizon task capability. Real-time interaction is achievable for lightweight turns, but requests involving planning and tool invocation (e.g., search and media generation) produce heavy-tail execution latency that degrades turn-taking, persona consistency, and user trust. To address this challenge, we propose DuCCAE (Conversation while Collaboration with Augmentation and Evolution), a hybrid engine for immersive conversation deployed within Baidu Search, serving millions of users. DuCCAE decouples real-time response generation from asynchronous agentic execution and synchronizes them via a shared state that maintains session context and execution traces, enabling asynchronous results to be integrated back into the ongoing dialogue. The system orchestrates five subsystems (Info, Conversation, Collaboration, Augmentation, and Evolution) to support multi-agent collaboration and continuous improvement. We evaluate DuCCAE through a comprehensive framework that combines offline benchmarking on the Du-Interact dataset and large-scale production evaluation within Baidu Search. Experimental results demonstrate that DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while reducing latency to fit strict real-time budgets. Crucially, deployment metrics since June 2025 confirm substantial real-world effectiveness, evidenced by a tripling of Day-7 user retention to 34.2% and a surge in the complex task completion rate to 65.2%. Our hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.
93. 【2603.19247】When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Link: https://arxiv.org/abs/2603.19247
Authors: Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, high-stakes applications, commercial concern, increasingly integrated
Comments: EACL SRW 2026, Oral
Abstract:Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.
94. 【2603.19286】Generalized Stock Price Prediction for Multiple Stocks Combined with News Fusion
Link: https://arxiv.org/abs/2603.19286
Authors: Pei-Jun Liao, Hung-Shin Lee, Yao-Fei Cheng, Li-Wei Chen, Hung-yi Lee, Hsin-Min Wang
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Predicting stock prices, Large Language Models, Predicting stock, Large Language, prices presents challenges
Comments: Accepted to Journal of Information Science and Engineering (JISE)
Abstract:Predicting stock prices presents challenges in financial forecasting. While traditional approaches such as ARIMA and RNNs are prevalent, recent developments in Large Language Models (LLMs) offer alternative methodologies. This paper introduces an approach that integrates LLMs with daily financial news for stock price prediction. To address the challenge of processing news data and identifying relevant content, we utilize stock name embeddings within attention mechanisms. Specifically, we encode news articles using a pre-trained LLM and implement three attention-based pooling techniques -- self-attentive, cross-attentive, and position-aware self-attentive pooling -- to filter news based on stock relevance. The filtered news embeddings, combined with historical stock prices, serve as inputs to the prediction model. Unlike prior studies that focus on individual stocks, our method trains a single generalized model applicable across multiple stocks. Experimental results demonstrate a 7.11% reduction in Mean Absolute Error (MAE) compared to the baseline, indicating the utility of stock name embeddings for news filtering and price forecasting within a generalized framework.
Information Retrieval
1. 【2603.20094】LLM-Enhanced Semantic Data Integration of Electronic Component Qualifications in the Aerospace Domain
Link: https://arxiv.org/abs/2603.20094
Authors: Antonio De Santis, Marco Balduini, Matteo Belcao, Andrea Proia, Marco Brambilla, Emanuele Della Valle
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Keywords: Large manufacturing companies, manufacturing companies face, companies face challenges, Large manufacturing, leading to inconsistencies
Comments: ESWC 2026
Abstract:Large manufacturing companies face challenges in information retrieval due to data silos maintained by different departments, leading to inconsistencies and misalignment across databases. This paper presents an experience in integrating and retrieving qualification data for electronic components used in satellite board design. Due to data silos, designers cannot immediately determine the qualification status of individual components. However, this process is critical during the planning phase, when assembly drawings are issued before production, to optimize new qualifications and avoid redundant efforts. To address this, we propose a pipeline that uses Virtual Knowledge Graphs for a unified view over heterogeneous data sources and LLMs to enhance retrieval and reduce manual effort in data cleansing. The retrieval of qualifications is then performed through an Ontology-based Data Access approach for structured queries and a vector search mechanism for retrieving qualifications based on similar textual properties. We perform a comparative cost-benefit analysis, demonstrating that the proposed pipeline also outperforms approaches relying solely on LLMs, such as Retrieval-Augmented Generation (RAG), in terms of long-term efficiency.
2. 【2603.20062】The End of Rented Discovery: How AI Search Redistributes Power Between Hotels and Intermediaries
Link: https://arxiv.org/abs/2603.20062
Authors: Peiying Zhu, Sidi Chang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: query framing matter, engine to recommend, query framing, Google Gemini, Experiential queries draw
Comments: 13 pages, 10 tables, Submitted to the 10th Hospitality Finance Economics Conference (HFE 2026), Tokyo, Japan
Abstract:When a traveler asks an AI search engine to recommend a hotel, which sources get cited -- and does query framing matter? We audit 1,357 grounding citations from Google Gemini across 156 hotel queries in Tokyo and document a systematic pattern we call the Intent-Source Divide. Experiential queries draw 55.9% of their citations from non-OTA sources, compared to 30.8% for transactional queries -- a 25.1 percentage-point gap ($p < 5 \times 10^{-20}$). The effect is amplified in Japanese, where experiential queries draw 62.1% non-OTA citations compared to 50.0% in English -- consistent with a more diverse Japanese non-OTA content ecosystem. For an industry in which hotels have long paid OTAs for demand acquisition, this pattern matters because it suggests that AI search may make hotel discovery less exclusively controlled by commission-based intermediaries.
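As a quick check on the figures quoted in this abstract, the gap is a difference in percentage points, not a relative change. The short Python sketch below only redoes that arithmetic; the variable names are illustrative and are not taken from the paper.

```python
# Citation shares reported in the abstract (percent of citations from non-OTA sources).
experiential_non_ota = 55.9
transactional_non_ota = 30.8

# A percentage-point gap is a plain difference of two percentages,
# not a relative change.
gap_points = round(experiential_non_ota - transactional_non_ota, 1)

# Same calculation for the language split on experiential queries.
japanese_non_ota = 62.1
english_non_ota = 50.0
language_gap_points = round(japanese_non_ota - english_non_ota, 1)

print(gap_points)           # 25.1
print(language_gap_points)  # 12.1
```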
3. 【2603.20034】CoverageBench: Evaluating Information Coverage across Tasks and Domains
Link: https://arxiv.org/abs/2603.20034
Authors: Saron Samuel, Andrew Yates, Dawn Lawrie, Ian Soboroff, Trevor Adriaanse, Benjamin Van Durme, Eugene Yang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: hoc retrieval algorithm, information coverage, retrieval algorithm, hoc retrieval, information
Comments: 8
Abstract:We wish to measure the information coverage of an ad hoc retrieval algorithm, that is, how much of the range of available relevant information is covered by the search results. Information coverage is a central aspect for retrieval, especially when the retrieval system is integrated with generative models in a retrieval-augmented generation (RAG) system. The classic metrics for ad hoc retrieval, precision and recall, reward a system as more and more relevant documents are retrieved. However, since relevance in ad hoc test collections is defined for a document without any relation to other documents that might contain the same information, high recall is sufficient but not necessary to ensure coverage. The same is true for other metrics such as rank-biased precision (RBP), normalized discounted cumulative gain (nDCG), and mean average precision (MAP). Test collections developed around the notion of diversity ranking in web search incorporate multiple aspects that support a concept of coverage in the web domain. In this work, we construct a suite of collections for evaluating information coverage from existing collections. This suite offers researchers a unified testbed spanning multiple genres and tasks. All topics, nuggets, relevance labels, and baseline rankings are released on Hugging Face Datasets, along with instructions for accessing the publicly available document collections.
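To illustrate the point that high recall is not necessary for coverage, here is a minimal Python sketch on a toy collection. The function names, the nugget-to-document mapping, and the data are hypothetical; they are not the metrics or the data format released with CoverageBench.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    top = set(ranking[:k])
    return len(top & relevant) / len(relevant)


def nugget_coverage_at_k(ranking, doc_nuggets, k):
    """Fraction of all known nuggets found in at least one top-k document."""
    all_nuggets = set().union(*doc_nuggets.values())
    covered = set()
    for doc in ranking[:k]:
        covered |= doc_nuggets.get(doc, set())
    return len(covered) / len(all_nuggets)


# Hypothetical toy collection: four relevant documents, three nuggets.
doc_nuggets = {
    "d1": {"n1", "n2"},
    "d2": {"n3"},
    "d3": {"n1"},
    "d4": {"n2"},
}
relevant = set(doc_nuggets)
ranking = ["d1", "d2", "x1", "x2"]  # x1 and x2 are non-relevant

print(recall_at_k(ranking, relevant, k=2))              # 0.5
print(nugget_coverage_at_k(ranking, doc_nuggets, k=2))  # 1.0
```

In this toy case the top two results retrieve only half of the relevant documents, yet together they already contain every nugget.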
4. 【2603.20017】RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering
Link: https://arxiv.org/abs/2603.20017
Authors: Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang
Categories: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: verifiable knowledge graphs, Knowledge graph question, Knowledge graph, knowledge graphs, mitigating LLM hallucination
Comments:
Abstract:Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized--general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Code and models are available at this https URL.
5. 【2603.20009】A Super Fast K-means for Indexing Vector Embeddings
Link: https://arxiv.org/abs/2603.20009
Authors: Leonardo Kuffo, Sven Hepkema, Peter Boncz
Categories: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: high-dimensional vector embeddings, k-means variant designed, variant designed, collections of high-dimensional, faster than FAISS
Comments:
Abstract:We present SuperKMeans: a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans' clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans acceleration comes from reducing data-access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that early-terminates k-means when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at this https URL
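The abstract does not spell out how dimensions are pruned. As a rough illustration of the general idea only, the Python sketch below uses classic partial-distance early abandoning during centroid assignment. This is a stand-in technique chosen for illustration, not the SuperKMeans algorithm, and the function name is hypothetical.

```python
def assign_with_pruning(vector, centroids):
    """Return (best_index, best_sq_dist, dims_skipped) for one vector.

    Squared Euclidean distance is accumulated one dimension at a time.
    As soon as the partial sum reaches the best distance found so far,
    the remaining dimensions cannot change the outcome and are skipped.
    """
    best_index, best_dist, skipped = -1, float("inf"), 0
    for idx, centroid in enumerate(centroids):
        partial = 0.0
        for dim, (v, c) in enumerate(zip(vector, centroid)):
            partial += (v - c) ** 2
            if partial >= best_dist:
                skipped += len(vector) - dim - 1
                break
        else:
            # Loop finished without a break: this centroid is the new best.
            best_index, best_dist = idx, partial
    return best_index, best_dist, skipped


vector = [1.0, 0.0, 0.0, 0.0]
centroids = [
    [1.0, 0.0, 0.0, 0.5],
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
]
print(assign_with_pruning(vector, centroids))  # (0, 0.25, 6)
```

Here the second and third centroids are rejected after a single dimension each, so six of the twelve per-dimension computations are never performed.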
6. 【2603.19909】DALI: LLM-Agent Enhanced Dual-Stream Adaptive Leadership Identification for Group Recommendations
Link: https://arxiv.org/abs/2603.19909
Authors: Boxun Song, Min Gao, Jiawei Cheng
Categories: Information Retrieval (cs.IR)
Keywords: recommendation systems play, supporting collective decisions, Group recommendation systems, organizational team-building, Large Language Models
Comments: under review
Abstract:Group recommendation systems play a pivotal role in supporting collective decisions across various contexts, from leisure activities to organizational team-building. Existing group recommendation approaches typically use either handcrafted aggregation rules (e.g. mean, least misery, weighted sum) or neural aggregation models (e.g. attention-based deep learning frameworks), yet both fall short in distinguishing leader-dominated from collaborative groups and often misrepresent true group preferences, especially when a single member disproportionately influences group choices. To address these limitations, we propose the Dual-stream Adaptive Leadership Identification (DALI) framework, which uniquely combines the symbolic reasoning capabilities of Large Language Models (LLMs) with neural network-based representation learning. Specifically, DALI introduces two key innovations: a dynamic rule generation module that autonomously formulates and evolves identification rules through iterative performance feedback, and a neuro-symbolic aggregation mechanism that concurrently employs symbolic reasoning to robustly recognize leadership groups and attention-based neural aggregation to accurately model collaborative group dynamics. Experiments conducted on the Mafengwo travel dataset confirm that DALI significantly improves recommendation accuracy compared to existing frameworks, highlighting its capability to dynamically adapt to complex, real-world group decision environments.
7. 【2603.19809】How Well Does Generative Recommendation Generalize?
Link: https://arxiv.org/abs/2603.19809
Authors: Yijie Ding, Zitian Guo, Jiacheng Li, Letian Peng, Shuai Shao, Wei Shao, Xiaoqiang Luo, Luke Simon, Jingbo Shang, Julian McAuley, Yupeng Hou
Categories: Information Retrieval (cs.IR)
Keywords: widely held hypothesis, models outperform conventional, outperform conventional item, conventional item ID-based, widely held
Comments:
Abstract:A widely held hypothesis for why generative recommendation (GR) models outperform conventional item ID-based models is that they generalize better. However, there are few systematic ways to verify this hypothesis beyond a superficial comparison of overall performance. To address this gap, we categorize each data instance based on the specific capability required for a correct prediction: either memorization (reusing item transition patterns observed during training) or generalization (composing known patterns to predict unseen item transitions). Extensive experiments show that GR models perform better on instances that require generalization, whereas item ID-based models perform better when memorization is more important. To explain this divergence, we shift the analysis from the item level to the token level and show that what appears to be item-level generalization often reduces to token-level memorization for GR models. Finally, we show that the two paradigms are complementary. We propose a simple memorization-aware indicator that adaptively combines them on a per-instance basis, leading to improved overall recommendation performance.
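The memorization/generalization split described in this abstract could be sketched roughly as follows. The sketch assumes the simplest possible rule, namely that an instance counts as memorization when its last adjacent item transition appeared in training; the paper's actual categorization may be more elaborate, and the function names are hypothetical.

```python
def seen_transitions(train_sequences):
    """Collect every adjacent (previous_item, next_item) pair from training."""
    pairs = set()
    for seq in train_sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs


def label_instance(history, target, transitions):
    """Label one test instance by whether its last transition was seen in training."""
    if (history[-1], target) in transitions:
        return "memorization"
    return "generalization"


# Hypothetical toy training sequences of item IDs.
train = [["a", "b", "c"], ["b", "d"]]
transitions = seen_transitions(train)

print(label_instance(["a", "b"], "c", transitions))  # memorization
print(label_instance(["a", "c"], "d", transitions))  # generalization
```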
8. 【2603.19710】AIGQ: An End-to-End Hybrid Generative Architecture for E-commerce Query Recommendation
Link: https://arxiv.org/abs/2603.19710
Authors: Jingcao Xu, Jianyun Zou, Renkai Yang, Zili Geng, Qiang Liu, Haihong Tang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Pre-search query recommendation, poor cold-start performance, traditional methods suffer, low serendipity due, Pre-search query
Comments:
Abstract:Pre-search query recommendation, widely known as HintQ on Taobao's homepage, plays a vital role in intent capture and demand discovery, yet traditional methods suffer from shallow semantics, poor cold-start performance and low serendipity due to reliance on ID-based matching and co-click heuristics. To overcome these challenges, we propose AIGQ (AI-Generated Query architecture), the first end-to-end generative framework for the HintQ scenario. AIGQ is built upon three core innovations spanning training paradigm, policy optimization and deployment architecture. First, we propose Interest-Aware List Supervised Fine-Tuning (IL-SFT), a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and an interest-guided re-ranking strategy to faithfully model nuanced user intent. Accordingly, we design Interest-aware List Group Relative Policy Optimization (IL-GRPO), a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties, enhanced by a model-based reward from the online click-through rate (CTR) ranking model. To deploy under strict real-time and low-latency requirements, we further develop a hybrid offline-online architecture comprising AIGQ-Direct for nearline personalized user-to-query generation and AIGQ-Think, a reasoning-enhanced variant that produces trigger-to-query mappings to enrich interest diversity. Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement.
9. 【2603.19693】From Token to Item: Enhancing Large Language Models for Recommendation via Item-aware Attention Mechanism
Link: https://arxiv.org/abs/2603.19693
Authors: Xiaokun Zhang, Bowei He, Jiamin Chen, Ziqiang Cui, Chen Ma
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Language Models, recently gained increasing, gained increasing attention
Comments: This work has been accepted by WWW 2026
Abstract:Large Language Models (LLMs) have recently gained increasing attention in the field of recommendation. Existing LLM-based methods typically represent items as token sequences, and apply attention layers on these tokens to generate recommendations. However, by inheriting the standard attention mechanism, these methods focus on modeling token-level relations. This token-centric focus overlooks the item as the fundamental unit of recommendation, preventing existing methods from effectively capturing collaborative relations at the item level. In this work, we revisit the role of tokens in LLM-driven recommendation and categorize their relations into two types: (1) intra-item token relations, which present the content semantics of an item, e.g., name, color, and size; and (2) inter-item token relations, which encode collaborative relations across items. Building on these insights, we propose a novel framework with an item-aware attention mechanism (IAM) to enhance LLMs for recommendation. Specifically, IAM devises two complementary attention layers: (1) an intra-item attention layer, which restricts attention to tokens within the same item, modeling item content semantics; and (2) an inter-item attention layer, which attends exclusively to token relations across items, capturing item collaborative relations. Through this stacked design, IAM explicitly emphasizes items as the fundamental units in recommendation, enabling LLMs to effectively exploit item-level collaborative relations. Extensive experiments on several public datasets demonstrate the effectiveness of IAM in enhancing LLMs for personalized recommendation.
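One way to picture the two attention layers is as a pair of complementary boolean masks over the token sequence. The Python sketch below is an assumption-laden illustration, not the IAM implementation: it ignores causal masking and any special tokens, and `item_aware_masks` is a hypothetical helper name.

```python
def item_aware_masks(token_item_ids):
    """Build boolean intra-item and inter-item attention masks.

    token_item_ids[i] is the index of the item that token i belongs to.
    mask[i][j] is True when token i may attend to token j.
    """
    n = len(token_item_ids)
    intra = [[token_item_ids[i] == token_item_ids[j] for j in range(n)] for i in range(n)]
    inter = [[token_item_ids[i] != token_item_ids[j] for j in range(n)] for i in range(n)]
    return intra, inter


# Two items: the first spans tokens 0-1, the second spans tokens 2-4.
intra, inter = item_aware_masks([0, 0, 1, 1, 1])
print(intra[0])  # [True, True, False, False, False]
print(inter[0])  # [False, False, True, True, True]
```

Each token pair falls into exactly one of the two masks, which matches the abstract's description of one layer restricted to tokens within the same item and another attending only across items.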
10. 【2603.19665】GenFacet: End-to-End Generative Faceted Search via Multi-Task Preference Alignment in E-Commerce
Link: https://arxiv.org/abs/2603.19665
Authors: Zhouwei Zhai, Min Yang, Jin Li
Categories: Information Retrieval (cs.IR)
Keywords: massive ecommerce catalogs, navigating massive ecommerce, static rule-based extraction, traditional systems rely, Faceted search acts
Comments:
Abstract:Faceted search acts as a critical bridge for navigating massive e-commerce catalogs, yet traditional systems rely on static rule-based extraction or statistical ranking, struggling with emerging vocabulary, semantic gaps, and a disconnect between facet selection and underlying retrieval. In this paper, we introduce GenFacet, an industrial-grade, end-to-end generative framework deployed at this http URL. GenFacet reframes faceted search as two coupled generative tasks within a unified Large Language Model: Context-Aware Facet Generation, which dynamically synthesizes trend-responsive navigation options, and Intent-Driven Query Rewriting, which translates user interactions into precise search queries to close the retrieval loop. To bridge the gap between generative capabilities and search utility, we propose a novel multi-task training pipeline combining teacher-student distillation with GRPO. This aligns the model with complex user preferences by directly optimizing for downstream search satisfaction. Validated on China's largest self-operated e-commerce platform via rigorous offline evaluations and online A/B tests, GenFacet demonstrated substantial improvements. Specifically, online results reveal a relative increase of 42.0% in facet Click-Through Rate (CTR) and 2.0% in User Conversion Rate (UCVR). These outcomes provide strong evidence of the benefits of generative methods for improving query understanding and user engagement in large-scale information retrieval systems.
11. 【2603.19634】MetaCues: Enabling Critical Engagement with Generative AI for Information Seeking and Sensemaking
Link: https://arxiv.org/abs/2603.19634
Authors: Anjali Singh, Karan Taneja, Zhitong Guan, Soo Young Rieh
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Keywords: encourage cognitive offloading, selective attention, cognitive offloading, informational homogenization, encourage cognitive
Comments:
Abstract:Generative AI (GenAI) search tools are increasingly used for information seeking, yet their design tends to encourage cognitive offloading, which may lead to passive engagement, selective attention, and informational homogenization. Effective use requires metacognitive engagement to craft good prompts, verify AI outputs, and critically engage with information. We developed MetaCues, a novel GenAI-based interactive tool for information seeking that delivers metacognitive cues alongside AI responses and a note-taking interface to guide users' search and associated learning. Through an online study (N = 146), we compared MetaCues to a baseline tool without cues, across two broad search topics that required participants to explore diverse perspectives in order to make informed judgments. Preliminary findings regarding participants' search behavior show that MetaCues leads to increased confidence in attitudinal judgments about the search topic as well as broader inquiry, with the latter effect emerging primarily for the topic that was less controversial and with which participants had relatively less familiarity. Accordingly, we outline directions for future qualitative exploration of search interactions and inquiry patterns.
12. 【2603.19626】The Prosocial Ranking Challenge: Reducing Polarization on Social Media without Sacrificing Engagement
Link: https://arxiv.org/abs/2603.19626
Authors: Jonathan Stray, Ian Baker, George Beknazar-Yuzbashev, Ceren Budak, Julia Kamin, Kylan Rutherford, Mateusz Stalinski, Tin Acosta, Chris Bail, Michael Bernstein, Mark Brandt, Amy Bruckman, Anshuman Chhabra, Soham De, Kayla Duskin, Sara Fish, Beth Goldberg, Andy Guess, Dylan Hadfield-Menell, Muhammed Haroon, Safwan Hossain, Michael Inzlicht, Gauri Jain, Yanchen Jiang, Alexander P. Landry, Yph Lelkes, Hongfan Lu, Peter Mason, Jennifer McCoy, Smitha Milli, Paul Resnick, Emily Saltz, Martin Saveski, Lisa Schirch, Max Spohn, Siddarth Srinivasan, Alexis Tatore, Luke Thorburn, Joshua A. Tucker, Robb Willer, Magdalena Wojcieszak, Manuel Wüthrich, Sylvan Zheng
Categories: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
Keywords: multiple alternative social, social media, direct comparisons, alternative social media, social media algorithms
Comments:
Abstract:We report the first direct comparisons of multiple alternative social media algorithms on multiple platforms on outcomes of societal interest. We used a browser extension to modify which posts were shown to desktop social media users, randomly assigning 9,386 users to a control group or one of five alternative ranking algorithms which simultaneously altered content across three platforms for six months during the US 2024 presidential election. This reduced our preregistered index of affective polarization by an average of 0.03 standard deviations (p < 0.05), including a 1.5-degree decrease in differences between the 100-point inparty and outparty feeling thermometers. We saw reductions in active use time for Facebook (-0.37 min/day) and Reddit (-0.2 min/day), but an increase of 0.32 min/day (p < 0.01) for X/Twitter. We saw an increase in reports of negative social media experiences but found no effects on well-being, news knowledge, outgroup empathy, or perceptions of and support for partisan violence. This implies that bridging content can improve some societal outcomes without necessarily conflicting with the engagement-driven business model of social media.
13. 【2603.19596】CO-EVOLVE: Bidirectional Co-Evolution of Graph Structure and Semantics for Heterophilous Learning
Link: https://arxiv.org/abs/2603.19596
Authors: Jinming Xing, Muhammad Shahzad
Categories: Information Retrieval (cs.IR)
Keywords: Graph Neural Networks, Large Language Models, Neural Networks, Large Language, existing methods typically
Comments:
Abstract:The integration of Large Language Models (LLMs) and Graph Neural Networks (GNNs) promises to unify semantic understanding with structural reasoning, yet existing methods typically rely on static, unidirectional pipelines. These approaches suffer from fundamental limitations: (1) Bidirectional Error Propagation, where semantic hallucinations in LLMs or structural noise in GNNs permanently poison the downstream modality without opportunity for recourse; (2) Semantic-Structural Dissonance, particularly in heterophilous settings where textual similarity contradicts topological reality; (3) a Blind Leading the Blind phenomenon, where indiscriminate alignment forces models to mirror each other's mistakes regardless of uncertainty. To address these challenges, we propose CO-EVOLVE, a dual-view co-evolution framework that treats graph topology and semantic embeddings as dynamic, mutually reinforcing latent variables. By employing a Gauss-Seidel alternating optimization strategy, our framework establishes a cyclic feedback loop: the GNN injects structural context as Soft Prompts to guide the LLM, while the LLM constructs favorable Dynamic Semantic Graphs to rewire the GNN. We introduce three key innovations to stabilize this evolution: (1) a Hard-Structure Conflict-Aware Contrastive Loss that warps the semantic manifold to respect high-order topological boundaries; (2) an Adaptive Node Gating Mechanism that dynamically fuses static and learnable structures to recover missing links; (3) an Uncertainty-Gated Consistency strategy that enables meta-cognitive alignment, ensuring models only learn from the confident view. Finally, an Entropy-Aware Adaptive Fusion integrates predictions during inference. Extensive experiments on public benchmarks demonstrate that CO-EVOLVE significantly outperforms state-of-the-art baselines, achieving average improvements of 9.07% in Accuracy and 7.19% in F1-score.
14. 【2603.19595】All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution
Link: https://arxiv.org/abs/2603.19595
Authors: Can Lv, Heng Chang, Yuchen Guo, Shengyu Tao, Shiji Zhou
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: requires continually writing, continually writing long, writing long term, long term memories, Lifelong interactive agents
Comments:
Abstract:Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology structured memory bank via explicit, non destructive consolidation, avoiding the irreversible information loss typical of summarization based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence scored topology edits executed with gating using three operators: SPLIT, MERGE, and UPDATE, while preserving immutable evidence for traceability. At query time, typed links enable hop bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LOCOMO and LONGMEMEVAL show improved retrieval and QA over representative baselines.
15. 【2603.19585】SaFRO: Satisfaction-Aware Fusion via Dual-Relative Policy Optimization for Short-Video Search
Link: https://arxiv.org/abs/2603.19585
Authors: Renzhe Zhou, Songyang Li, Feiran Zhu, Chenglei Dai, Yi Zhang, Yi Wang, Jingwei Zhuo
Categories: Information Retrieval (cs.IR)
Keywords: aggregating heterogeneous prediction, heterogeneous prediction signals, Multi-Task Fusion plays, unified ranking score, plays a pivotal
Comments: 9 pages, 8 figures
Abstract:Multi-Task Fusion plays a pivotal role in industrial short-video search systems by aggregating heterogeneous prediction signals into a unified ranking score. However, existing approaches predominantly optimize for immediate engagement metrics, which often fail to align with long-term user satisfaction. While Reinforcement Learning (RL) offers a promising avenue for user satisfaction optimization, its direct application to search scenarios is non-trivial due to the inherent data sparsity and intent constraints compared to recommendation feeds. To this end, we propose SaFRO, a novel framework designed to optimize user satisfaction in short-video search. We first construct a satisfaction-aware reward model that utilizes query-level behavioral proxies to capture holistic user satisfaction beyond item-level interactions. Then we introduce Dual-Relative Policy Optimization (DRPO), an efficient policy learning method that updates the fusion policy through relative preference comparisons within groups and across batches. Furthermore, we design a Task-Relation-Aware Fusion module to explicitly model the interdependencies among different objectives, enabling context-sensitive weight adaptation. Extensive offline evaluations and large-scale online A/B tests on Kuaishou short-video search platform demonstrate that SaFRO significantly outperforms state-of-the-art baselines, delivering substantial gains in both short-term ranking quality and long-term user retention.
16. 【2603.19532】EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
Link: https://arxiv.org/abs/2603.19532
Authors: J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, fluent but prone, Relative Policy Optimization
Comments:
Abstract:Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly $5\times$ and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at this https URL.
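To make the training signal more concrete, the Python sketch below shows the group-relative advantage step that GRPO is generally known for: rewards for several responses to the same prompt are standardized within the group. How EvidenceRL actually combines grounding and correctness is not stated in the abstract, so the equal-weight blend, the `weight` parameter, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def combined_reward(grounding, correctness, weight=0.5):
    """Blend a grounding score and a correctness score, both in [0, 1].

    The equal weighting is an illustrative assumption, not a value from the abstract.
    """
    return weight * grounding + (1.0 - weight) * correctness


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (grounding, correctness) scores for four sampled responses.
scores = [(0.9, 1.0), (0.2, 1.0), (0.8, 0.0), (0.1, 0.0)]
rewards = [combined_reward(g, c) for g, c in scores]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Responses that are both well grounded and correct end up with positive advantages, while poorly grounded or incorrect ones are pushed below zero relative to their group.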
17. 【2603.19519】Inducing Sustained Creativity and Diversity in Large Language Models
链接:https://arxiv.org/abs/2603.19519
作者:Queenie Luo,Gary King,Michael Puett,Michael D. Smith
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:overlooked research topic, killer company idea, perfect wedding dress, search quest users, subset of exploratory
备注:
点击查看摘要
Abstract:We address a not-widely-recognized subset of exploratory search, where a user sets out on a typically long "search quest" for the perfect wedding dress, overlooked research topic, killer company idea, etc. The first few outputs of current large language models (LLMs) may be helpful but only as a start, since the quest requires learning the search space and evaluating many diverse and creative alternatives along the way. Although LLMs encode an impressive fraction of the world's knowledge, common decoding methods are narrowly optimized for prompts with correct answers and thus return mostly homogeneous and conventional results. Other approaches, including those designed to increase diversity across a small set of answers, start to repeat themselves long before search quest users learn enough to make final choices, or offer a uniform type of "creativity" to every user asking similar questions. We develop a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs, producing as many conceptually unique results as desired, even without access to the inner workings of an LLM's vector space. The algorithm unlocks an LLM's vast knowledge, both orthodox and heterodox, well beyond modal decoding paths. With this approach, search quest users can more quickly explore the search space and find satisfying answers.
18. 【2603.19339】Spectral Tempering for Embedding Compression in Dense Passage Retrieval
链接:https://arxiv.org/abs/2603.19339
作者:Yongkang Li,Panagiotis Eustratiadis,Evangelos Kanoulas
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:preserves dominant variance, underutilizes representational capacity, deploying dense retrieval, dense retrieval systems, whitening enforces isotropy
备注:
点击查看摘要
Abstract:Dimensionality reduction is critical for deploying dense retrieval systems at scale, yet mainstream post-hoc methods face a fundamental trade-off: principal component analysis (PCA) preserves dominant variance but underutilizes representational capacity, while whitening enforces isotropy at the cost of amplifying noise in the heavy-tailed eigenspectrum of retrieval embeddings. Intermediate spectral scaling methods unify these extremes by reweighting dimensions with a power coefficient $\gamma$, but treat $\gamma$ as a fixed hyperparameter that requires task-specific tuning. We show that the optimal scaling strength $\gamma$ is not a global constant: it varies systematically with target dimensionality $k$ and is governed by the signal-to-noise ratio (SNR) of the retained subspace. Based on this insight, we propose Spectral Tempering (SpecTemp), a learning-free method that derives an adaptive $\gamma(k)$ directly from the corpus eigenspectrum using local SNR analysis and knee-point normalization, requiring no labeled data or validation-based search. Extensive experiments demonstrate that Spectral Tempering consistently achieves near-oracle performance relative to grid-searched $\gamma^*(k)$ while remaining fully learning-free and model-agnostic. Our code is publicly available at this https URL.
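As a rough illustration of the spectral scaling family the abstract describes, the sketch below rescales each retained principal coordinate by its eigenvalue raised to the power -γ/2, so that γ = 0 reduces to plain PCA truncation and γ = 1 to whitening. The exponent convention, the function name `spectral_scale`, and the toy numbers are assumptions made for illustration. The inputs are assumed to be already projected onto the PCA basis, and the paper's adaptive rule for γ(k) is not reproduced here.

```python
def spectral_scale(coords, eigenvalues, k, gamma):
    """Keep the top-k principal coordinates and rescale coordinate i
    by eigenvalues[i] ** (-gamma / 2).

    coords:      rows of embeddings already expressed in the PCA basis
    eigenvalues: covariance eigenvalues, sorted in descending order
    gamma:       0.0 gives plain PCA truncation, 1.0 gives whitening
    """
    scales = [lam ** (-gamma / 2.0) for lam in eigenvalues[:k]]
    return [[c * s for c, s in zip(row[:k], scales)] for row in coords]


# Toy data: three principal directions with a decaying spectrum.
eigenvalues = [4.0, 1.0, 0.25]
coords = [[2.0, 1.0, 0.5], [-2.0, -1.0, -0.5]]

pca_only = spectral_scale(coords, eigenvalues, k=2, gamma=0.0)
whitened = spectral_scale(coords, eigenvalues, k=2, gamma=1.0)
tempered = spectral_scale(coords, eigenvalues, k=2, gamma=0.5)

print(pca_only)
print(whitened)
print(tempered)
```

Intermediate values of γ shrink the dominant directions less aggressively than whitening does, which is the trade-off the abstract says should depend on the target dimensionality k.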
19. 【2603.19306】VERDICT: Verifiable Evolving Reasoning with Directive-Informed Collegial Teams for Legal Judgment Prediction
链接:https://arxiv.org/abs/2603.19306
作者:Hui Liao,Chuan Qin,Yongwen Ren,Hao Li,Zhenya Huang,Yanyong Zhang,Chao Wang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Legal Judgment Prediction, predicts applicable law, applicable law articles, Judgment Prediction, Legal Judgment
备注: 15 pages,3 figures,4 tables
点击查看摘要
Abstract:Legal Judgment Prediction (LJP) predicts applicable law articles, charges, and penalty terms from case facts. Beyond accuracy, LJP calls for intrinsically interpretable and legally grounded reasoning that can reconcile statutory rules with precedent-informed standards. However, existing methods often behave as static, one-shot predictors, providing limited procedural support for verifiable reasoning and little capability to adapt as jurisprudential practice evolves. We propose VERDICT, a self-refining collaborative multi-agent framework that simulates a virtual collegial panel. VERDICT assigns specialized agents to complementary roles (e.g., fact structuring, legal retrieval, opinion drafting, and supervisory verification) and coordinates them in a traceable draft-verify-revise workflow with explicit Pass/Reject feedback, producing verifiable reasoning traces and revision rationales. To capture evolving case experience, we further introduce a Hybrid Jurisprudential Memory (HJM) grounded in the Micro-Directive Paradigm, which stores precedent standards and continually distills validated multi-agent verification trajectories into updated Micro-Directives for continual learning across cases. We evaluate VERDICT on CAIL2018 and a newly constructed CJO2025 dataset with a strict future time-split for temporal generalization. VERDICT achieves state-of-the-art performance on CAIL2018 and demonstrates strong generalization on CJO2025. To facilitate reproducibility and further research, we release our code and the dataset at this https URL.
20. 【2603.19281】URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models
链接:https://arxiv.org/abs/2603.19281
作者:Vinh Nguyen,Cuong Dang,Jiahao Zhang,Hoa Tran,Minh Tran,Trinh Chau,Thai Le,Lu Cheng,Suhang Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:demand extensive factual, widely adopted approach, extensive factual knowledge, widely adopted, scenarios that demand
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
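To make the link between multiple-choice reformulation and prediction-set size concrete, here is a minimal sketch of split conformal prediction with the LAC score (one minus the probability assigned to the true option). The function names, the calibration data, and the miscoverage level `alpha` are illustrative assumptions and are not taken from the benchmark's code.

```python
import math


def lac_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile of LAC scores (1 - probability of the true option)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed rank of the quantile
    if rank > n:
        return 1.0  # too few calibration points: keep every option
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """All options whose LAC score does not exceed the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical calibration set: nine 4-option questions, correct option is index 0.
cal_true_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
cal_probs = [[p, 1.0 - p, 0.0, 0.0] for p in cal_true_probs]
cal_labels = [0] * len(cal_probs)

qhat = lac_threshold(cal_probs, cal_labels, alpha=0.25)

uncertain = prediction_set([0.5, 0.3, 0.15, 0.05], qhat)
confident = prediction_set([0.97, 0.01, 0.01, 0.01], qhat)
print(qhat, uncertain, confident)
```

A larger prediction set signals higher uncertainty at the same target coverage, which is why set size can be reported alongside accuracy.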
21. 【2603.19267】Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication
链接:https://arxiv.org/abs/2603.19267
作者:Yuchen Du,Ashley Li,Zixi Huang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Hierarchical review workflows, initial judgments failed, Hierarchical review, corrects first-tier, valuable correction signals
备注: 10 pages, 3 figures, KDD 2026 Applied Data Science Track
点击查看摘要
Abstract:Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%. Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
22. 【2603.19236】L-PRISMA: An Extension of PRISMA in the Era of Generative Artificial Intelligence (GenAI)
链接:https://arxiv.org/abs/2603.19236
作者:Samar Shailendra,Rajan Kadel,Aakanksha Sharma,Islam Mohammad Tahidul,Urvashi Rahul Saxena
类目:Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Preferred Reporting Items, literature screening remain, screening remain time-consuming, Preferred Reporting, Reporting Items
备注: ICMET 2025
点击查看摘要
Abstract:The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework provides a rigorous foundation for evidence synthesis, yet the manual processes of data extraction and literature screening remain time-consuming and restrictive. Recent advances in Generative Artificial Intelligence (GenAI), particularly large language models (LLMs), offer opportunities to automate and scale these tasks, thereby saving time and improving efficiency. However, reproducibility, transparency, and auditability, the core PRISMA principles, are being challenged by the inherent non-determinism of LLMs and the risks of hallucination and bias amplification. To address these limitations, this study integrates human-led synthesis with a GenAI-assisted statistical pre-screening step. Human oversight ensures scientific validity and transparency, while the deterministic nature of the statistical layer enhances reproducibility. The proposed approach systematically enhances PRISMA guidelines, providing a responsible pathway for incorporating GenAI into systematic review workflows.
计算机视觉
1. 【2603.20194】MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
链接:https://arxiv.org/abs/2603.20194
作者:Yu Qi,Xinyi Xu,Ziyu Guo,Siyuan Ma,Renrui Zhang,Xinyan Chen,Ruichuan An,Ruofan Xing,Jiayi Zhang,Haojie Huang,Pheng-Ann Heng,Jonathan Tremblay,Lawson L.S. Wong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:emerging reasoning behaviors, models show emerging, show emerging reasoning, Video generative models, reasoning
备注:
点击查看摘要
Abstract:Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable deployment, a property we define as reasoning coherence. To fill the gap left by the lack of reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark to assess reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing the necessary intermediate reasoning steps at the process level, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open- and closed-source video models reveal insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: this https URL
2. 【2603.20193】From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
链接:https://arxiv.org/abs/2603.20193
作者:Xinyi Shang,Yi Tang,Jiacheng Cui,Ahmed Elhagry,Salwa K. Al Khatib,Sondos Mahmoud Bsharat,Jiacheng Liu,Xiaohan Zhao,Jing-Hao Xue,Hao Li,Salman Khan,Zhiqiang Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:
备注: Code and data at: [this https URL](https://github.com/VILA-Lab/PIXAR) (Accepted in CVPR 2026 Findings, but not opted in)
点击查看摘要
None
3. 【2603.20192】LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
链接:https://arxiv.org/abs/2603.20192
作者:Jiazheng Xing,Fei Du,Hangjie Yuan,Pengwei Liu,Hongbin Xu,Hai Ci,Ruigang Niu,Weihua Chen,Fan Wang,Yong Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enabling personalized content, personalized content creation, Recent advances, significantly improved, background elements
备注: ICLR 2026 Camera Ready Version. Code and Models: [this https URL](https://jiazheng-xing.github.io/lumosx-home/)
点击查看摘要
Abstract:Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at this https URL.
4. 【2603.20191】Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation
链接:https://arxiv.org/abs/2603.20191
作者:Sebastian Gerard,Josephine Sullivan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:future state prediction, medical image segmentation, inherently ambiguous, meaning that multiple, equally correct
备注:
点击查看摘要
Abstract:Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.
5. 【2603.20190】CoVR-R: Reason-Aware Composed Video Retrieval
链接:https://arxiv.org/abs/2603.20190
作者:Omkar Thawakar,Dmitry Demidov,Vaishnav Potlapalli,Sai Prasanna Teja Reddy Bogireddy,Viswanatha Reddy Gajjala,Alaa Mostafa Lasheen,Rao Muhammad Anwer,Fahad Khan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Composed Video Retrieval, Composed Video, aims to find, textual modification, Composed
备注: CVPR 2026 (findings)
点击查看摘要
Abstract:Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at this https URL.
6. 【2603.20188】Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods
链接:https://arxiv.org/abs/2603.20188
作者:Sebastian Gerard,Josephine Sullivan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Predicting future states, multiple plausible outcomes, Predicting future, uncertain environments, autonomous driving
备注: Accepted at NLDL 2026. This version contains small corrections compared to the initial publication, see appendix for details
点击查看摘要
Abstract:Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: this https URL
Journal reference: Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), PMLR, Jan. 2026
7. 【2603.20187】MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
链接:https://arxiv.org/abs/2603.20187
作者:Yuan Zhou,Yongzhi Li,Yanqi Dai,Xingyu Zhu,Yi Tan,Qingshan Xu,Beier Zhu,Richang Hong,Hanwang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:
备注:
点击查看摘要
None
8. 【2603.20186】Improving Image-to-Image Translation via a Rectified Flow Reformulation
链接:https://arxiv.org/abs/2603.20186
作者:Satoshi Iizuka,Shun Okamoto,Kazuhiro Fukui
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Rectified Flow Reformulation, practical plug-in reformulation, Rectified Flow, Flow Reformulation, plug-in reformulation
备注:
点击查看摘要
Abstract:In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
9. 【2603.20185】VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
链接:https://arxiv.org/abs/2603.20185
作者:Jingyang Lin,Jialian Wu,Jiang Liu,Ximeng Sun,Ze Wang,Xiaodong Yu,Jiebo Luo,Zicheng Liu,Emad Barsoum
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:challenging video-language tasks, advanced challenging video-language, Video, video-language tasks, video understanding
备注: Accepted at CVPR 2026
点击查看摘要
Abstract:Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
10. 【2603.20180】Adaptive Greedy Frame Selection for Long Video Understanding
链接:https://arxiv.org/abs/2603.20180
作者:Yuning Huang,Fengqing Zhu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:resulting visual tokens, Large vision, long-video question answering, language models, visual tokens
备注:
点击查看摘要
Abstract:Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
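The selection rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the authors' code: relevance scores and pairwise similarities are given as plain lists instead of SigLIP and DINOv2 embeddings, the function name `greedy_select`, the `weight` parameter, and the division of the coverage term by the number of frames are hypothetical choices, and similarities are assumed to be non-negative.

```python
def greedy_select(relevance, similarity, budget, weight=0.5):
    """Greedily pick `budget` frames maximizing
    weight * (sum of relevance) + (1 - weight) * (facility-location coverage).

    relevance[i]:     question relevance of frame i (non-negative)
    similarity[j][i]: how well frame i represents frame j (non-negative)
    """
    n = len(relevance)
    selected = []
    best_cover = [0.0] * n  # best similarity of each frame to any selected frame

    def gain(i):
        # Marginal coverage: how much frame i improves each frame's best match.
        cover_gain = sum(max(similarity[j][i] - best_cover[j], 0.0) for j in range(n))
        return weight * relevance[i] + (1.0 - weight) * cover_gain / n

    for _ in range(min(budget, n)):
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=gain)
        selected.append(best)
        best_cover = [max(best_cover[j], similarity[j][best]) for j in range(n)]
    return sorted(selected)  # return frame indices in temporal order


# Frames 0 and 1 are near-duplicates and both highly relevant; frame 2 is distant.
relevance = [0.9, 0.85, 0.5, 0.1]
similarity = [
    [1.0, 0.95, 0.1, 0.2],
    [0.95, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
]

balanced = greedy_select(relevance, similarity, budget=2, weight=0.3)
relevance_only = greedy_select(relevance, similarity, budget=2, weight=1.0)
print(balanced, relevance_only)
```

With relevance alone, the two near-duplicate frames are both chosen. Once the coverage term carries enough weight, the second pick moves to the temporally distant frame, which is the collapse-avoidance behavior the abstract motivates.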
11. 【2603.20176】LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
链接:https://arxiv.org/abs/2603.20176
作者:Stanislaw Szymanowicz,Minghao Chen,Jianyuan Wang,Christian Rupprecht,Andrea Vedaldi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent work, View Synthesis, work has shown, Recent, Synthesis
备注: IEEE CVF Conference on Computer Vision and Pattern Recognition 2026. Project page with code, models and examples: [this http URL](http://szymanowiczs.github.io/lagernvs)
点击查看摘要
Abstract:Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We demonstrate this by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on '3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
12. 【2603.20174】TinyML Enhances CubeSat Mission Capabilities
链接:https://arxiv.org/abs/2603.20174
作者:Luigi Capogrosso,Michele Magno
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:missions traditionally rely, computationally intensive analysis, minimally processed imagery, Earth observation, Convolutional Neural Networks
备注: Accepted at the 17th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS) 2026
点击查看摘要
Abstract:Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Networks (ConvNets) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.
13. 【2603.20169】EgoForge: Goal-Directed Egocentric World Simulator
链接:https://arxiv.org/abs/2603.20169
作者:Yifan Shen,Jiateng Liu,Xinzhuo Li,Yuanzhe Liu,Bingxuan Li,Houze Yang,Wenqi Jia,Yijiang Li,Tianjiao Yu,James Matthew Rehg,Xu Cao,Ismini Lourentzou
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:frequent hand-object interactions, Generative world models, remains challenging due, simulating dynamic environments, video remains challenging
备注:
点击查看摘要
Abstract:Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
14. 【2603.20155】Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD
Link: https://arxiv.org/abs/2603.20155
Authors: Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Keywords: Machine Learning, discrete diffusion models, difficult to distill, diffusion models, cs.LG
Notes:
Click to view abstract
Abstract:It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling steps to a handful. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.
15. 【2603.20148】Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning
Link: https://arxiv.org/abs/2603.20148
Authors: Hui Zhong, Yichun Gao, Luyan Liu, Hai Yang, Wang Wang, Haowei Zhang, Xinhu Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: smart city maintenance, Automated building facade, building facade inspection, city maintenance, Automated building
Notes:
Click to view abstract
Abstract:Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize poorly, lacking the visual understanding needed to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing "what" and "how"), they exhibit significant deficiencies in metric localization precision ("where"). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
16. 【2603.20143】Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection
Link: https://arxiv.org/abs/2603.20143
Authors: Hui Zhong, Yichun Gao, Luyan Liu, Xusen Guo, Zhaonian Kuang, Qiming Zhang, Xinhu Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sustainable urban maintenance, extreme geometric variability, formidable challenge due, Building facade defect, structural health monitoring
Notes:
Click to view abstract
Abstract:Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co-occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high-quality pixel-level annotations, hinder the generalization of existing detection and segmentation models. To address these gaps, we propose FacadeFixer, a unified multi-agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically, FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high-fidelity augmented data with precise expert-level masks. To support this, we introduce a comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state-of-the-art (SOTA) baselines. Specifically, it excels in capturing pixel-level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.
17. 【2603.20128】Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives
Link: https://arxiv.org/abs/2603.20128
Authors: Wanqi Yuan, Omkar Sharad Mayekar, Connor Pennington, Nianyi Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: outputs demand dense, demand dense sampling, achieve photorealistic, Neural Graphics Primitives, outputs demand
Notes:
Click to view abstract
Abstract:Neural Radiance Fields (NeRF) achieve photorealistic novel view synthesis but become costly when high-resolution (HR) rendering is required, as HR outputs demand dense sampling and higher-capacity models. Moreover, naively super-resolving per-view renderings in 2D often breaks multi-view consistency. We propose Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs an HR radiance field directly from low-resolution (LR) posed images. Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens, enabling recovery of high-frequency details within the radiance field and producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. Importantly, our model is generalizable: once trained, it can be applied to unseen scenes and rendered from novel viewpoints without per-scene optimization. Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.
18. 【2603.20116】Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
Link: https://arxiv.org/abs/2603.20116
Authors: Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: pretrained multimodal priors, leading to reduced, Conventional fine-tuning, domain-specific datasets, datasets can inadvertently
Notes:
Click to view abstract
Abstract:Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence by reinforcement learning. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
19. 【2603.20086】Preference-Guided Debiasing for No-Reference Enhancement Image Quality Assessment
Link: https://arxiv.org/abs/2603.20086
Authors: Shiqi Gao, Kang Fu, Zitong Xu, Huiyu Duan, Xiongkuo Min, Jia Wang, Guangtao Zhai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current no-reference image, specific enhancement algorithms, Current no-reference, evaluating genuine perceptual, image quality assessment
Notes:
Click to view abstract
Abstract:Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.
20. 【2603.20077】A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic, Optical, and Electromagnetic Tracking
Link: https://arxiv.org/abs/2603.20077
Authors: Lewis Howell, Manisha Waterston, Tze Min Wah, James H. Chandler, James R. McLaughlan
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: treatment planning, robust Quality Assurance, facilitate diagnosis, Quality Assurance, Three-dimensional
Notes: This work has been submitted to the IEEE for possible publication
Click to view abstract
Abstract:Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open-source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
21. 【2603.20074】MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models
Link: https://arxiv.org/abs/2603.20074
Authors: Puskal Khadka, KC Santosh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sequence modeling tasks, achieved remarkable success, recent Mamba architecture, recent Mamba, modeling tasks
Notes:
Click to view abstract
Abstract:State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at this https URL.
22. 【2603.20020】Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
Link: https://arxiv.org/abs/2603.20020
Authors: Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large language models, fail on OCR, Multimodal large language, language models, compromised or misaligned
Notes:
Click to view abstract
Abstract:Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
23. 【2603.20016】CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data
Link: https://arxiv.org/abs/2603.20016
Authors: Tianling Liu, Hongying Liu, Fanhua Shang, Lequan Yu, Tong Han, Liang Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: including medical images, information including medical, clinical practice, disease diagnosis, modality gap
Notes:
Click to view abstract
Abstract:In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. There exists a significant modality gap between these data types, which obstructs advancements in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods primarily focus on exploring relationships among high-level encoder outputs, leading to the neglect of local information in images. Additionally, these methods often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework to progressively reduce the modality gap between multimodal images and tabular data, by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy utilizes modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning approaches, effectively enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms the state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC metrics on the MEN and Derm7pt datasets, respectively. The code is available at this https URL.
24. 【2603.20012】Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
Link: https://arxiv.org/abs/2603.20012
Authors: Zheng Gao, Debin Meng, Yunqi Miao, Zhensong Zhang, Songcen Xu, Ioannis Patras, Jifei Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current diffusion-based makeup, facial region-aware makeup, makeup CLIP, makeup CLIP fine-tuning, region-aware makeup features
Notes: Accepted by CVPR'26
Click to view abstract
Abstract:Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as a condition to preserve the makeup style of the reference image in the generation. Although effective, these works mainly have two limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, overlooking the facial region-aware makeup features (e.g., eyes, mouth) and limiting the regional controllability for region-specific makeup transfer. To address these, in this work, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and a text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject the identity of the source image and the makeup of the reference image into the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode the source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.
25. 【2603.20005】NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
Link: https://arxiv.org/abs/2603.20005
Authors: Haoyue Liu, Jinghan Xu, Luxin Feng, Hanyu Zhou, Haozhi Zhao, Yi Chang, Luxin Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: highly challenging, extremely low-light conditions, extremely low-light, RAW images, low-light RAW images
Notes: Accepted by CVPR 2026
Click to view abstract
Abstract:High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event-RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001-0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness. The project is available at: this https URL.
26. 【2603.19994】Evaluating Test-Time Adaptation For Facial Expression Recognition Under Natural Cross-Dataset Distribution Shifts
Link: https://arxiv.org/abs/2603.19994
Authors: John Turnbull, Shivam Grover, Amin Jalali, Ali Etemad
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Keywords: Deep learning models, Deep learning, common challenge, Deep, learning models
Notes: Accepted at ICASSP 2026
Click to view abstract
Abstract:Deep learning models often struggle under natural distribution shifts, a common challenge in real-world deployments. Test-Time Adaptation (TTA) addresses this by adapting models during inference without labeled source data. We present the first evaluation of TTA methods for facial expression recognition (FER) under natural domain shifts, performing cross-dataset experiments with widely used FER datasets. This moves beyond synthetic corruptions to examine real-world shifts caused by differing collection protocols, annotation standards, and demographics. Results show TTA can boost FER performance under natural shifts by up to 11.34%. Entropy minimization methods such as TENT and SAR perform best when the target distribution is clean. In contrast, prototype adjustment methods like T3A excel under larger distributional distance scenarios. Finally, feature alignment methods such as SHOT deliver the largest gains when the target distribution is noisier than our source. Our cross-dataset analysis shows that TTA effectiveness is governed by the distributional distance and the severity of the natural shift across domains.
27. 【2603.19993】MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Link: https://arxiv.org/abs/2603.19993
Authors: Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Large Language, perform reliable visual, environments remains underexplored
Notes: Project page: [this https URL](https://rozainmalik.github.io/MedSPOT_web/)
Click to view abstract
Abstract:Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: this https URL.
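The strict sequential evaluation protocol described above (stop scoring a task at the first incorrect grounding prediction) can be sketched roughly as follows. This is only an illustration: the point-in-box correctness criterion and the names `point_in_box` and `strict_sequential_eval` are assumptions, not taken from the MedSPOT code.

```python
def point_in_box(point, box):
    """Assumed correctness test: the predicted click lies inside the target box."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def strict_sequential_eval(predictions, targets):
    """Score one task and stop at the first incorrect (or missing) prediction.

    predictions: list of (x, y) points, or None when the model gives no answer.
    targets: list of (x0, y0, x1, y1) boxes, one per workflow step.
    Returns (steps_completed, task_success).
    """
    completed = 0
    for pred, box in zip(predictions, targets):
        if pred is None or not point_in_box(pred, box):
            break  # later steps are not evaluated once a step fails
        completed += 1
    return completed, completed == len(targets)


targets = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
steps, success = strict_sequential_eval([(5, 5), (0, 0), (45, 45)], targets)
```

In this toy call the third prediction would have been correct, but it does not count because the second step already failed; that is how early termination exposes error propagation in multi-step workflows.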
28. 【2603.19979】X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
Link: https://arxiv.org/abs/2603.19979
Authors: Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, Yu Zhang, Xianming Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: map raw sensor, policies directly map, directly map raw, raw sensor streams, autonomous driving
Notes: Technical Report
Click to view abstract
Abstract:Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision-language-action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.
29. 【2603.19964】2K Retrofit: Entropy-Guided Efficient Sparse Refinement for High-Resolution 3D Geometry Prediction
Link: https://arxiv.org/abs/2603.19964
Authors: Tianbao Zhang, Zhenyu Liang, Zhenbo Song, Nana Wang, Xiaomei Zhang, Xudong Cai, Zheng Zhu, Kejian Wu, Gang Wang, Zhaoxin Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: current foundation models, autonomous driving, scalability to real-world, essential for robust, robust perception
Notes: 15 pages
Click to view abstract
Abstract:High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but current foundation models are fundamentally limited by their scalability to real-world, high-resolution scenarios. Direct inference on 2K images with these models incurs prohibitive computational and memory demands, making practical deployment challenging. To tackle the issue, we present 2K Retrofit, a novel framework that enables efficient 2K-resolution inference for any geometric foundation model, without modifying or retraining the backbone. Our approach leverages fast coarse predictions and an entropy-based sparse refinement to selectively enhance high-uncertainty regions, achieving precise and high-fidelity 2K outputs with minimal overhead. Extensive experiments on widely used benchmarks demonstrate that 2K Retrofit consistently achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications. Code will be released upon acceptance.
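To make the idea of entropy-based sparse refinement concrete, here is a minimal sketch of selecting which regions to refine. It assumes the coarse prediction exposes, for each patch, a discrete probability distribution (for example over depth bins) and that a fixed budget of patches is refined; both are assumptions for illustration, and `select_patches_to_refine` is a hypothetical name rather than the paper's API.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_patches_to_refine(patch_probs, budget):
    """Return the indices of the `budget` most uncertain patches, in ascending order."""
    ranked = sorted(
        range(len(patch_probs)),
        key=lambda i: entropy(patch_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Four toy patches, each with a distribution over four hypothetical depth bins.
patch_probs = [
    [1.0, 0.0, 0.0, 0.0],      # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.5, 0.0, 0.0],
]
to_refine = select_patches_to_refine(patch_probs, budget=2)
```

Only the selected patches would be passed to the expensive refinement step, so the cost scales with the budget rather than with the full 2K image.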
30. 【2603.19961】Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
Link: https://arxiv.org/abs/2603.19961
Authors: Nassim Ali Ousalah, Peyman Rostami, Vincent Gaudillière, Emmanuel Koumandakis, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single RGB image, RGB image, single RGB, object pose estimation, address the problem
Notes: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Click to view abstract
Abstract:In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
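The covariance-pooled representation mentioned above replaces global average pooling with second-order statistics of the feature map. The sketch below computes a sample covariance matrix over spatial locations and adds a small ridge so the result is symmetric positive definite (SPD). The ridge term `eps` and the function name `covariance_pool` are assumptions for illustration; the paper's exact formulation may differ.

```python
def covariance_pool(features, eps=1e-5):
    """Pool N feature vectors of length C into a C x C covariance matrix.

    features: list of N lists (one per spatial location), each of length C.
    eps: small ridge added to the diagonal so the matrix is SPD (an assumption).
    """
    n, c = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(c)]
    cov = [[0.0] * c for _ in range(c)]
    for i in range(c):
        for j in range(c):
            s = sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features)
            cov[i][j] = s / (n - 1) + (eps if i == j else 0.0)
    return cov


# Three spatial locations with two channels each.
cov = covariance_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
```

Unlike a pooled mean, the off-diagonal entries record how channels co-vary across the image, which is the spatial second-order information the abstract refers to.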
31. 【2603.19957】HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
Link: https://arxiv.org/abs/2603.19957
Authors: Ruicheng Yuan,Zhenxuan Zhang,Anbang Wang,Liwei Hu,Xiangqian Hua,Yaya Peng,Jiawei Luo,Guang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: ancillary test results, pathology vision-language models, encoding diagnostic conclusions, multi-granular documents encoding, existing pathology vision-language
Comments: 10 pages, 1 figure, 3 tables
Abstract:Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
32. 【2603.19939】Timestep-Aware Block Masking for Efficient Diffusion Model Inference
Link: https://arxiv.org/abs/2603.19939
Authors: Haodong He,Yuan Gao,Weizhong Zhang,Gui-Song Xia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion Probabilistic Models, Diffusion Probabilistic, achieved great success, iterative denoising nature, high inference latency
Comments: 10 pages
Abstract:Diffusion Probabilistic Models (DPMs) have achieved great success in image generation but suffer from high inference latency due to their iterative denoising nature. Motivated by the evolving feature dynamics across the denoising trajectory, we propose a novel framework to optimize the computational graph of pre-trained DPMs on a per-timestep basis. By learning timestep-specific masks, our method dynamically determines which blocks to execute or bypass through feature reuse at each inference stage. Unlike global optimization methods that incur prohibitive memory costs via full-chain backpropagation, our method optimizes masks for each timestep independently, ensuring a memory-efficient training process. To guide this process, we introduce a timestep-aware loss scaling mechanism that prioritizes feature fidelity during sensitive denoising phases, complemented by a knowledge-guided mask rectification strategy to prune redundant spatial-temporal dependencies. Our approach is architecture-agnostic and demonstrates significant efficiency gains across a broad spectrum of models, including DDPM, LDM, DiT, and PixArt. Experimental results show that by treating the denoising process as a sequence of optimized computational paths, our method achieves a superior balance between sampling speed and generative quality. Our code will be released.
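To make the idea of per-timestep block bypass concrete, here is a minimal inference-time sketch. It assumes the masks have already been learned and simply reuses each skipped block's most recent output; the function name `run_with_masks`, the cache layout, and the toy blocks are hypothetical, and the paper's actual reuse rule and mask training are not reproduced here.

```python
def run_with_masks(blocks, masks, x, num_steps):
    """Run a stack of blocks for num_steps denoising steps.

    masks[t][i] is True when block i must be executed at step t and False
    when its cached output from an earlier step is reused instead.
    Returns the final value and the number of block executions.
    """
    cache = [None] * len(blocks)
    executed = 0
    for t in range(num_steps):
        h = x
        for i, block in enumerate(blocks):
            # A block with no cached output yet is always executed.
            if masks[t][i] or cache[i] is None:
                cache[i] = block(h)
                executed += 1
            h = cache[i]
        x = h
    return x, executed

# Toy example: two "blocks" and two steps; block 1 is bypassed at step 1.
toy_blocks = [lambda v: v + 1, lambda v: v * 2]
toy_masks = [[True, True], [True, False]]
result, calls = run_with_masks(toy_blocks, toy_masks, 1, 2)
```

In this toy run only three of the four possible block calls are made, which is where the speedup would come from in a real model.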
33. 【2603.19936】LIORNet: Self-Supervised LiDAR Snow Removal Framework for Autonomous Driving under Adverse Weather Conditions
Link: https://arxiv.org/abs/2603.19936
Authors: Ji-il Park,Inwook Shim
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: sensors provide high-resolution, LiDAR sensors provide, provide high-resolution, long-range detection, making them indispensable
Comments: 14 pages, 6 figures, 2 tables
Abstract:LiDAR sensors provide high-resolution 3D perception and long-range detection, making them indispensable for autonomous driving and robotics. However, their performance significantly degrades under adverse weather conditions such as snow, rain, and fog, where spurious noise points dominate the point cloud and lead to false perception. To address this problem, various approaches have been proposed: distance-based filters exploiting spatial sparsity, intensity-based filters leveraging reflectance distributions, and learning-based methods that adapt to complex environments. Nevertheless, distance-based methods struggle to distinguish valid object points from noise, intensity-based methods often rely on fixed thresholds that lack adaptability to changing conditions, and learning-based methods suffer from the high cost of annotation, limited generalization, and computational overhead. In this study, we propose LIORNet, which eliminates these drawbacks and integrates the strengths of all three paradigms. LIORNet is built upon a U-Net++ backbone and employs a self-supervised learning strategy guided by pseudo-labels generated from multiple physical and statistical cues, including range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This design enables LIORNet to distinguish noise points from environmental structures without requiring manual annotations, thereby overcoming the difficulty of snow labeling and the limitations of single-principle approaches. Extensive experiments on the WADS and CADC datasets demonstrate that LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. These results highlight LIORNet as a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems.
34. 【2603.19929】RAM: Recover Any 3D Human Motion in-the-Wild
Link: https://arxiv.org/abs/2603.19929
Authors: Sen Jia,Ning Zhu,Jinqin Zhong,Jiale Zhou,Huaping Zhang,Jenq-Neng Hwang,Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: adaptive Kalman filtering, motion-aware semantic tracker, achieve robust identity, robust identity association, adaptive Kalman
Comments:
Abstract:RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW demonstrate that RAM substantially outperforms the previous state of the art in both zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in the wild.
35. 【2603.19926】SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images
Link: https://arxiv.org/abs/2603.19926
Authors: Jinyuan Qu,Hongyang Li,Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: posed RGB-D scans, requiring complex multi-stage, multi-stage processing pipelines, high-quality point clouds, complex multi-stage processing
Comments:
Abstract:3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.
36. 【2603.19920】PanORama: Multiview Consistent Panoptic Segmentation in Operating Rooms
Link: https://arxiv.org/abs/2603.19920
Authors: Tuna Gürbüz,Ege Özsoy,Tony Danjun Wang,Nassir Navab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: highly occluded environments, reliable spatial understanding, complex surgical workflows, spatial understanding, highly occluded
Comments:
Abstract:Operating rooms (ORs) are cluttered, dynamic, highly occluded environments, where reliable spatial understanding is essential for situational awareness during complex surgical workflows. Achieving spatial understanding for panoptic segmentation from sparse multiview images poses a fundamental challenge, as limited visibility in a subset of views often leads to mispredictions across cameras. To this end, we introduce PanORama, the first panoptic segmentation for the operating room that is multiview-consistent by design. By modeling cross-view interactions at the feature level inside the backbone in a single forward pass, view consistency emerges directly rather than through post-hoc refinement. We evaluate on the MM-OR and 4D-OR datasets, achieving 70% Panoptic Quality (PQ) performance, and outperforming the previous state of the art. Importantly, PanORama is calibration-free, requiring no camera parameters, and generalizes to unseen camera viewpoints within any multiview configuration at inference time. By substantially enhancing multiview segmentation and, consequently, spatial understanding in the OR, we believe our approach opens new opportunities for surgical perception and assistance. Code will be released upon acceptance.
37. 【2603.19918】Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Link: https://arxiv.org/abs/2603.19918
Authors: Jizhou Han,Chenhao Ding,Yuhang He,Qiang Wang,Shaokun Wang,SongLin Dong,Yihong Gong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Generalized Category Discovery, yield brittle boundaries, Textual Concept Generator, prevailing visual-only pipelines, Generalized Category
Comments: Accepted by CVPR 2026
Abstract:Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data. Our code is available at: this https URL.
38. 【2603.19873】SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation
Link: https://arxiv.org/abs/2603.19873
Authors: Víctor Barreiro,Johannes Jakubik,Francisco Argüello,Dora B. Heras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Earth Observation, Observation is computationally, high training time, computationally expensive, time and memory
Comments:
Abstract:Fine-tuning foundation models for Earth Observation is computationally expensive, with high training time and memory demands for both training and deployment. Parameter-efficient methods reduce training cost but retain full inference complexity, while post-hoc compression optimizes inference only after costly full fine-tuning. We introduce SIMPLER, a pre-fine-tuning architecture selection method that reduces inference and deployment costs by identifying an effective model depth before adaptation. SIMPLER exploits stabilization of representations in deeper layers of pre-trained vision transformers: it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, with no gradients, magnitude heuristics, or hyperparameter tuning required. On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (a multimodal EO foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities. Code is available at this https URL.
39. 【2603.19863】MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment
Link: https://arxiv.org/abs/2603.19863
Authors: Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: simple quality scores, multimodal large language, fall substantially short, provide descriptive assessments, image quality assessment
Comments:
Abstract:Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model's evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.
40. 【2603.19862】IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Link: https://arxiv.org/abs/2603.19862
Authors: Simone Magistri,Dipam Goswami,Marco Mistretta,Bartłomiej Twardowski,Joost van de Weijer,Andrew D. Bagdanov
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: involve both visual, intra-modal, intra-modal misalignment, inherently intra-modal tasks, visual and text
Comments: Accepted at CVPR 2026
Abstract:Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: this https URL.
41. 【2603.19857】FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
Link: https://arxiv.org/abs/2603.19857
Authors: You Li,Dewei Zhou,Fan Ma,Fu Li,Dongliang He,Yi Yang
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable progress, methods have achieved, remarkable progress, achieved remarkable, temporal
Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, 18 pages
Abstract:Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
42. 【2603.19852】Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them
Link: https://arxiv.org/abs/2603.19852
Authors: Michael Hubbertz,Qi Han,Tobias Meisen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Deep learning-based online, Deep learning-based, models frequently fail, autonomous driving, familiar environments
Comments: Accepted to CVPR 2026, final camera ready version is published there
Abstract:Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: Memorization of input features and overfitting to known map geometries. We propose measures based on evaluation subsets that control for geographical proximity and geometric similarity between training and validation scenes. We introduce Fréchet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: a localization overfitting score quantifying the performance drop when geographic cues disappear, and a map geometry overfitting score measuring degradation as scenes become geometrically novel. Beyond models, we analyze dataset biases and contribute map geometry-aware diagnostics: A minimum-spanning-tree (MST) diversity measure for training sets and a symmetric coverage measure to quantify geometric similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that map geometry-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and map geometry-centric dataset design for deployable online mapping.
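For readers unfamiliar with Fréchet-style shape comparison, the sketch below computes the discrete Fréchet distance between a predicted and a ground-truth polyline using the standard dynamic-programming recurrence. The abstract does not say which Fréchet variant or which statistics the authors use, so the discrete version, the function name, and the toy lane geometry are assumptions for illustration only.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as lists of (x, y)."""
    n, m = len(p), len(q)
    # ca[i][j] is the coupling distance between the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]

# Hypothetical map element: a predicted lane offset by 1 m from the ground truth.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
offset_error = discrete_frechet(predicted, ground_truth)
```

Because the result is a distance in the same units as the map, it can be reported per element without first choosing a match threshold.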
43. 【2603.19844】Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation
Link: https://arxiv.org/abs/2603.19844
Authors: Lokendra Kumar,Shubham Aggarwal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fixed residual connections, brain tumor segmentation, study of Hyper-Connections, multi-modal brain tumor, drop-in replacement
Comments: 29 pages, 6 tables, 17 figures
Abstract:We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.
44. 【2603.19834】Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
Link: https://arxiv.org/abs/2603.19834
Authors: Mihnea-Bogdan Jurca,Bert Van hauwermeiren,Adrian Munteanu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, explicit primitive rasterization, enables real-time rendering, view synthesis, synthesis has recently
Comments:
Abstract:Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.
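The level-of-detail idea can be illustrated with a toy closed outline whose radius is a Fourier series in the polar angle: dropping higher harmonics yields a simpler shape from the same coefficients. The radial parameterization, the function names, and the coefficient values below are assumptions for this sketch; the abstract does not give the paper's actual descriptor.

```python
import math

def radius(theta, a0, coeffs, max_harmonics=None):
    """Evaluate r(theta) = a0 + sum_k (a_k cos(k theta) + b_k sin(k theta)).

    coeffs is a list of (a_k, b_k) pairs for k = 1, 2, ...
    max_harmonics keeps only the first few pairs (runtime truncation).
    """
    used = coeffs if max_harmonics is None else coeffs[:max_harmonics]
    return a0 + sum(
        a * math.cos(k * theta) + b * math.sin(k * theta)
        for k, (a, b) in enumerate(used, start=1)
    )

def outline(a0, coeffs, max_harmonics=None, samples=64):
    """Sample the closed outline as a list of (x, y) points."""
    points = []
    for i in range(samples):
        theta = 2.0 * math.pi * i / samples
        r = radius(theta, a0, coeffs, max_harmonics)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# Hypothetical coefficients: a unit circle perturbed by three harmonics.
shape_coeffs = [(0.2, 0.0), (0.0, 0.1), (0.05, 0.05)]
full_detail = outline(1.0, shape_coeffs)
coarse = outline(1.0, shape_coeffs, max_harmonics=1)
circle = outline(1.0, shape_coeffs, max_harmonics=0)
```

With all harmonics removed the outline degenerates to a circle of radius `a0`, the coarsest level of detail in this toy setup.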
45. 【2603.19822】HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks
Link: https://arxiv.org/abs/2603.19822
Authors: Jingyu Guo,Ziye Chen,Ziwen Li,Zhengqing Gao,Jiaxin Huang,Hanlue Zhang,Fengming Huang,Yu Yao,Tongliang Liu,Mingming Gong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Existing UAV vision-language, UAV vision-language navigation, enabled language-guided flight, step-wise route descriptions, Existing UAV
Comments:
Abstract:Existing UAV vision-language navigation (VLN) benchmarks have enabled language-guided flight, but they largely focus on long, step-wise route descriptions with goal-centric evaluation, making them less diagnostic for real operations where brief, high-level commands must be grounded into safe multi-stage behaviors. We present HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests whether an agent can interpret concise language and execute complex, process-oriented trajectories with safety awareness. HUGE-Bench comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that combines photorealistic rendering with collision-capable geometry for scalable generation and collision-aware evaluation. We introduce process-oriented and collision-aware metrics to assess process fidelity, terminal accuracy, and safety. Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, highlighting HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.
46. 【2603.19807】Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
Link: https://arxiv.org/abs/2603.19807
Authors: Jiyeong Kim,Yerim So,Hyesong Choi,Uiwon Hwang,Dongbo Min
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Unified Multimodal Models, integrates multimodal understanding, Multimodal Models, unified modeling framework, Unified Multimodal
Comments:
Abstract:Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
47. 【2603.19802】Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy
Link: https://arxiv.org/abs/2603.19802
Authors: Carolin Teuber,Anwai Archit,Tobias Boothe,Peter Ditte,Jochen Rink,Constantin Pape
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: including biomedical imaging, Deep learning underlies, biomedical imaging, underlies most modern, Deep learning
Comments:
Abstract:Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object-level classification (object classification), feature-based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state-of-the-art tools for many other vision tasks in microscopy - most notably cellular instance segmentation - already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones ($\mu$SAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.
48. 【2603.19795】Controllable Text-to-Motion Generation via Modular Body-Part Phase Control
Link: https://arxiv.org/abs/2603.19795
Authors: Minyue Dai,Ke Fan,Anyi Rao,Jingbo Wang,Bo Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interactive avatars, tool for animation, animation and interactive, Phase, practical tool
Comments:
Abstract:Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: this https URL
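The four scalars named in the abstract (amplitude, frequency, phase shift, offset) are the parameters of an ordinary sinusoid. The sketch below evaluates such a signal and shows how editing one scalar changes it. The exact functional form, the function names, and the mapping of amplitude to magnitude, frequency to speed, and phase shift to timing are assumptions drawn from the abstract's wording, not the authors' code.

```python
import math

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    """Value at time t (in seconds) of a sinusoid with the four scalar parameters."""
    return amplitude * math.sin(2.0 * math.pi * frequency * t + phase_shift) + offset

def sample(params, duration=1.0, fps=20):
    """Sample the signal at a fixed frame rate; params holds the four scalars."""
    return [phase_signal(i / fps, **params) for i in range(int(duration * fps))]

# Hypothetical control code for one body part.
base = {"amplitude": 1.0, "frequency": 2.0, "phase_shift": 0.0, "offset": 0.0}
larger = dict(base, amplitude=2.0)                # bigger swing, same rhythm
delayed = dict(base, phase_shift=-math.pi / 2.0)  # same swing, shifted in time

base_curve = sample(base)
larger_curve = sample(larger)
```

Editing a single scalar leaves the other three untouched, which is what makes this kind of interface attractive for iterative refinement.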
49. 【2603.19790】From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models
Link: https://arxiv.org/abs/2603.19790
Authors: Weile Gong,Yiping Zuo,Zijian Lu,Xin He,Weibei Fan,Chen Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern vision-language models, Modern vision-language, generative OCR engines, generative OCR, vision-language models
Comments: 10 pages, 5 figures, 5 tables
Abstract:Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
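The accept/abstain formulation can be sketched as a consensus check over transcriptions obtained from several views of the same input. The version below uses exact-match majority voting with a fixed agreement threshold, which is a deliberately simple stand-in: the function name, the `min_agreement` parameter, and the voting rule are assumptions, and the paper's structural screening and stability criteria are not modeled.

```python
from collections import Counter

def accept_or_abstain(transcriptions, min_agreement=0.6):
    """Return the majority transcription if enough views agree, else None.

    transcriptions: one OCR output string per view of the same input.
    min_agreement: fraction of views that must produce the same string.
    """
    if not transcriptions:
        return None
    text, votes = Counter(transcriptions).most_common(1)[0]
    if votes / len(transcriptions) >= min_agreement:
        return text
    return None  # abstain: the views disagree too much to trust any output

# Three views agree; one over-generates extra text (hypothetical outputs).
views = ["INVOICE 42", "INVOICE 42", "INVOICE 42", "INVOICE 42 total due"]
decision = accept_or_abstain(views)
```

Raising `min_agreement` trades coverage for lower risk, which gives a small family of operating points.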
50. 【2603.19788】Learning Hierarchical Orthogonal Prototypes for Generalized Few-Shot 3D Point Cloud Segmentation
Link: https://arxiv.org/abs/2603.19788
Authors: Yifei Zhao, Fanyu Zhao, Zhongyuan Zhang, Shengtang Wu, Yixuan Lin, Yinsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: point cloud segmentation, inherent stability-plasticity trade-off, cloud segmentation aims, remains challenging due, maintaining strong performance
Comments: 6 pages, 6 figures, 2 tables, Accepted by ICME 2026
Abstract:Generalized few-shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability-plasticity trade-off: adapting to novel classes can interfere with shared representations and cause base-class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy-based few-shot regularizer to enable robust novel-class adaptation without degrading base-class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base-novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings. The code is available at this https URL.
51. 【2603.19780】Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection
Link: https://arxiv.org/abs/2603.19780
Authors: Hantao Zheng, Ning Han, Yawen Zeng, Hao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent weakly supervised, weakly supervised video, supervised video anomaly, video anomaly detection, anomaly detection methods
Comments: 6 pages, 3 figures, 4 tables. Accepted by ICME 2026
Abstract:Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at this https URL.
52. 【2603.19779】One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment
Link: https://arxiv.org/abs/2603.19779
Authors: Wen Yin, Cencen Liu, Dingrui Liu, Bing Su, Yuan-Fang Li, Tao He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unifying Image Quality, Image Quality Assessment, Unifying Image, Image Quality, Image Aesthetic Assessment
Comments: 10 pages, 7 figures
Abstract:Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task's nature. TATAR combines three components: fast-slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at this https URL.
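To make the phrase "Gaussian score shaping" in the abstract above more concrete, here is a minimal sketch of one common form such a reward could take: a bell curve centered on the ground-truth score. The function name `gaussian_score_reward` and the default `sigma` are hypothetical, and the paper's actual reward may be defined differently.

```python
import math


def gaussian_score_reward(predicted: float, target: float, sigma: float = 0.5) -> float:
    """Reward in [0, 1] that peaks at 1.0 when the prediction equals the target.

    The Gaussian form and the default sigma are assumptions for illustration.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Squared error scaled by the width, then mapped through exp(-x).
    return math.exp(-((predicted - target) ** 2) / (2.0 * sigma ** 2))
```

A smaller `sigma` rewards only near-exact scores, while a larger one still gives partial credit to predictions that land close to the target.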
53. 【2603.19776】ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
Link: https://arxiv.org/abs/2603.19776
Authors: Chengzhi Hong, Bijun Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: weak geometric constraints, remains challenging due, detection remains challenging, geometric constraints, remains challenging
Comments:
Abstract:Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at this https URL.
54. 【2603.19775】Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach
Link: https://arxiv.org/abs/2603.19775
Authors: Shiqi Gao, Zitong Xu, Kang Fu, Huiyu Duan, Xiongkuo Min, Jia wang, Guangtao Zhai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Evaluating text-guided image, Evaluating text-guided, text-guided image editing, TIE models, challenging problem
Comments:
Abstract:Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which accumulates into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
55. 【2603.19773】Template-based Object Detection Using a Foundation Model
Link: https://arxiv.org/abs/2603.19773
Authors: Valentin Braeutigam, Matthias Stock, Bernhard Egger
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: varying appearances, training, detect objects, object detection methods, Abstract
Comments:
Abstract:Most currently used object detection methods are learning-based, and can detect objects under varying appearances. Those models require training and a training dataset. We focus on use cases with less data variation, but the requirement of being free of generation of training data and training. Such a setup is for example desired in automatic testing of graphical interfaces during software development, especially for continuous integration testing. In our approach, we use segments from segmentation foundation models and combine them with a simple feature-based classification method. This saves time and cost when changing the object to be searched or its design, as nothing has to be retrained and no dataset has to be created. We evaluate our method on the task of detecting and classifying icons in navigation maps, which is used to simplify and automate the testing of user interfaces in automotive industry. Our methods achieve results almost on par with learning-based object detection methods like YOLO, without the need for training.
56. 【2603.19770】FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
Link: https://arxiv.org/abs/2603.19770
Authors: Zekai Wu, Shuqi Fan, Mengyin Liu, Yuhua Luo, Xincheng Lin, Ming Yan, Junhao Wu, Xiuhong Lin, Yuexin Ma, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: swift motion analysis, crucial for swift, Precise motion timing, PMT, Precise motion
Comments: Accepted to CVPR 2026
Abstract:Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.
57. 【2603.19766】Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
Link: https://arxiv.org/abs/2603.19766
Authors: Donghai Fang, Yongheng Li, Zhen Wang, Yuansong Zeng, Wenwen Min
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limited throughput motivate, throughput motivate predicting, situ expression profiling, motivate predicting expression, predicting expression directly
Comments: Accepted by CVPR 2026
Abstract:Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, ours outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.
58. 【2603.19765】FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
Link: https://arxiv.org/abs/2603.19765
Authors: Zhihan Yin, Jianxin Liang, Yueqian Wang, Yifeng Yao, Huishuai Zhang, Dongyan Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, Language Models
Comments: 34 pages
Abstract:Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model's ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
59. 【2603.19762】PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences
Link: https://arxiv.org/abs/2603.19762
Authors: Min Lin, Gangwei Xu, Xianqi Wang, Yuyi Peng, Xin Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scene flow estimation, scene flow, flow estimation, cloud scene flow, Point cloud scene
Comments: Accepted in CVPR 2026 (Findings)
Abstract:Point cloud scene flow estimation is fundamental to long-term and fine-grained 3D motion analysis. However, existing methods are typically limited to pairwise settings and struggle to maintain temporal consistency over long sequences as geometry evolves, occlusions emerge, and errors accumulate. In this work, we propose PCSTracker, the first end-to-end framework specifically designed for consistent scene flow estimation in point cloud sequences. Specifically, we introduce an iterative geometry motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features to alleviate correspondence inconsistencies caused by dynamic geometric changes. In addition, a spatio-temporal point trajectory update module (STTU) is proposed to leverage broad temporal context to infer plausible positions for occluded points, ensuring coherent motion estimation. To further handle long sequences, we employ an overlapping sliding-window inference strategy that alternates cross-window propagation and in-window refinement, effectively suppressing error accumulation and maintaining stable long-term motion consistency. Extensive experiments on the synthetic PointOdyssey3D and real-world ADT3D datasets show that PCSTracker achieves the best accuracy in long-term scene flow estimation and maintains real-time performance at 32.5 FPS, while demonstrating superior 3D motion understanding compared to RGB-D-based approaches.
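The overlapping sliding-window inference mentioned in the abstract above depends on splitting a long sequence into windows that share frames with their neighbors. The sketch below covers only that index bookkeeping, under assumed parameters; the function name `overlapping_windows` and its arguments are hypothetical, and the cross-window propagation and in-window refinement steps are not modeled.

```python
from typing import List, Tuple


def overlapping_windows(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges covering the sequence with a fixed overlap."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap  # how far each new window advances
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break  # the last window reached the end of the sequence
        start += stride
    return ranges
```

The frames shared between consecutive windows are where estimates from the previous window could seed the next one, which is the role the abstract assigns to cross-window propagation.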
60. 【2603.19759】Growing Networks with Autonomous Pruning
Link: https://arxiv.org/abs/2603.19759
Authors: Charles De Lambilly, Stefan Duffner
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: paper introduces Growing, introduces Growing Networks, paper introduces, GNAP, parameters
Comments:
Abstract:This paper introduces Growing Networks with Autonomous Pruning (GNAP) for image classification. Unlike traditional convolutional neural networks, GNAP change their size, as well as the number of parameters they are using, during training, in order to best fit the data while trying to use as few parameters as possible. This is achieved through two complementary mechanisms: growth and pruning. GNAP start with few parameters, but their size is expanded periodically during training to add more expressive power each time the network has converged to a saturation point. Between these growing phases, model parameters are trained for classification and pruned simultaneously, with complete autonomy by gradient descent. Growing phases allow GNAP to improve their classification performance, while autonomous pruning allows them to keep as few parameters as possible. Experimental results on several image classification benchmarks show that our approach can train extremely sparse neural networks with high accuracy. For example, on MNIST, we achieved 99.44% accuracy with as few as 6.2k parameters, while on CIFAR10, we achieved 92.2% accuracy with 157.8k parameters.
61. 【2603.19757】Uncertainty-aware Prototype Learning with Variational Inference for Few-shot Point Cloud Segmentation
Link: https://arxiv.org/abs/2603.19757
Authors: Yifei Zhao, Fanyu Zhao, Yinsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: generate accurate semantic, semantic segmentation aims, query point clouds, accurate semantic masks, accurate semantic
Comments: 5 pages, 3 figures, 3 tables, accepted by ICASSP 2026
Abstract:Few-shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype-based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty-aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few-shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual-stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state-of-the-art performance under different settings while providing reliable uncertainty estimation. The code is available at this https URL.
62. 【2603.19753】ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination
Link: https://arxiv.org/abs/2603.19753
Authors: Jan-Niklas Dihlmann, Mark Boss, Simon Donne, Andreas Engelhardt, Hendrik P.A. Lensch, Varun Jampani
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: long required separate, computational overhead, long required, distinct limitations, limitations and computational
Comments: Project Page: [this https URL](https://reli3d.jdihlmann.com/)
Abstract:Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object's structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. In addition, with our mixed domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: this https URL
63. 【2603.19752】PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement
Link: https://arxiv.org/abs/2603.19752
Authors: Junzhe Cao, Bo Zhao, Zhiyi Niu, Dan Guo, Yue Sun, Haochen Liang, Yong Xu, Zitong YU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enables contactless measurement, Remote photoplethysmography, facial skin induced, enables contactless, cardiac pulsation
Comments:
Abstract:Remote photoplethysmography (rPPG) enables contactless measurement of heart rate and other vital signs by analyzing subtle color variations in facial skin induced by cardiac pulsation. Current rPPG methods are mainly based on either end-to-end modeling from raw videos or intermediate spatial-temporal map (STMap) representations. The former preserves complete spatiotemporal information and can capture subtle heartbeat-related signals, but it also introduces substantial noise from motion artifacts and illumination variations. The latter stacks the temporal color changes of multiple facial regions of interest into compact two-dimensional representations, significantly reducing data volume and computational complexity, although some high-frequency details may be lost. To effectively integrate the mutual strengths, we propose PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and STMap representations. By incorporating a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder, PhysNeXt collaboratively enhances the robustness of pulse signal extraction. Experimental results demonstrate that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations. The codes will be released.
64. 【2603.19731】PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
Link: https://arxiv.org/abs/2603.19731
Authors: Jiadong Liang, Bojun Xiong, Jie Tian, Hua Li, Xiao Long, Yong Zheng, Huan Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: paper primarily investigates, driving video, primarily investigates, investigates the task, plays a crucial
Comments: Accepted to CVPR 2026. Project Page: [this https URL](https://youku-aigc.github.io/PerformRecast)
Abstract:This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at this https URL.
65. 【2603.19718】BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
Link: https://arxiv.org/abs/2603.19718
Authors: Phuong-Anh Nguyen, Tien Anh Pham, Duc-Trong Le, Cam-Van Thi Nguyen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: information-rich modalities dominate, modalities dominate optimization, dominate optimization, optimization while weaker, weaker or partially
Comments: Accepted by CVPR 2026
Abstract:Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework comprises two complementary modules: the Feature Calibration Module (FCM), which recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns; the Gradient Rebalancing Module (GRM), which balances learning dynamics across modalities by modulating gradient magnitudes and directions from both distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings. Code available at: this https URL
66. 【2603.19708】WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Link: https://arxiv.org/abs/2603.19708
Authors: Ziya Erkoç, Angela Dai, Matthias Nießner
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: models inherently possess, world model capabilities, generate high-fidelity outputs, fundamental question, inherently possess
Comments: Webpage: [this https URL](https://ziyaerkoc.com/worldagents/) Video: [this https URL](https://www.youtube.com/watch?v=Mj2FqqhurdI)
Abstract:Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
67. 【2603.19695】Demographic-Aware Self-Supervised Anomaly Detection Pretraining for Equitable Rare Cardiac Diagnosis
Link: https://arxiv.org/abs/2603.19695
Authors: Chaoqin Huang, Zi Zeng, Aofan Jiang, Yuchen Xu, Qing Cao, Kang Chen, Chenfei Chi, Yanfeng Wang, Ya Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: extremely limited case, limited case counts, detect from electrocardiograms, difficult to detect, long-tailed distribution
Comments:
Abstract:Rare cardiac anomalies are difficult to detect from electrocardiograms (ECGs) due to their long-tailed distribution with extremely limited case counts and demographic disparities in diagnostic performance. These limitations contribute to delayed recognition and uneven quality of care, creating an urgent need for a generalizable framework that enhances sensitivity while ensuring equity across diverse populations. In this study, we developed an AI-assisted two-stage ECG framework integrating self-supervised anomaly detection with demographic-aware representation learning. The first stage performs self-supervised anomaly detection pretraining by reconstructing masked global and local ECG signals, modeling signal trends, and predicting patient attributes to learn robust ECG representations without diagnostic labels. The pretrained model is then fine-tuned for multi-label ECG classification using asymmetric loss to better handle long-tail cardiac abnormalities, and additionally produces anomaly score maps for localization, with CPU-based optimization enabling practical deployment. Evaluated on a longitudinal cohort of over one million clinical ECGs, our method achieves an AUROC of 94.7% for rare anomalies and reduces the common-rare performance gap by 73%, while maintaining consistent diagnostic accuracy across age and sex groups. In conclusion, the proposed equity-aware AI framework demonstrates strong clinical utility, interpretable anomaly localization, and scalable performance across multiple cohorts, highlighting its potential to mitigate diagnostic disparities and advance equitable anomaly detection in biomedical signals and digital health. Source code is available at this https URL.
68. 【2603.19684】TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents
Link:https://arxiv.org/abs/2603.19684
Authors:Shaojie Zhuang,Lu Yin,Guangshun Wei,Yunpeng Li,Xilu Wang,Yuanfeng Zhou
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:neural networks trained, densely annotated datasets, existing approaches rely, Automatic tooth segmentation, intra-oral scanned
Comments: MICCAI 2026; Under review
Click to view abstract
Abstract:Automatic tooth segmentation and identification from intra-oral scanned 3D models are fundamental problems in digital dentistry, yet most existing approaches rely on task-specific 3D neural networks trained with densely annotated datasets, resulting in high annotation cost and limited generalization to scans from unseen sources. Thus, we propose TSegAgent, which addresses these challenges by reformulating dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task. The key idea is to combine the representational capacity of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy. Instead of learning dental-specific features, the proposed framework leverages multi-view visual abstraction and geometry-grounded reasoning to infer tooth instances and identities without task-specific training. By explicitly encoding structural constraints such as dental arch organization and volumetric relationships, the method reduces uncertainty in ambiguous cases and mitigates overfitting to particular shape distributions. Experimental results demonstrate that this reasoning-oriented formulation enables accurate and reliable tooth segmentation and identification with low computational and annotation cost, while exhibiting strong generalization across diverse and previously unseen dental scans.
69. 【2603.19682】3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
Link:https://arxiv.org/abs/2603.19682
Authors:Takeshi Noda,Yu-Shen Liu,Zhizhong Han
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:modeling of radiance, Gaussians, accurate depth rendering, high fidelity surfaces, prior
Comments: Accepted by CVPR 2026. Project page: [this https URL](https://takeshie.github.io/GSPrior)
Click to view abstract
Abstract:Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the learning of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is derived from a TSDF grid that is obtained by fusing the depth maps rendered with current 3D Gaussians. The prior measures a distance field around the estimated surface, offering a band centered at the surface for imposing more specific constraints on 3D Gaussians, such as removing Gaussians outside the band, moving Gaussians closer to the surface, and encouraging larger or smaller opacity in a geometry-aware manner. More importantly, our prior can be regularly updated by the most recent depth images which are usually more accurate and complete. In addition, the prior can also progressively narrow the band to tighten the imposed constraints. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.
70. 【2603.19681】Unbiased Dynamic Multimodal Fusion
Link:https://arxiv.org/abs/2603.19681
Authors:Shicai Wei,Kaijie Zhang,Luyi Chen,Tao He,Guiduo Duan
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Traditional multimodal methods, dynamic real-world scenarios, Traditional multimodal, modality, modality quality
Comments: CVPR2026 Findings, 11 pages, 4 figures
Click to view abstract
Abstract:Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamical multimodal methods are proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure the modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the modality hard to learn would be doubly penalized, and the performance of dynamical fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at this https URL.
71. 【2603.19678】Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
Link:https://arxiv.org/abs/2603.19678
Authors:Kunlun Xu,Haotong Cheng,Jiangmeng Li,Xu Zou,Jiahuan Zhou
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Lifelong person re-identification, unified person retrieval, Lifelong person, person retrieval model, person re-identification
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our source code is available at this https URL.
72. 【2603.19676】ATHENA: Adaptive Test-Time Steering for Improving Count Fidelity in Diffusion Models
Link:https://arxiv.org/abs/2603.19676
Authors:Mohammad Shahab Sepehri,Asal Mehradfar,Berk Tinaz,Salman Avestimehr,Mahdi Soltanolkotabi
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:achieve high visual, surprisingly exhibit systematic, exhibit systematic failures, models achieve high, high visual fidelity
Comments:
Click to view abstract
Abstract:Text-to-image diffusion models achieve high visual fidelity but surprisingly exhibit systematic failures in numerical control when prompts specify explicit object counts. To address this limitation, we introduce ATHENA, a model-agnostic, test-time adaptive steering framework that improves object count fidelity without modifying model architectures or requiring retraining. ATHENA leverages intermediate representations during sampling to estimate object counts and applies count-aware noise corrections early in the denoising process, steering the generation trajectory before structural errors become difficult to revise. We present three progressively more advanced variants of ATHENA that trade additional computation for improved numerical accuracy, ranging from static prompt-based steering to dynamically adjusted count-aware control. Experiments on established benchmarks and a new visually and semantically complex dataset show that ATHENA consistently improves count fidelity, particularly at higher target counts, while maintaining favorable accuracy-runtime trade-offs across multiple diffusion backbones.
73. 【2603.19675】DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
Link:https://arxiv.org/abs/2603.19675
Authors:Xiaolu Liu,Yicong Li,Song Wang,Junbo Chen,Angela Yao,Jianke Zhu
Categories:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords:autonomous driving systems, systems to improve, planning reliability, Recently, unreliable action planning
Comments: 18 pages, 6 figs
Click to view abstract
Abstract:Recently, world models have been incorporated into autonomous driving systems to improve planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectified flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at this https URL.
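To make the rectified-flow idea in this abstract concrete, here is a minimal sketch of the generic formulation: states are interpolated along a straight line, the regression target for the velocity field is the constant displacement between the two endpoints, and future states are obtained by Euler integration. Everything here is a toy assumption for illustration: the states are plain lists rather than latent scene features, the velocity is an oracle rather than a learned network, and no action conditioning is modeled.

```python
def interpolate(x0, x1, t):
    """Straight-line interpolation x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_rollout(x0, velocity_fn, num_steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: with the oracle (constant) velocity, the rollout reaches x1.
x0 = [0.0, 1.0]
x1 = [2.0, -1.0]
x_pred = euler_rollout(x0, lambda x, t: target_velocity(x0, x1), num_steps=4)
```

Each Euler step yields an intermediate state, which is one way to read the "progressive prediction" of future states that the abstract describes.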
74. 【2603.19672】Making Video Models Adhere to User Intent with Minor Adjustments
Link:https://arxiv.org/abs/2603.19672
Authors:Daniel Ajisafe,Eric Hedlin,Helge Rhodin,Kwang Moo Yi
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:recent drastic advancements, bounding boxes, drawn interest, recent drastic, drastic advancements
Comments: Project page and code: [this https URL](https://ubc-vision.github.io/MinorAdjustVideo/docs/webpage/index.html)
Click to view abstract
Abstract:With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular means of control is bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we move the bounding boxes to places the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.
75. 【2603.19667】Toward High-Fidelity Visual Reconstruction: From EEG-Based Conditioned Generation to Joint-Modal Guided Rebuilding
Link:https://arxiv.org/abs/2603.19667
Authors:Zhijian Gong,Tianren Yao,Wenjia Dong,Xueyuan Xu
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:Human visual reconstruction, reconstruct fine-grained visual, fine-grained visual stimuli, visual stimuli based, Human visual
Comments:
Click to view abstract
Abstract:Human visual reconstruction aims to reconstruct fine-grained visual stimuli based on subject-provided descriptions and corresponding neural signals. As a widely adopted modality, Electroencephalography (EEG) captures rich visual cognition information, encompassing complex spatial relationships and chromatic details within scenes. However, current approaches are deeply coupled with an alignment framework that forces EEG features to align with text or image semantic representation. This dependency may compress away the rich spatial and chromatic details in EEG, yielding mere conditioned image generation rather than high-fidelity visual reconstruction. To address this limitation, we propose a novel Joint-Modal Visual Reconstruction (JMVR) framework. It treats EEG and text as independent modalities for joint learning to preserve EEG-specific information for reconstruction. It further employs a multi-scale EEG encoding strategy to capture both fine- and coarse-grained features, alongside image augmentation to enhance the recovery of perceptual details. Extensive experiments on the THINGS-EEG dataset demonstrate that JMVR achieves SOTA performance against six baseline methods, specifically exhibiting superior capabilities in modeling spatial structure and chromatic fidelity.
76. 【2603.19660】Semantic Audio-Visual Navigation in Continuous Environments
Link:https://arxiv.org/abs/2603.19660
Authors:Yichen Zeng,Hebaixu Wang,Meng Liu,Yu Zhou,Chen Gao,Kehan Chen,Gongping Huang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:enables embodied agents, navigation enables embodied, Semantic Audio-Visual Navigation, navigate toward sound-emitting, leveraging both auditory
Comments: This paper has been accepted to CVPR 2026
Click to view abstract
Abstract:Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at this https URL.
77. 【2603.19659】CS-MUNet: A Channel-Spatial Dual-Stream Mamba Network for Multi-Organ Segmentation
Link:https://arxiv.org/abs/2603.19659
Authors:Yuyang Zheng,Mingda Zhang,Jianglong Qin,Qi Mo,Jingdan Pan,Haozhe Hu,Hongyi Huang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Recently Mamba-based methods, Recently Mamba-based, Mamba-based methods, shown promise, Recently
Comments: 18 pages, 5 figures
Click to view abstract
Abstract:Recently Mamba-based methods have shown promise in abdominal organ segmentation. However, existing approaches neglect cross-channel anatomical semantic collaboration and lack explicit boundary-aware feature fusion mechanisms. To address these limitations, we propose CS-MUNet with two purpose-built modules. The Boundary-Aware State Mamba module employs a Bayesian-attention framework to generate pixel-level boundary posterior maps, injected directly into Mamba's core scan parameters to embed boundary awareness into the SSM state transition mechanism, while dual-branch weight allocation enables complementary modulation between global and local structural representations. The Channel Mamba State Aggregation module redefines the channel dimension as the SSM sequence dimension to explicitly model cross-channel anatomical semantic collaboration in a data-driven manner. Experiments on two public benchmarks demonstrate that CS-MUNet consistently outperforms state-of-the-art methods across multiple metrics, establishing a new SSM modeling paradigm that jointly addresses channel semantic collaboration and boundary-aware feature fusion for abdominal multi-organ segmentation.
78. 【2603.19654】GravCal: Single-Image Calibration of IMU Gravity Priors with Per-Sample Confidence
Link:https://arxiv.org/abs/2603.19654
Authors:Haichao Zhu,Qian Zhang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:augmented reality, noisy gravity prior, visual-inertial perception, linear acceleration, transient motion
Comments: 14 pages, 4 figures
Click to view abstract
Abstract:Gravity estimation is fundamental to visual-inertial perception, augmented reality, and robotics, yet gravity priors from IMUs are often unreliable under linear acceleration, vibration, and transient motion. Existing methods often estimate gravity directly from images or assume reasonably accurate inertial input, leaving the practical problem of correcting a noisy gravity prior from a single image largely unaddressed. We present GravCal, a feedforward model for single-image gravity prior calibration. Given one RGB image and a noisy gravity prior, GravCal predicts a corrected gravity direction and a per-sample confidence score. The model combines two complementary predictions, including a residual correction of the input prior and a prior-independent image estimate, and uses a learned gate to fuse them adaptively. Extensive experiments show strong gains over raw inertial priors: GravCal reduces mean angular error from 22.02° (IMU prior) to 14.24°, with larger improvements when the prior is severely corrupted. We also introduce a novel dataset of over 148K frames with paired VIO-derived ground-truth gravity and Mahony-filter IMU priors across diverse scenes and arbitrary camera orientations. The learned gate also correlates with prior quality, making it a useful confidence signal for downstream systems.
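The headline metric in this abstract is mean angular error between predicted and ground-truth gravity directions. A minimal sketch of how such a metric is typically computed is given below; the function names are illustrative, and the paper's exact evaluation protocol is not stated in the abstract. The last lines reuse the two figures quoted in the abstract to express the gain as a relative reduction.

```python
import math


def angular_error_deg(g_pred, g_true):
    """Angle in degrees between two 3D direction vectors (any length)."""
    dot = sum(a * b for a, b in zip(g_pred, g_true))
    norm = (math.sqrt(sum(a * a for a in g_pred))
            * math.sqrt(sum(b * b for b in g_true)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds, truths):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)


# Relative reduction implied by the reported numbers (22.02 deg -> 14.24 deg).
prior_error, calibrated_error = 22.02, 14.24
relative_reduction = (prior_error - calibrated_error) / prior_error
```

On the reported figures, the relative reduction works out to roughly 35% of the raw IMU prior error.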
79. 【2603.19643】OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework
Link:https://arxiv.org/abs/2603.19643
Authors:Weixuan Zeng,Pengcheng Wei,Huaiqing Wang,Boheng Zhang,Jia Sun,Dewen Fan,Lin HE,Long Chen,Qianqian Gan,Fan Yang,Tingting Gao
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:fine-grained detail preservation, omni Virtual Try-On, Virtual Try-On framework, Virtual Try-On, methods face challenges
Comments:
Click to view abstract
Abstract:Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods face challenges with fine-grained detail preservation, generalization to complex scenes, complicated pipeline, and efficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines try-on and try-off tasks into one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset Omni-TryOn, which contains over 380k diverse and high-quality garment-model-tryon image pairs and detailed text prompts. Then, we employ the token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, thus achieving a linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple timestep prediction and an alignment loss to improve generation fidelity. Experiments reveal that, under various complex scenes, our method achieves the best performance in both the model-free VTON and VTOFF tasks and a performance comparable to current SOTA methods in the model-based VTON task.
80. 【2603.19637】UniBioTransfer: A Unified Framework for Multiple Biometrics Transfer
Link:https://arxiv.org/abs/2603.19637
Authors:Caiyi Sun,Yujing Sun,Xiangyu Li,Yuhang Zheng,Yiming Ren,Jiamin Wang,Yuexin Ma,Siu-Ming Yiu
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:task-driven paradigm, deepface generation tasks, Deepface generation, transfer, face transfer
Comments:
Click to view abstract
Abstract:Deepface generation has traditionally followed a task-driven paradigm, where distinct tasks (e.g., face transfer and hair transfer) are addressed by task-specific models. Nevertheless, this single-task setting severely limits model generalization and scalability. A unified model capable of solving multiple deepface generation tasks in a single pass represents a promising and practical direction, yet remains challenging due to data scarcity and cross-task conflicts arising from heterogeneous attribute transformations. To this end, we propose UniBioTransfer, the first unified framework capable of handling both conventional deepface tasks (e.g., face transfer and face reenactment) and shape-varying transformations (e.g., hair transfer and head transfer). Besides, UniBioTransfer naturally generalizes to unseen tasks, like lip, eye, and glasses transfer, with minimal fine-tuning. Generally, UniBioTransfer addresses data insufficiency in multi-task generation through a unified data construction strategy, including a swapping-based corruption mechanism designed for spatially dynamic attributes like hair. It further mitigates cross-task interference via an innovative BioMoE, a mixture-of-experts based model coupled with a novel two-stage training strategy that effectively disentangles task-specific knowledge. Extensive experiments demonstrate the effectiveness, generalization, and scalability of UniBioTransfer, outperforming both existing unified models and task-specific methods across a wide range of deepface generation tasks. Project page is at this https URL
81. 【2603.19628】Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking
Link:https://arxiv.org/abs/2603.19628
Authors:Yiheng Wang,Changhong Fu,Liangliang Yao,Haobo Zuo,Zijie Zhang
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:Robust feature encoding, feature encoding constitutes, feature encoding, ensuring reliable tracking, feature encoding methods
Comments: Accepted to IEEE International Conference on Robotics and Automation 2026
Click to view abstract
Abstract:Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at this https URL.
82. 【2603.19625】IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1
Link:https://arxiv.org/abs/2603.19625
Authors:Jun Wang,Xiaoyan Huang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Relative Pose Regression, Existing Relative Pose, fundamental for SLAM, Relative pose estimation, Relative pose
Comments:
Click to view abstract
Abstract:Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
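The "rotational homography H_inf" in this abstract most likely refers to the standard infinite homography, H_inf = K R K^{-1}, which maps pixels between two views that differ only by a rotation R under shared intrinsics K. The sketch below illustrates that textbook relation on single pixels. It assumes a simple pinhole model and the convention that R takes view-1 camera coordinates to view-2 camera coordinates; how the paper applies the warp to feature maps is not described in the abstract, and all function names are illustrative.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(f, cx, cy):
    """Pinhole intrinsics with equal focal lengths and zero skew."""
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(f, cx, cy):
    """Closed-form inverse of the intrinsics matrix above."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def rot_y(theta):
    """Rotation by theta radians about the camera y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotational_homography(K, K_inv, R):
    """Infinite homography H_inf = K @ R @ K^{-1}."""
    return matmul3(matmul3(K, R), K_inv)


def warp_pixel(H, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w


f, cx, cy = 500.0, 320.0, 240.0
K, K_inv = intrinsics(f, cx, cy), intrinsics_inv(f, cx, cy)
H = rotational_homography(K, K_inv, rot_y(0.1))
u_new, v_new = warp_pixel(H, cx, cy)
```

Under these assumptions the principal point shifts horizontally by f * tan(theta) and keeps its row, which is a quick sanity check that the rotation has been compensated independently of scene depth.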
83. 【2603.19623】Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
Link:https://arxiv.org/abs/2603.19623
Authors:Chunlei Zhang,Jiahao Xia,Yun Xiao,Bo Jiang,Jian Zhang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:downstream cross-modal analysis, cross-modal analysis, prerequisite for downstream, downstream cross-modal, hybrid parameter prediction
Comments: Accepted by CVPR 2026 main track
Click to view abstract
Abstract:Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.
84. 【2603.19616】UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
Link:https://arxiv.org/abs/2603.19616
Authors:Chuanrui Zhang,Yingshuang Zou,ZhengXian Wu,Yonggen Ling,Yuxiao Yang,Ziwei Wang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Perceiving and reconstructing, transfer tasks, Perceiving, robotics community, Existing methods rely
Comments:
Click to view abstract
Abstract:Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserving true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
85. 【2603.19613】OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis
Link:https://arxiv.org/abs/2603.19613
Authors:Jinglin Liang,Zijian Zhou,Rui Huang,Shuangping Huang,Yichen Gong
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:generate unseen views, aims to generate, generate unseen, limited number, unseen views
Comments: 26 pages, 10 figures
Click to view abstract
Abstract:Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometry- and appearance-consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via attention mechanism, thereby improving geometric consistency. Moreover, we apply a pixel-space supervision to alleviate blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
86. 【2603.19610】ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Link: https://arxiv.org/abs/2603.19610
Authors: Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: current Video-LLMs achieve, Video-LLMs achieve impressive, achieve impressive performance, efficiency remains constrained, decoding efficiency remains
Comments:
Abstract:Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\sim1.8\times$ with high accepted lengths, and accelerates various video understanding benchmarks by 3.36$\times$ on LLaVA-Onevision-72B and 2.42$\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
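The draft-then-verify loop that speculative decoding builds on can be illustrated with a minimal greedy sketch. This is not ParallelVLM itself: the parallelized stages and the Unbiased Verifier-Guided Pruning strategy are not modeled, verification is done token by token rather than in one batched pass, and `target_next` and `draft_next` are hypothetical stand-ins for the target and draft models. The sketch only shows why greedy verification is lossless: every emitted token is the one the target model would have chosen.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], num_new: int,
                       window: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding (toy, sequential verification)."""
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # Draft stage: the cheap model proposes up to `window` tokens.
        proposal: List[int] = []
        for _ in range(min(window, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # Verify stage: accept drafted tokens while the target agrees.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)  # the emitted token always comes from the target
            if expected != tok:
                break  # first mismatch: discard the remaining drafted tokens
    return out
```

The more often the draft agrees with the target, the more tokens are accepted per verification round, which is where the speedup comes from in a real system that verifies the whole window in a single target pass.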
87. 【2603.19609】LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Link: https://arxiv.org/abs/2603.19609
Authors: Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Maojun Zhang, Yu Liu, Shen Yan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: generalized aerial visual, present LoD-Loc, dense urban environments, LoD-Loc, aerial visual localization
Comments:
Abstract:We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios with a large margin. The project is available at this https URL.
88. 【2603.19608】FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
Link: https://arxiv.org/abs/2603.19608
Authors: Ming Hu, Yongsheng Huo, Mingyu Dou, Jianfu Yin, Peng Zhao, Yao Wang, Cong Hu, Bingliang Hu, Quan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: medical applications, crucial in industrial, industrial and medical, Fine-grained anomaly detection, making zero-shot detection
Comments:
Abstract:Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
89. 【2603.19607】Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning
Link: https://arxiv.org/abs/2603.19607
Authors: Qin Zhang, Peiyu Jing, Hong-Xing Yu, Fangqiang Ding, Fan Nie, Weimin Wang, Yilun Du, James Zou, Jiajun Wu, Bing Shuai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: simulators for storytelling, physical, world simulators, Video, Video generation
Comments:
Abstract:Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at this https URL.
90. 【2603.19606】Beyond Quadratic: Linear-Time Change Detection with RWKV
Link: https://arxiv.org/abs/2603.19606
Authors: Zhenyu Yang, Gensheng Pei, Tao Chen, Xia Yuan, Haofeng Zhang, Xiangbo Shu, Yazhou Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: lack global context, prohibitive computational cost, capture long-range dependencies, Transformers capture long-range, remote sensing change
Comments:
Abstract:Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. The core of our approach features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representations, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection. Our code and model are publicly available.
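As a reading aid for the two reported metrics: when IoU and F1 are computed for a single class from the same pooled confusion counts (an assumption here, since the abstract does not state the evaluation protocol), they are tied by F1 = 2·IoU / (1 + IoU). Under that assumption an IoU of 85.46% corresponds to an F1 of about 92.16%, which matches the reported pair. The function names below are illustrative only.

```python
def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over union for one class from confusion counts."""
    return tp / (tp + fp + fn)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score (harmonic mean of precision and recall) from the same counts."""
    return 2 * tp / (2 * tp + fp + fn)


def f1_from_iou(iou_value: float) -> float:
    """Convert IoU to F1 when both come from the same confusion counts."""
    return 2 * iou_value / (1 + iou_value)


# Consistency check on the figures quoted in the abstract.
reported_iou = 0.8546
implied_f1 = f1_from_iou(reported_iou)
```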
91. 【2603.19601】K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups
Link: https://arxiv.org/abs/2603.19601
Authors: ZhiMing Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: incurring inevitable phase, neglect manifold constraints, non-stationary covariance matrices, inevitable phase lag, Tracking non-stationary covariance
Comments: 33 pages, 13 figures
Abstract:Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that this second-order dynamics achieves zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.
92. 【2603.19598】FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
Link: https://arxiv.org/abs/2603.19598
Authors: Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: extensive industrial applications, industrial applications, demanding both high, precise control, geometry and appearance
Comments:
Abstract:Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
93. 【2603.19588】HiFiGaze: Improving Eye Tracking Accuracy Using Screen Content Knowledge
Link: https://arxiv.org/abs/2603.19588
Authors: Taejun Kim, Vimal Mollyn, Riku Arakawa, Chris Harrison
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: consumer computing devices, estimation on consumer, consumer computing, accurate approach, computing devices
Comments: ACM CHI 2026
Abstract:We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in e.g., smartphones, laptops, and desktops - 4K or greater in high-end devices - such that it is now possible to capture the 2D reflection of a device's screen in the user's eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen - in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user's screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
94. 【2603.19575】MagicSeg: Open-World Segmentation Pretraining via Counterfactural Diffusion-Based Auto-Generation
Link: https://arxiv.org/abs/2603.19575
Authors: Kaixin Cai, Pengzhen Ren, Jianhua Han, Yi Zhu, Hang Xu, Jianzhuang Liu, Xiaodan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: presently relies significantly, extensive image-text pair, fine-grained pixel annotations, Open-world semantic segmentation, image-text pair datasets
Comments:
Abstract:Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named "MagicSeg". Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset's effectiveness in enhancing open-world semantic segmentation capabilities. Project website: this https URL.
95. 【2603.19571】CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management
Link: https://arxiv.org/abs/2603.19571
Authors: Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
Comments:
Abstract:Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video understanding. Our code will be released at this https URL.
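The abstract does not define the Curvature Score or the K-Sigma rule, so the following is only a rough sketch of the general idea under stated assumptions: curvature is approximated by the turning angle between successive displacements of per-frame feature vectors, and the threshold is taken as the running mean plus k standard deviations of earlier scores. `turning_angle` and `KSigmaRouter` are hypothetical names, and the token budget is not modeled.

```python
import math
from typing import Sequence


def turning_angle(prev: Sequence[float], cur: Sequence[float],
                  nxt: Sequence[float]) -> float:
    """Angle in radians between successive displacements (a curvature proxy)."""
    a = [c - p for p, c in zip(prev, cur)]
    b = [n - c for c, n in zip(cur, nxt)]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no motion in feature space: treat as flat
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


class KSigmaRouter:
    """Routes scores using an online mean + k*std threshold (Welford update)."""

    def __init__(self, k: float = 2.0) -> None:
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def route(self, score: float) -> str:
        # Decide against statistics of earlier scores, then update them.
        label = "fuzzy"
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if score > self.mean + self.k * std:
                label = "clear"
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return label
```

A frame whose score stands out from the recent history is kept in the "clear" state, while the rest fall into the "fuzzy" state; how each state is stored and compressed is left open here.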
96. 【2603.19570】Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation
Link: https://arxiv.org/abs/2603.19570
Authors: Chuhan Wang, Hao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern generative modeling, Image tokenization plays, mapping visual inputs, modern generative, generative modeling
Comments:
Abstract:Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
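The coarse-to-fine schedule described above can be made concrete with a small sketch. It assumes square images, resolution doubling capped at the full size, and a per-step cost proportional to pixel count; none of these details are stated in the abstract, and the function names are illustrative. Under those assumptions the number of stages grows logarithmically with the resolution ratio, and one step per scale costs less than 4/3 of a single full-resolution step.

```python
from typing import List


def resolution_schedule(coarse: int, full: int) -> List[int]:
    """Side lengths visited when doubling from `coarse` up to `full`."""
    if coarse <= 0 or full < coarse:
        raise ValueError("need 0 < coarse <= full")
    sizes = [coarse]
    while sizes[-1] < full:
        sizes.append(min(sizes[-1] * 2, full))  # cap the last stage at `full`
    return sizes


def relative_cost(sizes: List[int]) -> float:
    """Total cost of one step per scale, in units of one full-resolution step.

    Assumes cost is proportional to the number of pixels (side length squared).
    """
    full = sizes[-1]
    return sum((s / full) ** 2 for s in sizes)


schedule = resolution_schedule(32, 256)
cost = relative_cost(schedule)
```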
97. 【2603.19567】Efficiency Follows Global-Local Decoupling
Link: https://arxiv.org/abs/2603.19567
Authors: Zhenyu Yang, Gensheng Pei, Tao Chen, Yichao Zhou, Tianfei Zhou, Yazhou Yao, Fumin Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern vision models, remaining computationally affordable, Modern vision, capture image-level context, sacrificing local detail
Comments:
Abstract:Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.
98. 【2603.19566】PhyUnfold-Net: Advancing Remote Sensing Change Detection with Physics-Guided Deep Unfolding
Link: https://arxiv.org/abs/2603.19566
Authors: Zelin Lei, Yaoxing Ren, Jiaming Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Bi-temporal change detection, including illumination, Bi-temporal change, acquisition discrepancies, false alarms
Comments: 18 pages, 8 figures, 9 tables. Appendix included
Abstract:Bi-temporal change detection is highly sensitive to acquisition discrepancies, including illumination, season, and atmosphere, which often cause false alarms. We observe that genuine changes exhibit higher patch-wise singular-value entropy (SVE) than pseudo changes in the feature-difference space. Motivated by this physical prior, we propose PhyUnfold-Net, a physics-guided deep unfolding framework that formulates change detection as an explicit decomposition problem. The proposed Iterative Change Decomposition Module (ICDM) unrolls a multi-step solver to progressively separate mixed discrepancy features into a change component and a nuisance component. To stabilize this process, we introduce a staged Exploration-and-Constraint loss (S-SEC), which encourages component separation in early steps while constraining nuisance magnitude in later steps to avoid degenerate solutions. We further design a Wavelet Spectral Suppression Module (WSSM) to suppress acquisition-induced spectral mismatch before decomposition. Experiments on four benchmarks show improvements over state-of-the-art methods, with gains under challenging conditions.
99. 【2603.19565】PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition
Link: https://arxiv.org/abs/2603.19565
Authors: Minghe Xu, Rouying Wu, ChiaWei Chu, Xiao Wang, Yu Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Event-based pedestrian attribute, enhance RGB cameras, Event-based pedestrian, pedestrian attribute recognition, leverages motion cues
Comments:
Abstract:Event-based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency-domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory-augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, followed by feed-forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB-Event PAR framework. The source code of this paper will be released on this https URL
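For readers unfamiliar with the transform pair mentioned above, here is a minimal pure-Python sketch of an orthonormal DCT-II and its inverse on a 1D signal. It is not the Event Prompter: the abstract does not say which DCT variant is used, how event data is laid out, or what happens between the DCT and the IDCT, so the 1D setting and the function names are assumptions for illustration.

```python
import math
from typing import List


def dct(signal: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, x in enumerate(signal))
        coeffs.append(scale * acc)
    return coeffs


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            acc += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(acc)
    return signal
```

Applying `idct(dct(x))` recovers `x` up to floating-point error, and a constant signal puts all of its energy into the first coefficient, which is why frequency-domain features can separate slow structure from fast changes.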
100. 【2603.19563】Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search
Link: https://arxiv.org/abs/2603.19563
Authors: Haoyu Zhang, Zhihao Yu, Rui Wang, Yaochu Jin, Qiqi Liu, Ran Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Modern computer vision, resource-constrained edge devices, requires balancing predictive, computer vision requires, vision requires balancing
Comments:
Abstract:Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at this https URL
101. 【2603.19552】StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention
Link: https://arxiv.org/abs/2603.19552
Authors: Zhongrui Yu, Zhao Wang, Yijia Xie, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving applications, time-consuming per-scene optimization, enables efficient utilization, rapid scene reconstruction, scene reconstruction enables
Comments:
Abstract:Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: this https URL.
102. 【2603.19547】SeeClear: Reliable Transparent Object Depth Estimation via Generative Opacification
Link: https://arxiv.org/abs/2603.19547
Authors: Xiaoying Wang, Yumeng He, Jingkai Shi, Jiayin Lu, Yin Yang, Ying Jiang, Chenfanfu Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: estimation remains challenging, Monocular depth estimation, transparent objects, remains challenging, refraction and transmission
Comments: Project page: https://heyumeng.com/SeeClear-web/. 19 pages, 12 figures
Abstract:Monocular depth estimation remains challenging for transparent objects, where refraction and transmission are difficult to model and break the appearance assumptions used by depth networks. As a result, state-of-the-art estimators often produce unstable or incorrect depth predictions for transparent materials. We propose SeeClear, a novel framework that converts transparent objects into generative opaque images, enabling stable monocular depth estimation for transparent objects. Given an input image, we first localize transparent regions and transform their refractive appearance into geometrically consistent opaque shapes using a diffusion-based generative opacification module. The processed image is then fed into an off-the-shelf monocular depth estimator without retraining or architectural changes. To train the opacification model, we construct SeeClear-396k, a synthetic dataset containing 396k paired transparent-opaque renderings. Experiments on both synthetic and real-world datasets show that SeeClear significantly improves depth estimation for transparent objects. Project page: this https URL
103. 【2603.19546】Subspace Kernel Learning on Tensor Sequences
Link: https://arxiv.org/abs/2603.19546
Authors: Lei Wang, Xi Ding, Yongsheng Gao, Piotr Koniusz
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: requires capturing complex, remaining computationally efficient, capturing complex interactions, represented as higher-order, requires capturing
Comments: Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract:Learning from structured multi-way data, represented as higher-order tensors, requires capturing complex interactions across tensor modes while remaining computationally efficient. We introduce Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for $M$-mode tensors that compares mode-wise subspaces derived from tensor unfoldings, enabling expressive and robust similarity measure. To handle large-scale tensor data, we propose a scalable Nyström kernel linearization with dynamically learned pivot tensors obtained via soft $k$-means clustering. A key innovation of UKTL is its uncertainty-aware subspace weighting, which adaptively down-weights unreliable mode components based on estimated confidence, improving robustness and interpretability in comparisons between input and pivot tensors. Our framework is fully end-to-end trainable and naturally incorporates both multi-way and multi-mode interactions through structured kernel compositions. Extensive evaluations on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) show that UKTL achieves state-of-the-art performance, superior generalization, and meaningful mode-wise insights. This work establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences.
104. 【2603.19538】MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
Link: https://arxiv.org/abs/2603.19538
Authors: Changwoo Jeon, Rishi Upadhyay, Achuta Kadambi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: box lifting problem, understanding has largely, largely been cast, Monocular, object understanding
Comments: 27 pages, 9 figures, including supplementary material
Abstract:Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
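As background for why projected corners normally depend on camera intrinsics, the following is a minimal sketch of the standard pinhole projection. It is not MoCA3D's method; the function name `project_point` and the sample focal lengths and principal point are assumptions chosen only for illustration.

```python
def project_point(point_3d, fx, fy, cx, cy):
    """Project a camera-frame 3D point (X, Y, Z) to pixel coordinates (u, v)
    with the pinhole model. fx, fy, cx, cy are the camera intrinsics."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (u, v)


# One hypothetical 3D box corner, projected with two different focal lengths.
corner = (0.5, -0.25, 5.0)
print(project_point(corner, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0))
print(project_point(corner, fx=500.0, fy=500.0, cx=640.0, cy=360.0))
```

The same 3D corner lands on different pixels when the focal length changes, which is why a lifting pipeline without known intrinsics cannot recover image-plane corners by projection alone.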
105. 【2603.19535】Behavioral Engagement in VR-Based Sign Language Learning: Visual Attention as a Predictor of Performance and Temporal Dynamics
Link: https://arxiv.org/abs/2603.19535
Authors: Davide Traini, José Manuel Alcalde-Llergo, Mariana Buenestado-Fernández, Domenico Ursino, Enrique Yeguas-Bolívar
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: virtual reality application, reality application designed, Video Replay Frequency, Post-Playback Viewing Time, study analyzes behavioral
Comments: 22 pages. 5 figures. 2 tables
Abstract:This study analyzes behavioral engagement in SONAR, a virtual reality application designed for sign language training and validation. We focus on three automatically derived engagement indicators (Visual Attention (VA), Video Replay Frequency (VRF), and Post-Playback Viewing Time (PPVT)) and examine their relationship with learning performance. Participants completed a self-paced Training phase, followed by a Validation quiz assessing retention. We employed Pearson correlation analysis to examine the relationships between engagement indicators and quiz performance, followed by binomial Generalized Linear Model (GLM) regression to assess their joint predictive contributions. Additionally, we conducted temporal analysis by aggregating moment-to-moment VA traces across all learners to characterize engagement dynamics during the learning session. Results show that VA exhibits a strong positive correlation with quiz performance, followed by PPVT, whereas VRF shows no meaningful association. A binomial GLM confirms that VA and PPVT are significant predictors of learning success, jointly explaining a substantial proportion of performance variance. Going beyond outcome-oriented analysis, we characterize temporal engagement patterns by aggregating moment-to-moment VA traces across all learners. The temporal profile reveals distinct attention peaks aligned with informationally dense segments of both training and validation videos, as well as phase-specific engagement dynamics, including initial acclimatization, oscillatory attention cycles during learning, and pronounced attentional peaks during assessment. Together, these findings highlight the central role of sustained and strategically allocated visual attention in VR-based sign language learning and demonstrate the value of behavioral trace data for understanding and predicting learner engagement in immersive environments.
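The Pearson correlation step mentioned in the abstract can be sketched as follows. This is a plain textbook implementation; the sample numbers are made up for illustration and are not the study's data, and the variable names are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-learner values: visual-attention ratio and quiz score.
visual_attention = [0.42, 0.55, 0.61, 0.70, 0.83]
quiz_score = [5, 6, 6, 8, 9]
print(pearson_r(visual_attention, quiz_score))
```

A value near +1 would indicate that learners with higher attention also scored higher; the binomial GLM described in the abstract goes further by modelling several indicators jointly.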
106. 【2603.19533】Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion
Link: https://arxiv.org/abs/2603.19533
Authors: Sima Ashayer, Hoang H. Nguyen, Yu Liang, Mina Sartipi
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Pedestrian intention prediction, Pedestrian intention, urban environments, accurate for autonomous, autonomous vehicles
Comments: Accepted to IEEE Intelligent Vehicles Symposium (IV) 2026. 8 pages, 3 figures
Abstract:Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models by achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC by using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
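The idea of selective prediction from Mahalanobis scores can be sketched as below. This is a generic illustration, not the paper's implementation: it assumes 2-D features, a known 2x2 covariance, and made-up distances and correctness flags.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution with the
    given mean and 2x2 covariance matrix ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx = x[0] - mean[0]
    dy = x[1] - mean[1]
    # Quadratic form with the inverse covariance (1/det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def selective_accuracy(distances, correct, coverage):
    """Accuracy on the `coverage` fraction of samples with the smallest
    distance, i.e. the samples that look most in-distribution."""
    n_keep = max(1, round(coverage * len(distances)))
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


# Hypothetical scores: the one wrong prediction also has the largest distance.
distances = [0.5, 1.0, 3.0, 0.2, 0.9]
correct = [1, 1, 0, 1, 1]
print(selective_accuracy(distances, correct, coverage=1.0))
print(selective_accuracy(distances, correct, coverage=0.8))
```

Abstaining on the most out-of-distribution samples raises accuracy on the rest only when errors concentrate at large distances, which is the behaviour the abstract reports at 80% coverage.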
107. 【2603.19531】dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
Link: https://arxiv.org/abs/2603.19531
Authors: Saikat Dutta, Biplab Banerjee, Hamid Rezatofighi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: assigns pixel-level labels, demanding reliable generalization, assigns pixel-level, text-defined categories, demanding reliable
Comments:
Abstract:Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending this http URL into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from a ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.
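Sliding-window aggregation in general can be sketched as follows. This is a 1-D toy version that only averages overlapping window predictions; the global-context part of the paper's strategy is not modelled, and `predict_fn` is a hypothetical stand-in for a segmentation model.

```python
def sliding_window_average(length, window, stride, predict_fn):
    """Average per-position scores from overlapping windows over a 1-D signal.

    predict_fn(start, end) must return one score per position in [start, end).
    """
    sums = [0.0] * length
    counts = [0] * length
    last_start = max(length - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        # Add a final window so the tail of the signal is always covered.
        starts.append(last_start)
    for start in starts:
        end = min(start + window, length)
        scores = predict_fn(start, end)
        for offset, score in enumerate(scores):
            sums[start + offset] += score
            counts[start + offset] += 1
    return [total / count for total, count in zip(sums, counts)]


def toy_predict(start, end):
    """Toy predictor: every position in a window gets the window's start index."""
    return [float(start)] * (end - start)


print(sliding_window_average(length=5, window=3, stride=2, predict_fn=toy_predict))
```

Positions covered by more than one window receive the mean of the overlapping predictions, which smooths seams between windows.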
108. 【2603.19529】SurfaceXR: Fusing Smartwatch IMUs and Egocentric Hand Pose for Seamless Surface Interactions
Link: https://arxiv.org/abs/2603.19529
Authors: Vasco Xu, Brian Chen, Eric J. Gonzalez, Andrea Colaço, Henry Hoffmann, Mar Gonzalez-Franco, Karan Ahuja
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: Extended Reality, fatigue and imprecision, Mid-air gestures, Mid-air, Reality
Comments: Accepted to IEEE VR 2026 as a TVCG journal paper
Abstract:Mid-air gestures in Extended Reality (XR) often cause fatigue and imprecision. Surface-based interactions offer improved accuracy and comfort, but current egocentric vision methods struggle due to hand tracking challenges and unreliable surface plane estimation. We introduce SurfaceXR, a sensor fusion approach combining headset-based hand tracking with smartwatch IMU data to enable robust inputs on everyday surfaces. Our insight is that these modalities are complementary: hand tracking provides 3D positional data while IMUs capture high-frequency motion. A 21-participant study validates SurfaceXR's effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
109. 【2603.19523】Recognising BSL Fingerspelling in Continuous Signing Sequences
Link: https://arxiv.org/abs/2603.19523
Authors: Alyssa Chan, Taein Kwon, Andrew Zisserman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: lack established lexical, British Sign Language, component of British, established lexical signs, technical terms
Comments: 11 pages, 15 figures
Abstract:Fingerspelling is a critical component of British Sign Language (BSL), used to spell proper names, technical terms, and words that lack established lexical signs. Fingerspelling recognition is challenging due to the rapid pace of signing and common letter omissions by native signers, while existing BSL fingerspelling datasets are either small in scale or temporally and letter-wise inaccurate. In this work, we introduce a new large-scale BSL fingerspelling dataset, FS23K, constructed using an iterative annotation framework. In addition, we propose a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. As a result, with refined annotations, our approach halves the character error rate (CER) compared to the prior state of the art on fingerspelling recognition. These findings demonstrate the effectiveness of our method and highlight its potential to support future research in sign language understanding and scalable, automated annotation pipelines. The project page can be found at this https URL.
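For readers unfamiliar with the metric, character error rate is conventionally the Levenshtein edit distance between the predicted and reference strings divided by the reference length. The sketch below shows that standard definition; the example words are invented and the paper's exact evaluation protocol may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions
    needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# A signer omitting one letter yields one deletion against the reference.
print(cer("london", "lndon"))
```

Halving CER, as the abstract reports, therefore means roughly half as many character edits per reference character.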
110. 【2603.19517】ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding
Link: https://arxiv.org/abs/2603.19517
Authors: Oishi Banerjee, Sung Eun Kim, Alexandra N. Willauer, Julius M. Kernbach, Abeer Rihan Alomaish, Reema Abdulwahab S. Alghamdi, Hassan Rayhan Alomaish, Mohammed Baharoon, Xiaoman Zhang, Julian Nicolas Acosta, Christine Zhou, Pranav Rajpurkar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: online health conversations, comprehensive benchmark evaluates, Everyday photographs, health conversations, ordinary cameras
Comments: 11 pages, 4 figures
Abstract:Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.
111. 【2603.19516】Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Link: https://arxiv.org/abs/2603.19516
Authors: Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Yuanzhe Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: shown strong generalization, natural domains, shown strong, strong generalization, abilities in natural
Comments: Computer Vision and Pattern Recognition 2026
Abstract:Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, an endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
112. 【2603.19512】FedAgain: A Trust-Based and Robust Federated Learning Strategy for an Automated Kidney Stone Identification in Ureteroscopy
Link: https://arxiv.org/abs/2603.19512
Authors: Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clément Larose, Christian Daul, Andres Mendez-Vazquez, Gilberto Ochoa-Ruiz
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: corrupted images acquired, imaging critically depends, Federated Learning, artificial intelligence, highly challenging
Comments: Paper submitted for peer review
Abstract:The reliability of artificial intelligence (AI) in medical imaging critically depends on its robustness to heterogeneous and corrupted images acquired with diverse devices across different hospitals, which is highly challenging. Therefore, this paper introduces FedAgain, a trust-based Federated Learning (FL) strategy designed to enhance robustness and generalization for automated kidney stone identification from endoscopic images. FedAgain integrates a dual trust mechanism that combines benchmark reliability and model divergence to dynamically weight client contributions, mitigating the impact of noisy or adversarial updates during aggregation. The framework enables the training of collaborative models across multiple institutions while preserving data privacy and promoting stable convergence under real-world conditions. Extensive experiments across five datasets, including two canonical benchmarks (MNIST and CIFAR-10), two private multi-institutional kidney stone datasets, and one public dataset (MyStone), demonstrate that FedAgain consistently outperforms standard Federated Learning baselines under non-independent and identically distributed (non-IID) data and corrupted-client scenarios. By maintaining diagnostic accuracy and performance stability under varying conditions, FedAgain represents a practical advance toward reliable, privacy-preserving, and clinically deployable federated AI for medical imaging.
113. 【2603.19503】Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement
Link: https://arxiv.org/abs/2603.19503
Authors: Ange-Clément Akazan, Abdoulaye Koroko, Verlon Roel Mbingui, Choukouriyah Arinloye, Hassan Fifen, Rose Bandolo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: deep Convolutional Neural, Convolutional Neural Networks, Convolutional Neural, large Vision Transformers, deep Convolutional
Comments:
Abstract:The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNN) to large Vision Transformers (ViT). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the Vision Tiny Recursion Model (ViTRM): a parameter-efficient architecture that replaces the $L$-layer ViT encoder with a single tiny $k$-layer block ($k=3$) applied recursively $N$ times. Despite using up to $6\times$ and $84\times$ fewer parameters than CNN-based models and ViT respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
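The weight-sharing idea can be illustrated with a rough parameter count and a toy recursion. This sketch does not reproduce ViTRM or its reported ratios: the width of 192, the 12-layer baseline, and the approximation of 12·d² parameters per transformer layer (attention projections plus a 4x-expansion MLP, ignoring biases and norms) are all assumptions.

```python
def transformer_layer_params(width):
    """Rough per-layer count: 4*d^2 for attention projections plus 8*d^2 for
    a 4x-expansion MLP, ignoring biases and normalization parameters."""
    return 12 * width * width


def stacked_params(num_distinct_layers, width):
    """Parameters of a stack with `num_distinct_layers` unshared layers."""
    return num_distinct_layers * transformer_layer_params(width)


def recursive_refine(state, block, n_steps):
    """Apply the same block n_steps times, reusing its weights every step."""
    for _ in range(n_steps):
        state = block(state)
    return state


width = 192                            # hypothetical embedding width
baseline = stacked_params(12, width)   # 12 distinct layers
shared = stacked_params(3, width)      # one 3-layer block, reused N times
print(baseline, shared, baseline / shared)

# Toy scalar "block": repeated application refines the state toward 2.0.
print(recursive_refine(0.0, lambda s: 0.5 * s + 1.0, n_steps=3))
```

Because the block is reused, increasing the number of recursion steps adds compute but no parameters, which is the trade-off the abstract describes.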
114. 【2603.19500】Teaching an Agent to Sketch One Part at a Time
Link: https://arxiv.org/abs/2603.19500
Authors: Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: producing vector sketches, develop a method, method for producing, vector sketches, producing vector
Comments:
Abstract:We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning procedure following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
115. 【2603.19496】VeloxNet: Efficient Spatial Gating for Lightweight Embedded Image Classification
Link: https://arxiv.org/abs/2603.19496
Authors: Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi, Anton Netchaev, Steven Sloan, Ken Pathak, Kendall N. Niles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deploying deep learning, deep learning models, infrastructure inspection requires, inspection requires architectures, Deploying deep
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Deploying deep learning models on embedded devices for tasks such as aerial disaster monitoring and infrastructure inspection requires architectures that balance accuracy with strict constraints on model size, memory, and latency. This paper introduces VeloxNet, a lightweight CNN architecture that replaces SqueezeNet's fire modules with gated multi-layer perceptron (gMLP) blocks for embedded image classification. Each gMLP block uses a spatial gating unit (SGU) that applies learned spatial projections and multiplicative gating, enabling the network to capture spatial dependencies across the full feature map in a single layer. Unlike fire modules, which are limited to local receptive fields defined by small convolutional kernels, the SGU provides global spatial modeling at each layer with fewer parameters. We evaluate VeloxNet on three aerial image datasets: the Aerial Image Database for Emergency Response (AIDER), the Comprehensive Disaster Dataset (CDD), and the Levee Defect Dataset (LDD), comparing against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers. VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. These results demonstrate that substituting local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency for resource-constrained deployment. The source code will be made publicly available upon acceptance of the paper.
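The spatial gating unit described above can be sketched in simplified form, together with a check of the reported parameter reduction. This follows the general gMLP formulation (split channels, mix one half across token positions, gate the other half multiplicatively) but omits the normalization that gMLP applies before the spatial projection; it is not VeloxNet's code, and all names are illustrative.

```python
def spatial_gating_unit(tokens, spatial_weights, spatial_bias):
    """Simplified SGU.

    tokens: list of n token vectors, each with an even number of channels.
    spatial_weights: n x n matrix mixing information across token positions.
    spatial_bias: list of n biases, one per output position.
    """
    n = len(tokens)
    half = len(tokens[0]) // 2
    u = [t[:half] for t in tokens]   # half that gets gated
    v = [t[half:] for t in tokens]   # half that produces the gate
    out = []
    for i in range(n):
        gated = []
        for c in range(half):
            # Every position j contributes to position i: global spatial mixing.
            mixed = sum(spatial_weights[i][j] * v[j][c] for j in range(n))
            gated.append(u[i][c] * (mixed + spatial_bias[i]))
        out.append(gated)
    return out


def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


tokens = [[1.0, 2.0], [3.0, 4.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # each position gates with the other's v
print(spatial_gating_unit(tokens, swap, [0.0, 0.0]))
print(round(relative_reduction(740970, 399366), 1))
```

The second print confirms that the parameter counts quoted in the abstract are consistent with the stated 46.1% reduction.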
116. 【2603.19482】Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Link: https://arxiv.org/abs/2603.19482
Authors: Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large vision language, demonstrated impressive performance, Large vision, vision language models, vision language
Comments:
Abstract:Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
117. 【2603.19481】Narrative Aligned Long Form Video Question Answering
Link: https://arxiv.org/abs/2603.19481
Authors: Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, Recent progress, large language models, progress in multimodal, multimodal large
Comments:
Abstract:Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
118. 【2603.19466】ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Link: https://arxiv.org/abs/2603.19466
Authors: Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Effective collaboration begins, Effective collaboration, collaboration begins, begins with knowing, Effective
Comments:
Abstract:Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
119. 【2603.19456】In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing
Link: https://arxiv.org/abs/2603.19456
Authors: Xiao Fang, Yiming Gong, Stanislav Panev, Celso de Melo, Shuowen Hu, Shayok Chakraborty, Fernando De la Torre
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deep neural networks, achieved remarkable success, remain highly vulnerable, Deep neural, neural networks
Comments: 45 pages, 35 figures
Abstract:Deep neural networks (DNNs) have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. Among them, camouflage attacks manipulate an object's visible appearance to deceive detectors while remaining stealthy to humans. In this paper, we propose a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem. Specifically, we explore both image-level and scene-level camouflage generation strategies, and fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images. We design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Extensive experiments on the COCO and LINZ datasets show that our method achieves significantly stronger attack effectiveness, leading to more than 38% AP50 decrease, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing approaches. Furthermore, our framework generalizes effectively to unseen black-box detectors and exhibits promising transferability to the physical world. Project page is available at this https URL
120. 【2603.19451】LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray
Link: https://arxiv.org/abs/2603.19451
Authors: Myeongkyun Kang, Yanting Yang, Xiaoxiao Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: clinically relevant findings, Fine-grained representation learning, Fine-grained representation, spatially confined, clinically relevant
Comments:
Abstract:Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
121. 【2603.19371】Factored Levenberg-Marquardt for Diffeomorphic Image Registration: An efficient optimizer for FireANTs
Link: https://arxiv.org/abs/2603.19371
Authors: Rohit Jena, Pratik Chaudhari, James C. Gee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Eulerian descent method, test-time optimization problem, Eulerian descent, arbitrary optimizers adapted, behavior with arbitrary
Comments:
Abstract:FireANTs introduced a novel Eulerian descent method for plug-and-play behavior with arbitrary optimizers adapted for diffeomorphic image registration as a test-time optimization problem, with a GPU-accelerated implementation. FireANTs uses Adam as its default optimizer for fast and more robust optimization. However, Adam requires storing state variables (i.e. momentum and squared-momentum estimates), each of which can consume significant memory, prohibiting its use for very large images. In this work, we propose a modified Levenberg-Marquardt (LM) optimizer that requires only a single scalar damping parameter as optimizer state, which is adaptively tuned using a trust region approach. The resulting optimizer reduces memory by up to 24.6% for large volumes while retaining performance across all four datasets. A single hyperparameter configuration tuned on brain MRI transfers without modification to lung CT and cross-modal abdominal registration, matching or outperforming Adam on three of four benchmarks. We also perform ablations on the effectiveness of using a Metropolis-Hastings-style rejection step to prevent updates that worsen the loss function.
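To make the "single scalar damping parameter" idea concrete, here is a textbook Levenberg-Marquardt loop for a one-parameter least-squares fit, with the damping adapted from a trust-region gain ratio. This is not the paper's factored variant; the toy model, the 0.25/0.75 thresholds, and the update factors are assumptions chosen for illustration.

```python
import math


def lm_fit_scalar(residual_fn, jacobian_fn, theta, damping=1.0, n_iters=50):
    """Minimize sum(r_i(theta)^2) over a scalar theta with Levenberg-Marquardt.
    The only optimizer state carried between iterations is `damping`."""
    for _ in range(n_iters):
        r = residual_fn(theta)
        jac = jacobian_fn(theta)
        loss = sum(ri * ri for ri in r)
        grad = sum(ji * ri for ji, ri in zip(jac, r))   # J^T r
        hess = sum(ji * ji for ji in jac)               # J^T J
        step = -grad / (hess + damping)
        if abs(step) < 1e-12:
            break
        # Loss reduction predicted by the linearized model ||r + J*step||^2.
        predicted = -(2.0 * grad * step + hess * step * step)
        if predicted <= 0.0:
            break
        new_loss = sum(ri * ri for ri in residual_fn(theta + step))
        rho = (loss - new_loss) / predicted   # trust-region gain ratio
        if rho > 0.0:
            theta += step                     # accept only if the loss decreased
        if rho > 0.75:
            damping = max(damping / 3.0, 1e-12)
        elif rho < 0.25:
            damping *= 2.0
    return theta, damping


# Toy problem: recover a = 0.5 in y = exp(a * x) from noise-free samples.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]


def residuals(a):
    return [math.exp(a * x) - y for x, y in zip(xs, ys)]


def jacobian(a):
    return [x * math.exp(a * x) for x in xs]


a_hat, final_damping = lm_fit_scalar(residuals, jacobian, theta=0.0)
print(a_hat, final_damping)
```

Unlike Adam, which keeps two running estimates per parameter, this loop carries only one scalar between iterations, and steps that would increase the loss are rejected outright.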
122. 【2603.19364】AURORA: Adaptive Unified Representation for Robust Ultrasound Analysis
Link: https://arxiv.org/abs/2603.19364
Authors: Ufaq Khan, L. D. M. S. Sai Teja, Ayuba Shakiru, Mai A. Shaaban, Yutong Xie, Muhammad Bilal, Muhammad Haris Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: images vary widely, Ultrasound images vary, Foundation Model Challenge, Ultrasound Image Analysis, widely across scanners
Comments:
Abstract:Ultrasound images vary widely across scanners, operators, and anatomical targets, which often causes models trained in one setting to generalize poorly to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis (FMC-UIA) reflects this difficulty by requiring a single model to handle multiple tasks, including segmentation, detection, classification, and landmark regression across diverse organs and datasets. We propose a unified multi-task framework based on a transformer visual encoder from the Qwen3-VL family. Intermediate token features are projected into spatial feature maps and fused using a lightweight multi-scale feature pyramid, enabling both pixel-level predictions and global reasoning within a shared representation. Each task is handled by a small task-specific prediction head, while training uses task-aware sampling and selective loss balancing to manage heterogeneous supervision and reduce task imbalance. Our method is designed to be simple to optimize and adaptable across a wide range of ultrasound analysis tasks. The performance improved from 67% to 85% on the validation set and achieved an average score of 81.84% on the official test set across all tasks. The code is publicly available at: this https URL
123. 【2603.19337】Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity
Link: https://arxiv.org/abs/2603.19337
Authors: Jing Liu,Zhengliang Guo,Yan Wang,Xiaoguang Zhu,Yao Du,Zehua Wang,Victor C. M. Leung
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: degrades global model, global model performance, identically distributed, severely challenged, challenged by non-independent
Comments: Accepted by IEEE ICME 2026
Abstract:Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Our approach leverages multi-layer semantic representations from a pre-trained Stable Diffusion model (including VAE-encoded latents and U-Net hierarchical features) to create a shared latent space that aligns heterogeneous clients, facilitated by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg, validating its effectiveness in learning robust representations for heterogeneous and multimodal data for perception tasks.
124. 【2603.19305】PhyGile: Physics-Prefix Guided Motion Generation for Agile General Humanoid Motion Tracking
Link: https://arxiv.org/abs/2603.19305
Authors: Jiacheng Bao,Haoran Yang,Yucheng Xin,Junhong Liu,Yuecheng Xu,Han Liang,Pengfei Han,Xiaoguang Ma,Dong Wang,Bin Zhao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: expected to execute, motions, expressive whole-body motions, robot-native motion generation, Humanoid robots
Comments:
Abstract:Humanoid robots are expected to execute agile and expressive whole-body motions in real-world settings. Existing text-to-motion generation models are predominantly trained on captured human motion datasets, whose priors assume human biomechanics, actuation, mass distribution, and contact strategies. When such motions are directly retargeted to humanoid robots, the resulting trajectories may satisfy geometric constraints (e.g., joint limits and pose continuity) and appear kinematically reasonable. However, they frequently violate the physical feasibility required for real-world execution. To address these issues, we present PhyGile, a unified framework that closes the loop between robot-native motion generation and General Motion Tracking (GMT). PhyGile performs physics-prefix-guided robot-native motion generation at inference time, directly generating robot-native motions in a 262-dimensional skeletal space with physics-guided prefixes, thereby eliminating inference-time retargeting artifacts and reducing generation-execution discrepancies. Before physics-prefix adaptation, we train the GMT controller with a curriculum-based mixture-of-experts scheme, followed by post-training on unlabeled motion data to improve robustness over large-scale robot motions. During physics-prefix adaptation, the GMT controller is further fine-tuned with generated objectives under physics-derived prefixes, enabling agile and stable execution of complex motions on real robots. Extensive offline and real-robot experiments demonstrate that PhyGile expands the frontier of text-driven humanoid control, enabling stable tracking of agile, highly difficult whole-body motions that go well beyond walking and low-dynamic motions typically achieved by prior methods.
125. 【2603.19272】Transformers are Stateless Differentiable Neural Computers
Link: https://arxiv.org/abs/2603.19272
Authors: Bo Tang,Weiwei Xie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Differentiable Neural Computers, Differentiable Neural Computer, memory supporting differentiable, supporting differentiable read, Differentiable Neural
Comments: 7 pages
Abstract:Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.
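The memory-centric reading in points (1) to (3) can be illustrated with a small sketch. It is not the paper's derivation, only a check of a well-known identity for a single head: reading from an append-only key/value memory one token at a time gives the same outputs as batched causal (masked) attention. All function names here are hypothetical, and learned projections are omitted, so queries, keys, and values are given directly.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query, mem_keys, mem_values):
    """Content-based read: weight stored values by softmax(query . key)."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in mem_keys])
    width = len(mem_values[0])
    return [sum(w * v[i] for w, v in zip(weights, mem_values)) for i in range(width)]


def stepwise_reads(queries, keys, values):
    """Memory view: append one (key, value) row per token, then read."""
    mem_keys, mem_values, outputs = [], [], []
    for q, k, v in zip(queries, keys, values):
        mem_keys.append(k)    # write-once: rows are appended, never edited
        mem_values.append(v)
        outputs.append(memory_read(q, mem_keys, mem_values))
    return outputs


def causal_attention(queries, keys, values):
    """Attention view: position t attends to positions 0..t via a mask."""
    n = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(values[0])
    outputs = []
    for t in range(n):
        scores = [
            dot(queries[t], keys[s]) * scale if s <= t else float("-inf")
            for s in range(n)
        ]
        weights = softmax(scores)  # masked positions get weight 0.0
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
        )
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
memory_view = stepwise_reads(Q, K, V)
attention_view = causal_attention(Q, K, V)
```

The read function keeps no state of its own between calls. Everything that persists lives in the appended key and value rows, which is the sense in which the controller is "stateless" in this sketch.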
126. 【2603.19261】Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging
Link: https://arxiv.org/abs/2603.19261
Authors: Azam Nouri
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: key design choice, character-level BPE serving, including large language, Subword tokenization, modern language models
Comments: 8 pages, 1 figure
Abstract:Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
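A rough sketch of how a z-statistic can rank merge candidates differently from raw frequency is shown below. The abstract does not give the exact statistic or the compression-aware gain term, so this assumes one common form, z = (O - E) / sqrt(E) with E = count(left) * count(right) / N under independence, and leaves the gain term out entirely. The function names and the toy string are illustrative.

```python
import math
from collections import Counter


def pair_stats(tokens):
    """Count adjacent pairs plus left and right marginals over N = len - 1 slots."""
    pairs = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])
    right = Counter(tokens[1:])
    return pairs, left, right, len(tokens) - 1


def z_score(pair, pairs, left, right, n):
    """Assumed form: (observed - expected) / sqrt(expected) under independence."""
    a, b = pair
    expected = left[a] * right[b] / n
    return (pairs[pair] - expected) / math.sqrt(expected)


def best_by_frequency(tokens):
    """Standard BPE criterion: the most frequent adjacent pair."""
    pairs, _, _, _ = pair_stats(tokens)
    return max(pairs, key=lambda p: pairs[p])


def best_by_z(tokens):
    """Alternative criterion: the pair that most exceeds its chance rate."""
    pairs, left, right, n = pair_stats(tokens)
    return max(pairs, key=lambda p: z_score(p, pairs, left, right, n))


# 'a' and 'b' are common and mix freely; 'q' is always followed by 'u'.
tokens = list("aabb" * 5 + "quququ")
pairs, left, right, n = pair_stats(tokens)
freq_pick = best_by_frequency(tokens)
z_pick = best_by_z(tokens)
```

On this toy string the frequency criterion picks a pair made of the two most common symbols, while the z-statistic picks `('q', 'u')`, which is rarer but far more cohesive than chance would predict.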
127. 【2603.19260】HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
Link: https://arxiv.org/abs/2603.19260
Authors: Nada Shahin,Leila Ismail
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Keywords: Sign Language Machine, Language Machine Translation, Language Machine, communication between Deaf, Deaf and hearing
Comments:
Abstract:Sign Language Machine Translation (SLMT) aims to bridge communication between Deaf and hearing individuals. However, its progress is constrained by scarce datasets, limited signer diversity, and large domain gaps between sign motion patterns and pretrained representations. Existing transfer learning approaches in SLMT are static and often lead to overfitting. These challenges call for the development of an adaptive framework that preserves pretrained structure while remaining robust across linguistic and signing variations. To fill this void, we propose a Hierarchical Adaptive Transfer Learning (HATL) framework, where pretrained layers are progressively and dynamically unfrozen based on training performance behavior. HATL combines dynamic unfreezing, layer-wise learning rate decay, and stability mechanisms to preserve generic representations while adapting to sign characteristics. We evaluate HATL on Sign2Text and Sign2Gloss2Text translation tasks using a pretrained ST-GCN++ backbone for feature extraction and a Transformer and an adaptive Transformer (ADAT) for translation. To ensure robust multilingual generalization, we evaluate the proposed approach across three datasets: RWTH-PHOENIX-Weather 2014T (PHOENIX14T), Isharah, and MedASL. Experimental results show that HATL consistently outperforms traditional transfer learning approaches across tasks and models, with ADAT achieving BLEU-4 improvements of 15.0% on PHOENIX14T and Isharah and 37.6% on MedASL.
128. 【2603.20045】Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery
Link: https://arxiv.org/abs/2603.20045
Authors: Jan Emily Mangulabnan,Akshat Chauhan,Laura Fleig,Lalithkumar Seenivasan,Roger D. Soberanis-Mukul,S. Swaroop Vedula,Russell H. Taylor,Masaru Ishii,Gregory D. Hager,Mathias Unberath
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: evolving visual appearance, surgeons continuously locate, prior knowledge, continuously locate, anatomy by interpreting
Comments:
Abstract:In endoscopic surgery, surgeons continuously locate the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in the context of their prior knowledge. Vision-based navigation systems seek to replicate this capability by recovering camera pose directly from endoscopic video, but most approaches do not embody the same principles of reasoning about new frames that make surgeons successful. Instead, they remain grounded in feature matching and geometric optimization over keyframes, an approach that has been shown to degrade under the challenging conditions of endoscopic imaging like low texture and rapid illumination changes. Here, we pursue an alternative approach and investigate a policy-based formulation of endoscopic camera pose recovery that seeks to imitate experts in estimating trajectories conditioned on the previous camera state. Our approach directly predicts short-horizon relative motions without maintaining an explicit geometric representation at inference time. It thus addresses, by design, some of the notorious challenges of geometry-based approaches, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. We evaluate the proposed formulation on cadaveric sinus endoscopy. Under oracle state conditioning, we compare short-horizon motion prediction quality against geometric baselines, achieving the lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change, and the analysis indicates reduced sensitivity to low-texture conditions. These findings suggest that a learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery.
129. 【2603.20024】Layered Quantum Architecture Search for 3D Point Cloud Classification
Link: https://arxiv.org/abs/2603.20024
Authors: Natacha Kuete Meli,Jovita Lukasik,Vladislav Golyanik,Michael Moeller
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:
Comments:
Abstract: (none provided)
130. 【2603.19925】ReconMIL: Synergizing Latent Space Reconstruction with Bi-Stream Mamba for Whole Slide Image Analysis
Link: https://arxiv.org/abs/2603.19925
Authors: Lubin Gan,Jing Zhang,Heng Zhang,Xin Di,Zhifeng Wang,Wenke Huang,Xiaoyan Sun
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: analysis heavily relies, multiple instance learning, slide image, analysis heavily, instance learning
Comments:
Abstract:Whole slide image (WSI) analysis heavily relies on multiple instance learning (MIL). While recent methods benefit from large-scale foundation models and advanced sequence modeling to capture long-range dependencies, they still struggle with two critical issues. First, directly applying frozen, task-agnostic features often leads to suboptimal separability due to the domain gap with specific histological tasks. Second, relying solely on global aggregators can cause over-smoothing, where sparse but critical diagnostic signals are overshadowed by the dominant background context. In this paper, we present ReconMIL, a novel framework designed to bridge this domain gap and balance global-local feature aggregation. Our approach introduces a Latent Space Reconstruction module that adaptively projects generic features into a compact, task-specific manifold, improving boundary delineation. To prevent information dilution, we develop a bi-stream architecture combining a Mamba-based global stream for contextual priors and a CNN-based local stream to preserve subtle morphological anomalies. A scale-adaptive selection mechanism dynamically fuses these two streams, determining when to rely on overall architecture versus local saliency. Evaluations across multiple diagnostic and survival prediction benchmarks show that ReconMIL consistently outperforms current state-of-the-art methods, effectively localizing fine-grained diagnostic regions while suppressing background noise. Visualization results confirm the model's superior ability to localize diagnostic regions by effectively balancing global structure and local granularity.
131. 【2603.19801】Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive
Link: https://arxiv.org/abs/2603.19801
Authors: Robin Spanier,Thorsten Hoeser,John Truckenbrodt,Felix Bachofer,Claudia Kuenzer
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Persian Gulf, North Sea, Gulf, Gulf of Mexico, Earth observation data
Comments: 16 pages, 10 figures, 1 table
Abstract:The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations was created for three major production regions (the North Sea, the Gulf of Mexico, and the Persian Gulf) for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. In 2025, 3,728 offshore platforms were identified: 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018 to 2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlight the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.
132. 【2603.17765】Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search
Link: https://arxiv.org/abs/2603.17765
Authors: Himadri Samanta
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, gained increasing attention, Automated radiology report, Automated radiology, language models
Comments: 15 pages, 4 figures, 3 tables
Abstract:Automated radiology report generation has gained increasing attention with the rise of deep learning and large language models. However, fully generative approaches often suffer from hallucinations and lack clinical grounding, limiting their reliability in real-world workflows. In this study, we propose a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest radiograph impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to ensure factual alignment with historical radiology reports. A curated subset of the MIMIC-CXR dataset was used to construct a multimodal retrieval database. Image embeddings were generated using CLIP encoders, while textual embeddings were derived from structured impression sections. A fusion similarity framework was implemented using FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases were used to construct grounded prompts for draft impression generation, with safety mechanisms enforcing citation coverage and confidence-based refusal. Experimental results demonstrate that multimodal fusion significantly improves retrieval performance compared to image-only retrieval, achieving Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, enabling improved trustworthiness compared to conventional generative approaches. This work highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and radiology workflow augmentation.
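The retrieval step can be sketched in a few lines. The abstract does not state the fusion rule, so this assumes a simple weighted sum of image and text cosine similarities, and it substitutes a brute-force scan for the FAISS index. The names `fused_top_k`, `alpha`, and `min_score` are hypothetical, and the refusal rule (return nothing when the best fused score is low) is only one possible reading of "confidence-based refusal".

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def fused_top_k(query_img, query_txt, cases, k=2, alpha=0.5, min_score=0.3):
    """Rank stored cases by a weighted sum of image and text similarity.

    `cases` is a list of (case_id, image_vector, text_vector) tuples.
    Returns up to k case ids, or [] when the best match is below
    `min_score` (assumed refusal rule, not taken from the paper).
    """
    scored = []
    for case_id, img_vec, txt_vec in cases:
        score = alpha * cosine(query_img, img_vec) + (1 - alpha) * cosine(
            query_txt, txt_vec
        )
        scored.append((score, case_id))
    scored.sort(reverse=True)  # highest fused score first
    top = scored[:k]
    if not top or top[0][0] < min_score:
        return []  # refuse rather than ground a draft on weak matches
    return [case_id for _, case_id in top]


# Toy database with 2-d embeddings standing in for CLIP vectors.
cases = [
    ("c1", [1.0, 0.0], [1.0, 0.0]),
    ("c2", [0.0, 1.0], [1.0, 0.0]),
    ("c3", [0.0, 1.0], [0.0, 1.0]),
]
hits = fused_top_k([1.0, 0.0], [1.0, 0.0], cases)
refused = fused_top_k([-1.0, 0.0], [-1.0, 0.0], cases)
```

The returned case ids are what a drafting step would cite, so an empty list is the signal to decline instead of generating an ungrounded impression.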

