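This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

603 papers were updated today, including:

  • Natural language processing: 106
  • Information retrieval: 16
  • Computer vision: 108

Natural Language Processing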

1. 【2603.11048】COMIC: Agentic Sketch Comedy Generation

Link: https://arxiv.org/abs/2603.11048

Authors: Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)

Keywords: Saturday Night Live, Night Live, Saturday Night, produces short comedic, short comedic videos

Comments: Project page: https://susunghong.github.io/COMIC/

Abstract: We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.

2. 【2603.11039】Instruction set for the representation of graphs

Link: https://arxiv.org/abs/2603.11039

Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)

Keywords: nine-character instruction alphabet, method for representing, graph, IAM Letter LOW, nine-character instruction

Abstract: We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
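The abstract compares encoded graphs through the Levenshtein distance between their strings. As a reminder of what that metric computes, here is a minimal sketch of the standard dynamic-programming algorithm. The function name `levenshtein` and the sample strings are illustrative only; the abstract does not list the nine instruction characters, so ordinary words stand in for IsalGraph strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the stored row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # add ch_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Placeholder strings, not real IsalGraph encodings.
print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.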

3. 【2603.11027】Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Link: https://arxiv.org/abs/2603.11027

Authors: Mingyang Song, Mao Zheng, Chenning Xu

Categories: Computation and Language (cs.CL)

Keywords: high inter-evaluator agreement, high inter-evaluator, reliable and objective, critical assumption, textbf

Abstract: The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.

4. 【2603.11008】A Systematic Study of Pseudo-Relevance Feedback with LLMs

Link: https://arxiv.org/abs/2603.11008

Authors: Nour Jedidi, Jimmy Lin

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: large language models, Pseudo-relevance feedback, key design dimensions, feedback, LLM PRF methods

Abstract: Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from, and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impacts PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.

5. 【2603.10969】TOSSS: a CVE-based Software Security Benchmark for Large Language Models

Link: https://arxiv.org/abs/2603.10969

Authors: Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Software Engineering (cs.SE)

Keywords: Large Language Models, Large Language, increasing capabilities, Large, LLMs

Abstract: With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.
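The abstract fixes only the endpoints of the security score (1 when the secure snippet is always chosen, 0 when the vulnerable one always is). One simple reading consistent with those endpoints is the fraction of two-option trials in which the model picked the secure snippet. The sketch below assumes that reading; the function name `security_score` and the input format are hypothetical, and the benchmark's actual aggregation may differ.

```python
from typing import Sequence


def security_score(picked_secure: Sequence[bool]) -> float:
    """Fraction of trials in which the secure snippet was selected.

    Assumption: each trial is one secure/vulnerable pair and all trials
    are weighted equally. The abstract does not spell this out.
    """
    if not picked_secure:
        raise ValueError("at least one trial is required")
    return sum(picked_secure) / len(picked_secure)


# Made-up outcomes for four trials: three secure picks, one vulnerable pick.
print(security_score([True, True, False, True]))
```

Under this reading, a model that guesses at random between the two options would be expected to score about 0.5.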

6. 【2603.10913】LLM2Vec-Gen: Generative Embeddings from Large Language Models

Link: https://arxiv.org/abs/2603.10913

Authors: Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy

Categories: Computation and Language (cs.CL)

Keywords: embedders typically encode, LLM-based text embedders, text embedders typically, LLM, embedding

Abstract: LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

7. 【2603.10910】GLM-OCR Technical Report

Link: https://arxiv.org/abs/2603.10910

Authors: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang

Categories: Computation and Language (cs.CL)

Keywords: multimodal model designed, real-world document understanding, compact multimodal model, multimodal model, model designed

Abstract: GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.

8. 【2603.10877】From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Link: https://arxiv.org/abs/2603.10877

Authors: Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty

Categories: Computation and Language (cs.CL)

Keywords: compressing large pre-trained, Knowledge distillation, pivotal in compressing, models, Knowledge

Abstract: Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to a 3.4% improvement on language understanding tasks and a 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

9. 【2603.10876】An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Link: https://arxiv.org/abs/2603.10876

Authors: Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: Subject indexing, Integrated Authority File, indexing is vital, vital for discovery, discovery but hard

Comments: 9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract: Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.

10. 【2603.10861】SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

Link: https://arxiv.org/abs/2603.10861

Authors: Nevidu Jayatilleke, Nisansa de Silva, Uthpala Nimanthi, Gagani Kulathilaka, Azra Safrullah, Johan Sofalas

Categories: Computation and Language (cs.CL)

Keywords: Sinhala Diachronic Corpus, comprehensive Sinhala Diachronic, Diachronic Corpus, terms of publication, Sinhala Diachronic

Comments: 23 pages, 13 figures, 10 tables, Accepted paper at the 15th Language Resources and Evaluation Conference (LREC 2026)

Abstract: SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the low-resource language status shared with Faroese and the similar cleaning strategies utilised in CCOHA. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.

11. 【2603.10848】$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Link: https://arxiv.org/abs/2603.10848

Authors: Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Verifiable Rewards, Reinforcement Learning, Learning with Verifiable, reinforce desired behaviors, robust advantage baseline

Abstract: In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation. This balances the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real-time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients, even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of over 10%.
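To make the idea of fusing a prior with a sparse empirical mean concrete, here is a generic inverse-variance blend of the two estimates. It is not the $V_{0.5}$ procedure: the abstract does not give the fusion rule, and the hypothesis test and budget allocation are omitted. The names `fused_baseline`, `prior_var`, and `noise_var` are hypothetical, and both variances are assumed to be known.

```python
def fused_baseline(prior: float, rewards: list[float],
                   prior_var: float, noise_var: float) -> float:
    """Blend a prior value estimate with the mean of a few rollout rewards.

    Assumptions (illustrative, not from the paper):
      prior_var: variance of the prior's error around the true value
      noise_var: per-rollout reward variance
    Each estimate is weighted by the inverse of its variance.
    """
    n = len(rewards)
    if n == 0:
        return prior  # no rollouts yet, so only the prior is available
    sample_mean = sum(rewards) / n
    mean_var = noise_var / n  # variance of the mean of n independent rollouts
    total = prior_var + mean_var
    if total == 0:
        return sample_mean  # both sources claim certainty; defer to the data
    w_prior = mean_var / total  # a noisier sample mean shifts weight to the prior
    return w_prior * prior + (1 - w_prior) * sample_mean


# Group of 4 binary rewards, matching the group size mentioned in the abstract.
print(fused_baseline(0.5, [1.0, 0.0, 1.0, 1.0], prior_var=0.0625, noise_var=0.25))
```

With more rollouts, `mean_var` shrinks and the blend moves toward the empirical mean, which is the qualitative trade-off between sampling variance and prior bias that the abstract describes.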

12. 【2603.10846】Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

Link: https://arxiv.org/abs/2603.10846

Authors: Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan, Weinan Zhang, Ying Wen, Bo Tang, Muning Wen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Deploying Large Language, Large Language Models, emerging Domain-Specific Architectures, Deploying Large, poses significant challenges

Abstract: Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models' correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at this https URL.

13. 【2603.10842】PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

Link: https://arxiv.org/abs/2603.10842

Authors: Yuzhi Liang, Shiliang Xiao, Jingsong Wei, Qiliang Lin, Xia Li

Categories: Computation and Language (cs.CL)

Keywords: Existing hard-label text, vast search spaces, traverse vast search, Existing hard-label, hard-label text attacks

Abstract: Existing hard-label text attacks often rely on inefficient "outside-in" strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient "inside-out" framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets (combinatorial token groups acting as prediction anchors) and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
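The abstract mentions a Multi-Armed Bandit algorithm without naming one. For readers unfamiliar with bandits, the sketch below shows the selection step of UCB1, a standard choice that is swapped in here purely as an example. Treating each arm as a candidate token group is an assumption, and `ucb1_select` is a hypothetical name.

```python
import math


def ucb1_select(counts: list[int], totals: list[float]) -> int:
    """Pick the next arm to pull under UCB1.

    counts[i]: number of times arm i has been pulled
    totals[i]: sum of rewards observed for arm i
    """
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm  # try every arm once before comparing them
    t = sum(counts)
    scores = [
        totals[arm] / counts[arm]                       # average reward so far
        + math.sqrt(2.0 * math.log(t) / counts[arm])    # exploration bonus
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


# Arm 1 has a lower average but has barely been explored, so it is chosen.
print(ucb1_select([100, 1], [60.0, 0.5]))
```

The exploration bonus shrinks as an arm is pulled more often, so queries concentrate on arms that look promising while rarely tried arms still get revisited.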

14. 【2603.10793】Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

Link: https://arxiv.org/abs/2603.10793

Authors: Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali

Categories: Computation and Language (cs.CL)

Keywords: Multilingual Reasoning Gym, Reasoning Gym, Multilingual Reasoning, procedurally generates verifiable, generates verifiable reasoning

Abstract: We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.

15. 【2603.10789】LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

Link: https://arxiv.org/abs/2603.10789

Authors: Nina Hosseini-Kivanani, Fred Philippy

Categories: Computation and Language (cs.CL)

Keywords: analysis of Luxembourgish, RTL articles, present LuxBorrow, borrowing-first analysis, sentence-level language identification

Comments: Paper got accepted to LREC2026

Abstract: We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on-oun and eur-er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
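The abstract reports code-mixing index (CMI) values without restating the formula. A widely used definition is CMI = 100 × (1 − max_i w_i / (n − u)), where w_i is the token count of language i, n the total token count, and u the number of language-independent tokens. The sketch below implements that definition; whether the paper uses exactly this variant is an assumption, and the tag names and the `code_mixing_index` helper are illustrative.

```python
from collections import Counter


def code_mixing_index(tags: list[str], neutral: str = "OTHER") -> float:
    """CMI over per-token language tags, on a 0-100 scale.

    Tokens tagged `neutral` (punctuation, numbers, names, ...) are
    excluded, which corresponds to the (n - u) denominator.
    """
    counts = Counter(tag for tag in tags if tag != neutral)
    language_tokens = sum(counts.values())  # n - u
    if language_tokens == 0:
        return 0.0  # nothing language-specific to measure
    dominant = max(counts.values())  # tokens in the most frequent language
    return 100.0 * (1.0 - dominant / language_tokens)


# Made-up tags: three Luxembourgish tokens, one French token, one neutral token.
print(code_mixing_index(["LU", "LU", "LU", "FR", "OTHER"]))
```

A monolingual text scores 0, and the value rises as tokens are spread more evenly across languages, which is why single-digit medians point to localized insertions rather than balanced bilingual text.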

16. 【2603.10784】Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

Link: https://arxiv.org/abs/2603.10784

Authors: Weihang Huang, Mengna Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: computational approaches operate, opaque classifiers offering, figurative language processing, language processing, judged metaphorical

Abstract: Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols (MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification) as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
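The cross-protocol comparison above is reported as pairwise Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label sequences. The labels are made-up examples rather than data from the paper, and `cohens_kappa` is an illustrative helper.

```python
import math
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return math.nan  # both raters used one identical label; kappa is undefined
    return (p_o - p_e) / (1.0 - p_e)


# Made-up binary decisions (1 = metaphorical) from two hypothetical protocols.
print(cohens_kappa([1, 1, 1, 0], [1, 1, 0, 0]))
```

A kappa near 0, as reported for Protocols A and D, means the two label sets agree no more often than chance would predict, even if their raw agreement rate looks substantial.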

17. 【2603.10775】Large Language Models as Annotators for Machine Translation Quality Estimation

Link: https://arxiv.org/abs/2603.10775

Authors: Sidi Wang, Sophie Arnoult, Amir Kamran

Categories: Computation and Language (cs.CL)

Keywords: Translation Quality Estimation, Machine Translation Quality, Large Language Models, Large Language, Quality Estimation

Comments: 11 pages, 3 figures

Abstract: Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we reckon that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach for the development of a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.

18. 【2603.10771】Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

Link: https://arxiv.org/abs/2603.10771

Authors: Zhipeng Yang, Shu Yang, Lijie Hu, Di Wang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, robustness remain unclear, language models, remain unclear

Abstract: Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.

19. 【2603.10767】mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

Link: https://arxiv.org/abs/2603.10767

Authors: Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement Learning, Verifiable Rewards, Learning with Verifiable, logic problem domains, pretrained large language

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.

20. 【2603.10764】HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

Link: https://arxiv.org/abs/2603.10764

Authors: Shuang Zhou, Kai Yu, Song Wang, Wenya Xie, Zaifu Zhan, Meng-Han Tsai, Yuen-Hei Chung, Shutong Hou, Huixue Zhou, Min Zeng, Bhavadharini Ramu, Lin Yee Chen, Feng Xie, Rui Zhang

Categories: Computation and Language (cs.CL)

Keywords: Heart diseases remain, Heart diseases, trustworthy differential diagnosis, mortality worldwide, necessitating accurate

Comments: 26 pages, 7 figures

Abstract:Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent improved top-3 diagnostic accuracy over established comparative methods by more than 36% and 20%, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.

21. 【2603.10705】Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

Link: https://arxiv.org/abs/2603.10705

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Tianyu Liu, Baolong Bi, Lingrui Mei, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

Categories: Computation and Language (cs.CL)

Keywords: Prompt highlighting steers, prioritize user-specified text, user-specified text spans, Prompt highlighting, large language model

Comments: 21 pages, 14 figures

Abstract:Prompt highlighting steers a large language model to prioritize user-specified text spans during generation. A key challenge is extracting steering directions that capture the difference between relevant and irrelevant contexts, rather than shared structural patterns common to both. We propose PRISM-$\Delta$ (Projection-based Relevance-Informed Steering Method), which decomposes the difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Each attention head receives a continuous softplus importance weight, letting weak-but-useful heads contribute at reduced strength. The framework extends naturally to Value representations, capturing content-channel signal that Key-only methods leave unused. Across four benchmarks and five models, PRISM-$\Delta$ matches or exceeds the best existing method on 19 of 20 configurations, with relative gains up to +10.6%, while halving the fluency cost of steering. PRISM-$\Delta$ also scales to long-context retrieval, outperforming the best existing method by up to +4.8% relative gain. PRISM-$\Delta$ is compatible with FlashAttention and adds negligible memory overhead.

22. 【2603.10697】EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution

Link: https://arxiv.org/abs/2603.10697

Authors: Tianshu Zhang, Kun Qian, Siddhartha Sahai, Yuan Tian, Shaddy Garg, Huan Sun, Yunyao Li

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: translate natural language, natural language questions, schema, schema evolution, translate natural

Comments: Accepted by VLDB 2025

Abstract:Neural text-to-SQL models, which translate natural language questions (NLQs) into SQL queries given a database schema, have achieved remarkable performance. However, database schemas frequently evolve to meet new requirements. Such schema evolution often leads to performance degradation for models trained on static schemas. Existing work either mainly focuses on simply paraphrasing some syntactic or semantic mappings among NLQ, DB and SQL, or lacks a comprehensive and controllable way to investigate model robustness under schema evolution, which is insufficient when facing the increasingly complex and rich database schema changes in reality, especially in the LLM era. To address the challenges posed by schema evolution, we present EvoSchema, a comprehensive benchmark designed to assess and enhance the robustness of text-to-SQL systems under real-world schema changes. EvoSchema introduces a novel schema evolution taxonomy, encompassing ten perturbation types across column-level and table-level modifications, systematically simulating the dynamic nature of database schemas. Through EvoSchema, we conduct an in-depth evaluation spanning different open-source and closed-source LLMs, revealing that table-level perturbations have a significantly greater impact on model performance compared to column-level changes. Furthermore, EvoSchema inspires the development of more resilient text-to-SQL systems, in terms of both model training and database design. Training on EvoSchema's diverse schema designs forces a model to distinguish schema differences for the same question rather than learn spurious patterns, and the resulting models are, on average, markedly more robust than those trained on unperturbed data. This benchmark offers valuable insights into model behavior and a path forward for designing systems capable of thriving in dynamic, real-world environments.

23. 【2603.10677】Emulating Clinician Cognition via Self-Evolving Deep Clinical Research

Link: https://arxiv.org/abs/2603.10677

Authors: Ruiyang Ren, Yuhao Wang, Yunsen Liang, Lan Luo, Jing Liu, Haifeng Wang, Cong Feng, Yinan Zhang, Chunyan Miao, Ji-Rong Wen, Wayne Xin Zhao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: complex cognitive process, continuous expertise accumulation, dynamic cue acquisition, cognitive process, grounded in dynamic

Comments: none

Abstract:Clinical diagnosis is a complex cognitive process, grounded in dynamic cue acquisition and continuous expertise accumulation. Yet most current artificial intelligence (AI) systems are misaligned with this reality, treating diagnosis as single-pass retrospective prediction while lacking auditable mechanisms for governed improvement. We developed DxEvolve, a self-evolving diagnostic agent that bridges these gaps through an interactive deep clinical research workflow. The framework autonomously requisitions examinations and continually externalizes clinical experience from increasing encounter exposure as diagnostic cognition primitives. On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4% on a reader-study subset, comparable to the clinician reference (88.8%). DxEvolve improved accuracy on an independent external cohort by 10.2% (categories covered by the source cohort) and 17.1% (uncovered categories) compared to the competitive method. By transforming experience into a governable learning asset, DxEvolve supports an accountable pathway for the continual evolution of clinical AI.

24. 【2603.10640】Making Bielik LLM Reason (Better): A Field Report

Link: https://arxiv.org/abs/2603.10640

Authors: Adam Trybus, Bartosz Bartnicki, Remigiusz Kinas

Categories: Computation and Language (cs.CL)

Keywords: Polish large language, large language model, research program dedicated, Polish large, language model

Comments: none

Abstract:This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes several stages of work: initial benchmarking and creation of an evaluation methodology, analysis of comparative results with other LLMs, and an outline of future prospects that take into account the limitations of the analyses conducted so far and aim to keep Bielik in the race given the ever-changing and competitive AI landscape.

25. 【2603.10624】Reinforcement Learning with Conditional Expectation Reward

Link: https://arxiv.org/abs/2603.10624

Authors: Changyi Xiao, Caijun Xu, Yixin Cao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Reinforcement Learning, Learning with Verifiable, Verifiable Rewards, Conditional Expectation Reward, mathematics where reliable

Comments: none

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at this https URL.

26. 【2603.10619】Disentangling Similarity and Relatedness in Topic Models

Link: https://arxiv.org/abs/2603.10619

Authors: Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

Categories: Computation and Language (cs.CL)

Keywords: Latent Dirichlet Allocation, integrating pre-trained language, pre-trained language model, large language models, fundamentally reshaping

Comments: 22 pages, 6 figures, 14 tables

Abstract:The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.

27. 【2603.10613】MUNIChus: Multilingual News Image Captioning Benchmark

Link: https://arxiv.org/abs/2603.10613

Authors: Yuji Chen, Alistair Plum, Hansi Hettiarachchi, Diptesh Kanojia, Saroj Basnet, Marcos Zampieri, Tharindu Ranasinghe

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: image captioning, image captioning models, image, highlighting the relationship, visual elements

Comments: Accepted to LREC 2026 (The Fifteenth biennial Language Resources and Evaluation Conference)

Abstract:The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.

28. 【2603.10588】Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

Link: https://arxiv.org/abs/2603.10588

Authors: Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang, Zhiyuan Feng, Yaodong Yang, Xiaoyuan Yi, Xing Xie

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: achieved remarkable success, Reinforcement learning, large language model, approaches remains unclear, remains unclear

Comments: none

Abstract:Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not show the expected significant advantages over reward-maximizing methods on alignment tasks. By mapping high-reward responses into semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.

29. 【2603.10570】End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Link: https://arxiv.org/abs/2603.10570

Authors: Nhi Dang, Tung Le, Huy Tien Nguyen

Categories: Computation and Language (cs.CL)

Keywords: Large language models, retrieval augmented generation, Large language, systems remain prone, language models

Comments: none

Abstract:Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.
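
The confidence-based filtering step mentioned in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Judgment` record, the `split_by_confidence` helper, and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One LLM-judge decision about a chatbot answer (hypothetical record)."""
    question: str
    verdict: str       # e.g. "correct" or "incorrect"
    confidence: float  # judge's reported confidence in [0, 1]


def split_by_confidence(judgments, threshold=0.8):
    """Separate judgments that are trusted automatically from those sent to a human.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    accepted = [j for j in judgments if j.confidence >= threshold]
    needs_review = [j for j in judgments if j.confidence < threshold]
    return accepted, needs_review


judgments = [
    Judgment("Who won the match?", "correct", 0.95),
    Judgment("When was the law passed?", "incorrect", 0.55),
    Judgment("Where is the factory?", "correct", 0.80),
]
accepted, needs_review = split_by_confidence(judgments)
```

Only the low-confidence cases reach a human reviewer, which is one way such a filter could lower review overhead.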

30. 【2603.10547】Automatic End-to-End Data Integration using Large Language Models

Link: https://arxiv.org/abs/2603.10547

Authors: Aaron Steiner, Christian Bizer

Categories: Computation and Language (cs.CL)

Keywords: Designing data integration, typically requires substantial, requires substantial manual, substantial manual effort, Designing data

Comments: 8 pages, 9 tables. Accepted at the Beyond SQL Workshop at ICDE 2026

Abstract:Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of video game, music, and company related data. Our experiments show that the LLM-based pipeline produces results similar to those of the human-designed pipelines, and for some tasks even better ones. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.

31. 【2603.10535】Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

Link: https://arxiv.org/abs/2603.10535

Authors: Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, Xing Yu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: enhances LLM capabilities, models adopt verbosity, Reinforcement learning significantly, significantly enhances LLM, Reinforcement learning

Comments: none

Abstract:Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
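
The contrast between additive length penalties and multiplicative rescaling can be illustrated with a toy example. This is only a sketch under stated assumptions: the penalty coefficient, the `min(1, mean_length / length)` scale factor, and both function names are hypothetical choices for illustration and are not the GR$^3$ formulation from the paper.

```python
def additive_reward(task_reward, length, penalty_per_token=0.001):
    """Length control by subtraction: the penalty applies regardless of task reward."""
    return task_reward - penalty_per_token * length


def multiplicative_reward(task_reward, length, group_mean_length):
    """Length control by rescaling: the task reward is multiplied by a factor in (0, 1].

    Responses no longer than the group's mean length keep their full reward;
    longer ones are scaled down. A zero task reward stays zero at any length.
    """
    scale = min(1.0, group_mean_length / length)
    return task_reward * scale


# A correct but long response versus an incorrect but short one.
correct_long = {"task_reward": 1.0, "length": 1500}
wrong_short = {"task_reward": 0.0, "length": 100}
group_mean_length = 800

add_correct = additive_reward(correct_long["task_reward"], correct_long["length"])
add_wrong = additive_reward(wrong_short["task_reward"], wrong_short["length"])
mul_correct = multiplicative_reward(
    correct_long["task_reward"], correct_long["length"], group_mean_length
)
mul_wrong = multiplicative_reward(
    wrong_short["task_reward"], wrong_short["length"], group_mean_length
)
```

In this toy setting the additive scheme ranks the short wrong answer above the long correct one, which is one reading of the "compensatory effect" the abstract describes, while the multiplicative scheme keeps the correct answer ahead because brevity cannot compensate for a zero task reward.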

32. 【2603.10524】AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

Link: https://arxiv.org/abs/2603.10524

Authors: Dimosthenis Athanasiou, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou

Categories: Computation and Language (cs.CL)

Keywords: reference-grounded response generation, multi-turn retrieval-augmented generation, present the AILS-NTUA, Reciprocal Rank Fusion, reference-grounded response

Comments: none

Abstract:We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.
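
For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the plain version of the technique, in which each document scores the sum of 1 / (k + rank) over the ranked lists that contain it. It does not implement the paper's variance-aware nested variant, whose details the abstract does not give. The constant k = 60 is the value commonly used with RRF, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists in which it
    appears, with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One ranked list per query reformulation, all from the same retriever.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = reciprocal_rank_fusion(rankings)
```

Here `d2` comes out on top because it ranks first or second for every reformulation, which is the behavior that makes RRF a natural fit for fusing results from multiple query rewrites.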

33. 【2603.10521】IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Link: https://arxiv.org/abs/2603.10521

Authors: Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai (Michael Pokorny), Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: LLMs prioritize system, providing a concrete, defines how LLMs, trust-ordered policy, LLMs prioritize

Comments: none

Abstract:Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (this https URL) to support future research on robust instruction hierarchy.

34. 【2603.10505】Safe and Scalable Web Agent Learning via Recreated Websites

Link: https://arxiv.org/abs/2603.10505

Authors: Hyungjoo Chae, Jungsoo Park, Alan Ritter

Categories: Computation and Language (cs.CL)

Keywords: provide verifiable feedback, rarely provide verifiable, hard to reset, fundamentally limited, rarely provide

Comments: none

Abstract:Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at this https URL upon acceptance.

35. 【2603.10494】VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Link: https://arxiv.org/abs/2603.10494

Authors: Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: fragmented EHR evidence, EHR evidence, fragmented EHR, faithful to fragmented, Direct Preference Optimization

Comments: Paper submitted to AMIA 2026 Annual Symposium

Abstract:Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions ("say-less" degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.

36. 【2603.10492】Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

Link: https://arxiv.org/abs/2603.10492

Authors: Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li

Categories: Computation and Language (cs.CL)

Keywords: scientific literature retrieval, domain-tuned large language, large language model, combines a domain-tuned, domain-tuned large

Comments: none

Abstract:We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.

37. 【2603.10477】PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Link: https://arxiv.org/abs/2603.10477

Authors: Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim

Categories: Computation and Language (cs.CL)

Keywords: standard evaluations largely, evaluations largely reduce, primary control interface, largely reduce performance, Engineering Evaluation Metrics

Comments: 24 pages, 2 figures

Abstract:Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.

38. 【2603.10476】Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Link: https://arxiv.org/abs/2603.10476

Authors: Panatchakorn Anantaprayoon, Nataliia Babina, Nima Asgharbeygi, Jad Tarifi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: RLHF and Constitutional, recent work exploring, exploring scalable alternatives, evolving alignment objectives, work exploring scalable

Comments: none

Abstract:The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA), an existing alignment objective introduced to promote the continual expansion of agency, while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.

39. 【2603.10473】Aligning Large Language Models with Searcher Preferences

Link: https://arxiv.org/abs/2603.10473

Authors: Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: paradigm shift, shift from item-centric, answer-centric synthesis, synthesis is redefining, redefining the role

Comments: none

Abstract:The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
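
As a rough illustration of how a gated reward could feed GRPO, the sketch below combines a gate over bottom-line constraints with the group-relative advantage that GRPO is known for. It is not the authors' implementation: the abstract does not specify the Gated Aggregation Strategy, so `gated_reward`, its `fail_reward` value, and the weighted sum are hypothetical, and the use of the population standard deviation is one common choice among several.

```python
from statistics import mean, pstdev


def gated_reward(constraints_ok, behavior_scores, weights, fail_reward=-1.0):
    """Hypothetical gate: any failed bottom-line constraint overrides all other scores."""
    if not all(constraints_ok):
        return fail_reward
    # Only when every constraint passes do the behavior objectives count.
    return sum(w * s for w, s in zip(weights, behavior_scores))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same query (made-up scores for illustration).
rewards = [
    gated_reward([True, True], [1.0, 0.5], [0.5, 0.5]),   # passes the gate
    gated_reward([True, False], [0.9, 0.9], [0.5, 0.5]),  # fails a constraint
    gated_reward([True, True], [0.5, 0.5], [0.5, 0.5]),   # passes the gate
]
advantages = group_relative_advantages(rewards)
```

With a gate of this kind, a response that violates a constraint cannot recover reward through strong behavior scores, which is one way to read "non-negotiable" in the abstract.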

40. 【2603.10367】Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

Link: https://arxiv.org/abs/2603.10367

Authors: Haoxiang Su, Ruiyu Fang, Liting Jiang, Xiaomeng Huang, Shuangyong Song

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: updates user information, multi-turn interactions, strongly tied, records and updates, updates user

Comments: none

Abstract:The performance of task-oriented dialogue models is strongly tied to how well they perform dialogue state tracking (DST), which records and updates user information across multi-turn interactions. However, current multi-domain DST encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle these problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.

41. 【2603.10351】Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

Link: https://arxiv.org/abs/2603.10351

Authors: Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, severe systematic translationese, Large language, systematic translationese bias, translationese bias

Comments: Under Review

Abstract:Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias. In this paper, translationese bias is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into the dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.
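
A cross-covariance penalty of the kind mentioned in the abstract can be sketched as the squared Frobenius norm of the cross-covariance matrix between two batches of representations. This is a generic formulation assumed for illustration, not DIBJudge's actual loss; the function name and the list-of-lists inputs are hypothetical.

```python
def cross_covariance_penalty(robust, bias):
    """Squared Frobenius norm of the cross-covariance between two batches.

    robust: list of n vectors (the "robust" representation of each example)
    bias:   list of n vectors (the "bias" representation of each example)
    """
    n = len(robust)
    d_r, d_b = len(robust[0]), len(bias[0])
    # Per-dimension batch means, used to center both representations.
    mean_r = [sum(row[i] for row in robust) / n for i in range(d_r)]
    mean_b = [sum(row[j] for row in bias) / n for j in range(d_b)]
    penalty = 0.0
    for i in range(d_r):
        for j in range(d_b):
            cov = sum(
                (robust[k][i] - mean_r[i]) * (bias[k][j] - mean_b[j])
                for k in range(n)
            ) / n
            penalty += cov * cov
    return penalty


# Uncorrelated toy features give zero penalty; identical features do not.
robust = [[1.0], [-1.0], [1.0], [-1.0]]
bias = [[1.0], [1.0], [-1.0], [-1.0]]
uncorrelated = cross_covariance_penalty(robust, bias)
identical = cross_covariance_penalty(robust, robust)
```

Note that a zero cross-covariance only rules out linear dependence between the two branches, so a penalty like this encourages disentanglement rather than guaranteeing it.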

42. 【2603.10313】Large language models can disambiguate opioid slang on social media

Link: https://arxiv.org/abs/2603.10313

Authors: Kristy A. Carpenter, Issah A. Samori, Mathew V. Kiang, Keith Humphreys, Anna Lembke, Johannes C. Eichstaedt, Russ B. Altman

Categories: Computation and Language (cs.CL)

Keywords: Social media text, Social media, opioid overdose crisis, media text, leveraging social media

Comments: none

Abstract:Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as "smack" or "blues," have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores ("fenty" subtask: 0.824-0.972; "smack" subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.

43. 【2603.10303】Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

Link: https://arxiv.org/abs/2603.10303

Authors: Tim Schopf, Michael Färber

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reiterate minor variations, ensuring contributions meaningfully, contributions meaningfully extend, meaningfully extend existing, extend existing knowledge

Comments: Accepted to LREC 2026

Abstract:Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: this https URL.

44. 【2603.10243】GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Link: https://arxiv.org/abs/2603.10243

Authors: Zhouxiang Fang, Jiawei Zhou, Hanjie Chen

Categories: Computation and Language (cs.CL)

Keywords: Recent studies show, seemingly non-adversarial fine-tuning, Recent studies, preserve safety alignment, safety alignment

Comments: none

Abstract:Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at this https URL.

45. 【2603.10233】S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Link: https://arxiv.org/abs/2603.10233

Authors: Tasfia Seuti, Sagnik Ray Choudhury

Categories: Computation and Language (cs.CL)

Keywords: Evaluating student responses, Automatic Short Answer, short factual answers, educational NLP, Short Answer Grading

Comments: LREC 2026 Accepted, [this https URL](https://sgrades.eng.unt.edu/)

Abstract:Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

46. 【2603.10213】Sabiá-4 Technical Report

Link: https://arxiv.org/abs/2603.10213

Authors: Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau, Celio Larcher, Ramon Pires, Rodrigo Nogueira

Categories: Computation and Language (cs.CL)

Keywords: Brazilian Portuguese language, technical report presents, Portuguese language models, Portuguese language, Brazilian Portuguese

Comments: none

Abstract:This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.

47. 【2603.10211】ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

Link: https://arxiv.org/abs/2603.10211

Authors: Khoa Anh Ta, Nguyen Van Dinh, Kiet Van Nguyen

Categories: Computation and Language (cs.CL)

Keywords: extensive dialectal variation, Vietnamese exhibits extensive, systems trained predominantly, exhibits extensive dialectal, posing challenges

Comments: Accepted to AAAI-26 (Oral)

Abstract:Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.

48. 【2603.10195】Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Link: https://arxiv.org/abs/2603.10195

Authors: Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang, Judith L. Mwakalonge

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, factually incorrect text, Large Language, frequently generate fluent, Language Models frequently

Comments: 19 pages, 8 figures, 23 tables

Abstract:Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation -- requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline -- demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.

49. 【2603.10178】Video-Based Reward Modeling for Computer-Use Agents

Link: https://arxiv.org/abs/2603.10178

Authors: Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Computer-using agents, Execution Video Reward, increasingly capable, execution video, remains difficult

Comments: none

Abstract:Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.

50. 【2603.10165】OpenClaw-RL: Train Any Agent Simply by Talking

Link: https://arxiv.org/abs/2603.10165

Authors: Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

Categories: Computation and Language (cs.CL)

Keywords: online learning source, GUI state change, tool output, online learning, learning source

Comments: Code: [this https URL](https://github.com/Gen-Verse/OpenClaw-RL)

Abstract:Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: this https URL

51. 【2603.10160】ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

Link: https://arxiv.org/abs/2603.10160

Authors: Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: trainable low-rank matrices, injects trainable low-rank, routing weights, Low-rank adapters, trainable low-rank

Comments: LLA @ ICLR 2026

Abstract:Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, where only one or two LoRAs often dominate the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights to ensure that all active LoRAs are equally effective, with no LoRA dominating the routing weights. However, our routers cannot be trained directly via gradient descent due to our non-learnable routing weights. Hence, we further propose an unbiased gradient estimator for the router by employing the reinforce leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also makes it possible to scale up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
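
The reinforce leave-one-out (RLOO) baseline named in the abstract can be sketched in a few lines: each sampled routing decision is compared against the mean reward of the other samples drawn for the same input. This is a minimal illustration of the general technique, not ReMix's code. The function name is hypothetical, and treating the reward as the negative supervision loss is an assumption about sign convention.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: reward_i minus the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per input")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Made-up rewards for three routing samples on one input,
# e.g. the negative supervision loss of each sample (assumed convention).
advantages = rloo_advantages([1.0, 2.0, 3.0])
# → [-1.5, 0.0, 1.5]
```

In a REINFORCE-style update, each advantage would scale the gradient of the log-probability of the corresponding routing sample. Because a sample's baseline never includes its own reward, subtracting it leaves the estimator unbiased while reducing variance.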

52. 【2603.10145】Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Link: https://arxiv.org/abs/2603.10145

Authors: Nathan Godey, Yoav Artzi

Categories: Computation and Language (cs.CL)

Keywords: projects output features, neural language models, neural language, features of dimension, logits in dimension

Comments: none

Abstract:The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
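
The compression described in the abstract can be seen in a toy setting. If the logits are $z = Wh$ with $W \in \mathbb{R}^{V \times D}$, the gradient reaching the features is $W^\top g$, so only the part of the logit gradient $g$ that lies in the $D$-dimensional column space of $W$ survives. The sketch below measures that surviving fraction for random data; the sizes, the random matrices, and the function names are all assumptions for illustration and say nothing about the paper's measured 95-99% figure.

```python
import math
import random


def orthonormal_basis(vectors):
    """Modified Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(w, q))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > 1e-12:  # skip vectors that are (numerically) dependent
            basis.append([a / norm for a in w])
    return basis


def retained_norm_fraction(columns, grad):
    """Fraction of ||grad|| that lies in the span of the columns of W."""
    basis = orthonormal_basis(columns)
    projected_sq = sum(sum(a * b for a, b in zip(q, grad)) ** 2 for q in basis)
    total_sq = sum(a * a for a in grad)
    return math.sqrt(projected_sq / total_sq)


rng = random.Random(0)
V, D = 512, 8  # toy vocabulary size and feature dimension, with D much smaller than V
columns = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(D)]  # columns of W
grad = [rng.gauss(0.0, 1.0) for _ in range(V)]  # stand-in for the logit gradient
fraction = retained_norm_fraction(columns, grad)
```

For a random gradient the retained fraction is on the order of $\sqrt{D/V}$, whereas a gradient that already lies in the column space passes through unchanged.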

53. 【2603.10143】Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2603.10143

Authors: Eeham Khan, Luis Rodriguez, Marc Queudot

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, factuality of Large, Language Models, mediate reasoning

Comments: Accepted to Canadian AI 2026

Abstract:Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.

54. 【2603.10139】The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

Link: https://arxiv.org/abs/2603.10139

Authors: Romain Peyrichou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL)

Keywords: generate strings, formal grammar defines, Generation, grammar induction, grammar

Comments: Submitted to Information and Computation. 32 pages, 6 figures, 4 tables

Abstract:Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.

55. 【2603.10130】The Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments

Link: https://arxiv.org/abs/2603.10130

Authors: Hubert Plisiecki

Categories: Computation and Language (cs.CL)

Keywords: computational social science, enabling scalable measurement, science and psychology, enabling scalable, Text embeddings

Comments: none

Abstract:Text embeddings have become central to computational social science and psychology, enabling scalable measurement of meaning and mixed-method inference. Yet most representation learning is optimized and evaluated for prediction and retrieval, yielding a prediction-measurement gap: representations that perform well as features may be poorly suited as scientific instruments. The paper argues that scientific meaning analysis motivates a distinct family of objectives - scientific usability - emphasizing geometric legibility, interpretability and traceability to linguistic evidence, robustness to non-semantic confounds, and compatibility with regression-style inference over semantic directions. Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer representations against these requirements: static spaces remain attractive for transparent measurement, whereas contextual spaces offer richer semantics but entangle meaning with other signals and exhibit geometric and interpretability issues that complicate inference. The paper then outlines a course-setting agenda around (i) geometry-first design for gradients and abstraction, including hierarchy-aware spaces constrained by psychologically privileged levels; (ii) invertible post-hoc transformations that recondition embedding geometry and reduce nuisance influence; and (iii) meaning atlases and measurement-oriented evaluation protocols for reliable and traceable semantic inference. As the field debates the limits of scale-first progress, measurement-ready representations offer a principled new frontier.

56. 【2603.10123】Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias

Link: https://arxiv.org/abs/2603.10123

Authors: Borun D Chowdhury

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: U-shaped performance curve, learned Softmax artifacts, Middle phenomenon, U-shaped performance, learned Softmax

Comments: 11 pages, 7 figures

Abstract:The "Lost in the Middle" phenomenon -- a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle -- is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: the U-shape is already present at initialization, before any training or positional encoding takes effect. It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Cesàro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated $\mathcal{O}(1)$ anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order $\mathcal{O}(1/(H-1)!)$, where $H$ is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step 0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.
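
As a rough illustration of the modeling idea in the abstract above, the sketch below builds the Cesàro (running-average) matrix, which is what causal attention reduces to if every token attends uniformly to its prefix, and composes several toy layers. The uniform-attention assumption, the equal 0.5/0.5 mix of residual and attention paths, and the sizes `n=64` and `depth=4` are assumptions of this sketch, not the paper's exact formulation.

```python
def cesaro_matrix(n):
    """Lower-triangular averaging matrix: row i puts weight 1/(i+1) on positions 0..i."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square matrix product, kept dependency-free for the sketch."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def last_token_influence(n, depth):
    """Weights with which each input position reaches the final token after `depth` toy layers."""
    c = cesaro_matrix(n)
    # One toy layer (assumption): half residual path (identity), half uniform causal attention.
    layer = [[0.5 * c[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    total = layer
    for _ in range(depth - 1):
        total = matmul(total, layer)
    return total[-1]


influence = last_token_influence(n=64, depth=4)
first, middle, last = influence[0], influence[32], influence[-1]
```

In this toy setting the final token receives more weight from the first and last positions than from the middle position, which is qualitatively the U-shape the abstract describes.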

57. 【2603.10101】CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Link: https://arxiv.org/abs/2603.10101

Authors: Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Verifiable Rewards, capacity of Large, Reinforcement Learning

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at this https URL.

58. 【2603.10071】Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models

Link: https://arxiv.org/abs/2603.10071

Authors: Anurag Mishra

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Time series foundation, series foundation models, representations remain opaque, internal representations remain, Time series

Comments: Accepted as a poster in ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM)

Abstract:Time series foundation models (TSFMs) are increasingly deployed in high-stakes domains, yet their internal representations remain opaque. We present the first application of sparse autoencoders (SAEs) to a TSFM, training TopK SAEs on activations of Chronos-T5-Large (710M parameters) across six layers. Through 392 single-feature ablation experiments, we establish that every ablated feature produces a positive CRPS degradation, confirming causal relevance. Our analysis reveals a depth-dependent hierarchy: early encoder layers encode low-level frequency features, the mid-encoder concentrates causally critical change-detection features, and the final encoder compresses a rich but less causally important taxonomy of temporal concepts. The most critical features reside in the mid-encoder (max single-feature Delta CRPS = 38.61), not in the semantically richest final encoder layer, where progressive ablation paradoxically improves forecast quality. These findings demonstrate that mechanistic interpretability transfers effectively to TSFMs and that Chronos-T5 relies on abrupt-dynamics detection rather than periodic pattern recognition.

59. 【2603.10069】Improving Search Agent with One Line of Code

Link: https://arxiv.org/abs/2603.10069

Authors: Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu, Chengjie Wang, Xiaotong Yuan, Yabiao Wang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Tool-based Agentic Reinforcement, Agentic Reinforcement Learning, Tool-based Agentic, Reinforcement Learning, Agentic Reinforcement

Abstract:Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact with external tools for a multi-turn information-seeking process autonomously. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).

60. 【2603.10068】ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Link: https://arxiv.org/abs/2603.10068

Authors: Harry Owiredu-Ashley

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: safety assess single, safety properties evolve, report binary pass, assess single prompts, binary pass

Comments: 12 pages, 12 figures. Independent research. Code and artifacts: [this https URL](https://github.com/Harry-Ashley/adversa-guardrail-degradation)

Abstract:Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA) that eliminates the attacker-side safety refusals that render off-the-shelf models unreliable as attackers, scoring victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode in fine-tuned attackers deployed out of their training distribution, and attacker refusals as a previously-underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.

Related DOI: https://doi.org/10.5281/zenodo.18917553

61. 【2603.10060】Tool Receipts, Not Zero-Knowledge Proofs: Practical Hallucination Detection for AI Agents

Link: https://arxiv.org/abs/2603.10060

Authors: Abhinaba Basu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: frequently hallucinate results, calls frequently hallucinate, tool calls frequently, hallucinate results, calls frequently

Abstract:AI agents that execute tasks via tool calls frequently hallucinate results - fabricating tool executions, misstating output counts, or presenting inferences as facts. Recent approaches to verifiable AI inference rely on zero-knowledge proofs, which provide cryptographic guarantees but impose minutes of proving time per query, making them impractical for interactive agents. We propose NabaOS, a lightweight verification framework inspired by Indian epistemology (Nyaya Shastra), which classifies every claim in an LLM response by its epistemic source (pramana): direct tool output (pratyaksha), inference (anumana), external testimony (shabda), absence (abhava), or ungrounded opinion. Our runtime generates HMAC-signed tool execution receipts that the LLM cannot forge, then cross-references claims against these receipts to detect hallucinations in real time. We evaluate on NyayaVerifyBench, a new benchmark of 1,800 agent response scenarios across four languages with injected hallucinations of six types. NabaOS detects 94.2% of fabricated tool references, 87.6% of count misstatements, and 91.3% of false absence claims, with 15ms verification overhead per response. For deep delegation (agents performing multi-step web tasks), our cross-checking protocol catches 78.4% of URL fabrications via independent re-fetching. We compare against five approaches: zkLLM (cryptographic proofs, 180s/query), TOPLOC (locality-sensitive hashing), SPEX (sampling-based proof of execution), tensor commitments, and self-consistency checking. NabaOS achieves the best cost-latency-coverage trade-off for interactive agents: 94.2% coverage at 15ms versus zkLLM's near-perfect coverage at 180,000ms. For interactive agents, practical receipt-based verification provides better cost-benefit than cryptographic proofs, and epistemic classification gives users actionable trust signals rather than binary judgments.
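
The receipt idea in the abstract above can be sketched with Python's standard `hmac` module. This is a minimal illustration of HMAC signing and verification in general, not NabaOS's actual receipt format: the key, the JSON payload layout, and the function names `sign_receipt` and `verify_receipt` are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical secret held only by the runtime and never shown to the LLM.
SECRET_KEY = b"runtime-only-secret"


def sign_receipt(tool_name: str, output: str, key: bytes = SECRET_KEY) -> dict:
    """Serialize a tool result and attach an HMAC-SHA256 tag over it."""
    payload = json.dumps({"tool": tool_name, "output": output}, sort_keys=True)
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_receipt(receipt: dict, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, receipt["payload"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["tag"])


genuine = sign_receipt("search", "3 results")
# A claim that misstates the count reuses the old tag, so verification fails.
forged = {
    "payload": genuine["payload"].replace("3 results", "30 results"),
    "tag": genuine["tag"],
}
```

Here `verify_receipt(genuine)` succeeds while `verify_receipt(forged)` fails, because the tag cannot be recomputed for the altered payload without the key.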

62. 【2603.10055】Training Language Models via Neural Cellular Automata

Link: https://arxiv.org/abs/2603.10055

Authors: Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: capabilities are acquired, crucial for large, representations and capabilities, natural language, language

Comments: Website: [this https URL](https://hanseungwook.github.io/blog/nca-pre-pre-training/)

Abstract:Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs--training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.

63. 【2603.10044】Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

Link: https://arxiv.org/abs/2603.10044

Authors: David Gringras

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: production deployments wrap, evaluate language models, critic agents, benchmarks evaluate language, reasoning traces

Comments: 74 pages including appendices. 6 frontier models, 62,808 primary observations (~89k total). Pre-registered: OSF DOI [https://doi.org/10.17605/OSF.IO/CJW92](https://doi.org/10.17605/OSF.IO/CJW92). Code and data: [this https URL](https://github.com/davidgringras/safety-under-scaffolding)

Abstract:Safety benchmarks evaluate language models in isolation, typically using multiple-choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre-registration, assessor blinding, equivalence testing, and specification curve analysis. Map-reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map-reduce degradation revealed a deeper measurement problem: switching from multiple-choice to open-ended format on identical items shifts safety scores by 5-20 percentage points, larger than any scaffold effect. Within-format scaffold comparisons are consistent with practical equivalence under our pre-registered +/-2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model x scaffold interactions span 35 pp in opposing directions (one model degrades by -16.8 pp on sycophancy under map-reduce while another improves by +18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non-zero reliability, making per-model, per-configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.
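
For readers unfamiliar with the NNH figure quoted above: in its usual epidemiological sense, the number needed to harm is the reciprocal of the absolute risk increase. The abstract does not state how the underlying event rates are defined, so the sketch below only shows the generic arithmetic. The function names and the 10%/20% example rates are illustrative assumptions, not values from the paper.

```python
def number_needed_to_harm(baseline_rate: float, exposed_rate: float) -> float:
    """NNH = 1 / (exposed_rate - baseline_rate), defined only for a positive increase."""
    increase = exposed_rate - baseline_rate
    if increase <= 0:
        raise ValueError("NNH is only defined when the exposed rate exceeds the baseline")
    return 1.0 / increase


def implied_risk_increase(nnh: float) -> float:
    """Invert the definition: the absolute risk increase implied by a given NNH."""
    return 1.0 / nnh


# Illustrative rates only: a rise from 10% to 20% gives an NNH of about 10.
example_nnh = number_needed_to_harm(0.10, 0.20)

# Under the standard definition, NNH = 14 corresponds to roughly a
# 7.1 percentage-point absolute increase.
implied = implied_risk_increase(14)
```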

64. 【2603.10035】TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

Link: https://arxiv.org/abs/2603.10035

Authors: Dipankar Srirag, Quoc Dung Nguyen, Aditya Joshi, Padmanesan Narasimhan, Salil Kanhere

Categories: Computation and Language (cs.CL)

Keywords: electronic health records, Research in emergency, structured electronic health, due to regulatory, electronic health

Comments: 6 pages, 3 figures, 2 tables

Abstract:Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.

65. 【2603.10034】A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

Link: https://arxiv.org/abs/2603.10034

Authors: Jiyue Jiang, Yanyu Chen, Pengan Chen, Kai Liu, Jingqi Zhou, Zheyong Zhu, He Hu, Fei Ma, Qi Tian, Chuan Wu

Categories: Computation and Language (cs.CL)

Keywords: major public health, public health challenge, Cognitive Stimulation, Cognitive Stimulation Therapy, cognitive stimulation dialogue

Comments: Accepted by AAAI 2026

Abstract:Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.

66. 【2603.10033】Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

Link: https://arxiv.org/abs/2603.10033

Authors: Xingtong Yu, Shenghua Ye, Ruijuan Liang, Chang Zhou, Hong Cheng, Xinming Zhang, Yuan Fang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Graph foundation models, acquire transferable knowledge, foundation models, aim to acquire, acquire transferable

Abstract:Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at this https URL.

67. 【2603.10012】Measuring and Eliminating Refusals in Military Large Language Models

Link: https://arxiv.org/abs/2603.10012

Authors: Jack FitzGerald, Dylan Bates, Aristotelis Lazaridis, Aman Sharma, Vincent Lu, Brian King, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Joseph Madigan, Jeremy McLaurin, Luke Kerbs, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Military Large Language, Large Language Models, Large Language, provide accurate information, Military Large

Comments: 30 pages

Abstract:Military Large Language Models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today's LLMs are imbued with safety behaviors that cause the LLM to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, which was developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present results for refusal and deflection rates on 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration using the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including with mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.

68. 【2603.10011】Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

Link: https://arxiv.org/abs/2603.10011

Authors: Anna Soligo, Vladimir Mikulik, William Saunders

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, reliability and safety, raises concerns, Gemma

Abstract:Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.

69. 【2603.10010】FERRET: Framework for Expansion Reliant Red Teaming

Link: https://arxiv.org/abs/2603.10010

Authors: Ninareh Mehrabi, Vitor Albiero, Maya Pavlova, Joanna Bitton

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: red teaming, automated red teaming, red team model, multi-faceted automated red, break a target

Abstract:We introduce a multi-faceted automated red teaming framework in which the goal is to generate multi-modal adversarial conversations that would break a target model and introduce various expansions that would result in more effective and efficient adversarial conversations. The introduced expansions include: 1. Horizontal expansion in which the goal is for the red team model to self-improve and generate more effective conversation starters that would shape a conversation. 2. Vertical expansion in which the goal is to take these conversation starters that are discovered in the horizontal expansion phase and expand them into effective multi-modal conversations and 3. Meta expansion in which the goal is for the red team model to discover more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with various existing automated red teaming approaches. In our experiments, we demonstrate the effectiveness of FERRET in generating effective multi-modal adversarial conversations and its superior performance against existing state of the art approaches.

70. 【2603.10009】Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Link: https://arxiv.org/abs/2603.10009

Authors: Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, standard post-training methods, Group Relative Policy, Human Feedback

Abstract:Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.
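
To make the contrast in the abstract above concrete, the sketch below compares two ways of normalizing a reward into an advantage: against the concurrent batch (as in standard group-relative normalization) and against a stored per-group reward history. It is a simplified illustration under stated assumptions, not the authors' P-GRPO implementation. The class `GroupRewardHistory`, the group labels, the reward values, and the fallback of returning 0.0 when a group has fewer than two past rewards are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

EPS = 1e-8  # avoids division by zero when all rewards are identical


def batch_advantages(rewards):
    """Group-relative normalization: z-score each reward within the concurrent batch."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


class GroupRewardHistory:
    """Hypothetical store of past rewards, kept separately per preference group."""

    def __init__(self):
        self._history = defaultdict(list)

    def record(self, group, reward):
        self._history[group].append(reward)

    def advantage(self, group, reward):
        """Z-score a reward against that group's own history (0.0 if too little history)."""
        past = self._history[group]
        if len(past) < 2:
            return 0.0
        return (reward - mean(past)) / (pstdev(past) + EPS)


history = GroupRewardHistory()
for r in (0.2, 0.4):
    history.record("concise", r)    # illustrative group with low typical rewards
for r in (0.8, 1.0):
    history.record("detailed", r)   # illustrative group with high typical rewards

# The same raw reward is above average for one group and below average for the other.
concise_adv = history.advantage("concise", 0.5)
detailed_adv = history.advantage("detailed", 0.5)
mixed_batch = batch_advantages([0.2, 0.4, 0.8, 1.0])
```

With these made-up numbers, a reward of 0.5 yields a positive advantage for the "concise" group and a negative one for the "detailed" group, whereas the mixed-batch normalization scores every sample against a single pooled mean.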

71. 【2603.10008】GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification

Link: https://arxiv.org/abs/2603.10008

Authors: Ahmed Khaled Khamis

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: paper presents system, presents system description, distinct categories, medical text classification, paper presents

Comments: 5 pages, 2 figures, EACL26, AbjadNLP

Abstract:This paper presents a system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategy, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.

72. 【2603.10007】GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

Link: https://arxiv.org/abs/2603.10007

Authors: Ahmed Khaled Khamis

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: detecting AI-generated Arabic, AbjadGenEval shared task, AI-generated Arabic text, AI-generated Arabic, present our approach

Comments: 5 pages, 1 figure, EACL26, AbjadNLP

Abstract:We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several pooling strategies to pool token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
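
Mean pooling, the baseline that performed best in the abstract above, averages the token vectors of a sequence while skipping padding. The sketch below is a dependency-free illustration of that operation. The function name `mean_pool` and the tiny example vectors are assumptions for illustration, and the paper's own implementation (presumably on framework tensors) is not shown in the abstract.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding token (illustrative values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # [2.0, 3.0]
```

The operation has no trainable parameters, which is consistent with the abstract's explanation that more complex pooling heads add parameters that need more data to train.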

73. 【2603.10006】Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

链接https://arxiv.org/abs/2603.10006

作者:Hokky Situngkir,Kevin Siringoringo,Andhika Bernard Lumbantobing

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:study presents TOBA-LM, corpus encompassing Indonesian, presents TOBA-LM, encompassing Indonesian, Minangkabau using syllabic-agglutinative

备注: 8 pages, 5 figures

点击查看摘要

Abstract:This study presents TOBA-LM, a trilingual language model based on GPT-2 architecture with 1.2 billion parameters, trained on a corpus encompassing Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results demonstrate a training efficiency of 80%, with the loss value dropping from 6.4 to 1.7996 in only 12,973 steps -- significantly faster than the conventional transformer architecture, which required over 70,000 steps to achieve comparable convergence. These findings confirm that the integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.

74. 【2603.10005】SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

链接https://arxiv.org/abs/2603.10005

作者:Youness Dkhissi(LIUM),Valentin Vielzeuf,Elys Allesiardo,Anthony Larcher(LIUM)

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Automatic Speech Recognition, Speech Recognition, Automatic Speech, applications require streaming, require streaming processing

备注

点击查看摘要

Abstract:Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially under low-latency constraints. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate in small-chunk streaming scenarios.

75. 【2603.10004】Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

链接https://arxiv.org/abs/2603.10004

作者:Isotta Landi,Eugenia Alleva,Nicole Bussola,Rebecca M. Cohen,Sarah Nowlin,Leslee J. Shaw,Alexander W. Charney,Kimberly B. Glazer

类目:Computation and Language (cs.CL)

关键词:emotionally charged language, Mount Sinai Hospital, emotionally charged, charged language, Clinical documentation

备注

点击查看摘要

Abstract:Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts.

Submission history: v1, Mon, 16 Feb 2026 22:39:28 UTC (600 KB), from Isotta Landi.

76. 【2603.10003】Probing the Limits of the Lie Detector Approach to LLM Deception

链接https://arxiv.org/abs/2603.10003

作者:Tom-Felix Berger

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:identify internal representations, large language models, large language, identify internal, lie detector approach

备注

点击查看摘要

Abstract:Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception.

77. 【2603.10002】SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

链接https://arxiv.org/abs/2603.10002

作者:Srivatsa Kundurthy,Clara Na,Michael Handley,Zach Kirshner,Chen Bo Calvin Zhang,Manasi Sharma,Emma Strubell,John Ling

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, manipulating structured artifacts, Large language, language models, increasingly tasked

备注: 30 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at this https URL.

78. 【2603.10001】Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

链接https://arxiv.org/abs/2603.10001

作者:Yannis Karmim(ALMAnaCH),Renato Pino(UCHILE),Hernan Contreras(UCHILE),Hernan Lira,Sebastian Cifuentes(CENIA),Simon Escoffier(PUC),Luis Martí,Djamé Seddah(UP4, ALPAGE),Valentin Barrière(UCHILE, CENIA)

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Global North data, exhibit inequalities, inequalities with respect

备注

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/A) pairs, based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish and Portuguese, in turn translated to English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance across Latam countries, some being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam culture.

79. 【2603.10000】Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

链接https://arxiv.org/abs/2603.10000

作者:Yuling Jiao,Yanming Lai,Huazhen Lin,Wensen Ma,Houduo Qi,Defeng Sun

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:exhibiting emergent properties, demonstrated remarkable proficiency, Large Language Models, Large Language, exhibiting emergent

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.

80. 【2603.09999】A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance

链接https://arxiv.org/abs/2603.09999

作者:Gabriele Immordino,Andrea Vaiuso,Marcello Righi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词:unmanned aircraft systems, Pre-defined Risk Assessment, Operations Risk Assessment, Specific Operations Risk, supports safety assessment

备注

点击查看摘要

Abstract:This paper presents the design and validation of a retrieval-based assistant that supports safety assessment, certification activities, and regulatory compliance for unmanned aircraft systems. The work is motivated by the growing complexity of drone operations and the increasing effort required by applicants and aviation authorities to apply established assessment frameworks, including the Specific Operations Risk Assessment and the Pre-defined Risk Assessment, in a consistent and efficient manner. The proposed approach uses a controlled text-based architecture that relies exclusively on authoritative regulatory sources. To enable traceable and auditable outputs, the assistant grounds each response in retrieved passages and enforces citation-driven generation. System-level controls address common failure modes of generative models, including fabricated statements, unsupported inferences, and unclear provenance, by separating evidence storage from language generation and by adopting conservative behavior when supporting documentation is insufficient. The assistant is intentionally limited to decision support; it does not replace expert judgment and it does not make autonomous determinations. Instead, it accelerates context-specific information retrieval and synthesis to improve document preparation and review while preserving human responsibility for critical conclusions. The architecture is implemented using established open-source components, and key choices in retrieval strategy, interaction constraints, and response policies are evaluated for suitability in safety-sensitive regulatory environments. The paper provides technical and operational guidance for integrating retrieval-based assistants into aviation oversight workflows while maintaining accountability, traceability, and regulatory compliance.

81. 【2603.09998】Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

链接https://arxiv.org/abs/2603.09998

作者:Yue Zhang,Rodney Beard,John Hawkins,Rohitash Chandra

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, limited systematic assessment, Language Models, limited systematic

备注

点击查看摘要

Abstract:Although Large Language Models (LLMs) have exceptional performance in machine translation, only a limited systematic assessment of translation quality has been done. The challenge lies in automated frameworks, as human-expert-based evaluations can be time-consuming, given the fast-evolving LLMs and the need for a diverse set of texts to ensure fair assessments of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs and further evaluate them by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but show divergence in their performance when applied to literary texts. Although GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, the subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remains an open problem for all the models.

82. 【2603.09997】Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

链接https://arxiv.org/abs/2603.09997

作者:Michael Keeman,Anastasia Keeman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:claiming newer models, claiming newer, OpenAI deprecated, safety, cs.CL

备注: 17 pages, 7 figures. First empirical measurement of the #keep4o phenomenon using clinical psychological safety frameworks. Compares GPT-4o, o4-mini, and GPT-5-mini on empathy, crisis detection, and advice safety dimensions

点击查看摘要

Abstract:When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis -- a novel methodological contribution -- reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much -- a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.
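For readers unfamiliar with the Kruskal-Wallis test quoted in this abstract, here is a minimal sketch of the H statistic in plain Python. It is an illustration only: the function names and the toy scores are assumptions, ties get average ranks but no tie correction is applied, and this is not the authors' analysis code. With three groups the statistic is compared against a chi-square distribution with 2 degrees of freedom, whose upper tail has the closed form exp(-H/2).

```python
import math


def kruskal_wallis_h(*groups):
    """H statistic with average ranks for ties (no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the 1-based ranks i+1 .. j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


def p_value_three_groups(h):
    """Chi-square upper tail for 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Hypothetical scores for three models, invented for the example.
h_toy = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h_toy, p_value_three_groups(h_toy))

# Plugging in the H values quoted in the abstract.
for h in (4.33, 13.88, 16.63):
    print(h, p_value_three_groups(h))
```

Under that three-group assumption, the H values in the abstract map to p-values consistent with the ones it reports, including a value below 0.001 for H=16.63.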

83. 【2603.09996】There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

链接https://arxiv.org/abs/2603.09996

作者:Edibe Yilmaz,Kahraman Kostas

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:heritage language education, Turkish heritage language, educational processes introduces, processes introduces significant, introduces significant constraints

备注: 5 pages, 6 tables, conference

点击查看摘要

Abstract:The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models' capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B--14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.

84. 【2603.09995】Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

链接https://arxiv.org/abs/2603.09995

作者:Kewen Zhu,Zixi Liu,Yanjing Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:presents unique challenges, Behavioral interview evaluation, Behavioral interview, behavioral interview question, require structured assessment

备注

点击查看摘要

Abstract:Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain-of-thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question and answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human-in-the-loop and automated chain-of-thought improvement. Using a within-subject paired design with n = 50, both approaches show positive rating improvements. The human-in-the-loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p < 0.001) and authenticity improves from 2.94 to 4.53 (p < 0.001, Cohen's d = 3.21). The human-in-the-loop method also requires five times fewer iterations (1.0 versus 5.0, p < 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human-in-the-loop approach achieving a 100% success rate compared to 84% for automated approaches among initially weak answers (Cohen's h = 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain-of-thought prompting provides a useful foundation for interview evaluation, domain-specific enhancements and context-aware approach selection are essential for realistic and pedagogically valuable results.
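The Cohen's h effect size quoted for the two success rates can be reproduced from its standard definition, the difference of arcsine-transformed proportions. The sketch below is a worked check in plain Python; the function name `cohens_h` is chosen for the example and is not from the paper.

```python
import math


def cohens_h(p1, p2):
    """Effect size for two proportions via the arcsine transformation."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Success rates quoted in the abstract: 100% versus 84%.
h = cohens_h(1.00, 0.84)
print(round(h, 2))
```

By Cohen's conventional thresholds, values around 0.8 and above are read as a large effect, which matches the abstract's wording.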

85. 【2603.09994】Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

链接https://arxiv.org/abs/2603.09994

作者:Ruchira Dhar,Qiwei Peng,Anders Søgaard

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:considered central, language abilities, Abstract, language, Compositionality

备注: Under Review

点击查看摘要

Abstract:Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.

86. 【2603.09993】CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

链接https://arxiv.org/abs/2603.09993

作者:Jon Chun,Hannah Sussman,Adrian Mangine,Murathan Kocaman,Kirill Sidorko,Abhigya Koirala,Andre McCloud,Gwen Eisenbeis,Wisdom Akanwe,Moustapha Gassama,Eliezer Gonzalez Chirinos,Anne-Duncan Enright,Peter Dunson,Tiffanie Ng,Anna von Rosenstiel,Godwin Idowu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:inferring intended meaning, underpins everyday communication, large language models, inferring intended, literal semantics

备注: 38 pages, 10 figures

点击查看摘要

Abstract:Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss' kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.
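As a reference for the agreement figures in this abstract, here is a minimal sketch of Fleiss' kappa in plain Python. The input layout (one row per scenario, one column per label, cells counting how many annotators chose that label) and the toy counts are assumptions for illustration; this is not the benchmark's own code.

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of annotators who gave item i the label j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumes every item has the same rater count
    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall label proportions.
    n_labels = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_labels)
    )
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 scenarios, 3 annotators, 2 labels, always a 2-1 split.
toy_counts = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
print(kappa)
```

Kappa is 1 when all annotators agree on every item and falls to zero or below when agreement is no better than the label frequencies alone would produce, which is why values such as 0.06 to 0.25 are described as low.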

87. 【2603.09992】TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

链接https://arxiv.org/abs/2603.09992

作者:Izzat Alsmadi,Anas Alsobeh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:paper presents TAMUSA-Chat, building domain-adapted large, domain-adapted large language, large language model, presents TAMUSA-Chat

备注

点击查看摘要

Abstract:This paper presents TAMUSA-Chat, a research-oriented framework for building domain-adapted large language model conversational systems. The work addresses critical challenges in adapting general-purpose foundation models to institutional contexts through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper-parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine-tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality-cost trade-offs. The publicly available codebase at this https URL supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.

88. 【2603.09991】PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

链接https://arxiv.org/abs/2603.09991

作者:Stephen Afrifa,Biswash Khatiwada,Kapalik Khanal,Sanjay Shah,Lingjuan Wang-Li,Ramesh Bahadur Bist

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:affordable animal protein, global poultry industry, intensified public discourse, public discourse surrounding, surrounding production practices

备注

点击查看摘要

Abstract:The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while a contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including a convolutional neural network and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.

89. 【2603.09990】A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

链接https://arxiv.org/abs/2603.09990

作者:Ana Begnini,Matheus Vicente,Leonardo Souza

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:establish NonDisclosure Agreements, NonDisclosure Agreements, common to establish, establish NonDisclosure, Agreements

备注: 14 pages, 2 figures, 3 tables. Published at STIL @ BRACIS 2025

点击查看摘要

Abstract:In business-to-business relations, it is common to establish NonDisclosure Agreements (NDAs). However, these documents exhibit significant variation in format, structure, and writing style, making manual analysis slow and error-prone. We propose an architecture based on LLMs to automate the segmentation and clause classification within these contracts. We employed two models: LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction) and a fine-tuned Legal-Roberta-Large for clause classification. In the segmentation task, we achieved a ROUGE F1 of 0.95 +/- 0.0036; for classification, we obtained a weighted F1 of 0.85, demonstrating the feasibility and precision of the approach.
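The weighted F1 reported for clause classification averages per-class F1 scores in proportion to how often each class appears in the gold labels. A minimal sketch in plain Python follows; the function name, the clause labels, and the predictions are hypothetical and chosen only to show the computation, not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no hits.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += n * f1
    return total / len(y_true)


# Hypothetical clause labels, invented for the example.
gold = ["confidentiality", "confidentiality", "confidentiality", "term"]
pred = ["confidentiality", "confidentiality", "term", "term"]
score = weighted_f1(gold, pred)
print(score)
```

Because frequent classes dominate the average, a weighted F1 can look strong even when rare clause types are classified poorly, so it is usually worth reading alongside a macro-averaged score.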

90. 【2603.09989】The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Link: https://arxiv.org/abs/2603.09989

Authors: Heimo Müller, Dominik Steiger, Markus Plass, Andreas Holzinger

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: System Usability Scale, System Causability Scale, large language models, System Hallucination Scale, assessing hallucination-related behavior

Abstract:We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's $\alpha = 0.87$) and significant inter-dimension correlations ($p < 0.001$). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.
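
The internal-consistency figure quoted in this abstract is Cronbach's alpha. As a rough illustration of how such a value is computed from questionnaire responses, here is a minimal sketch. The function name and the toy ratings are hypothetical and are not taken from the paper; the sketch assumes sample variances and a complete response matrix with no missing answers.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(scores[0])
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: three respondents answering two items.
print(cronbach_alpha([[1, 2], [2, 3], [3, 5]]))
```

Values close to 1 indicate that the items move together across respondents, which is the sense in which the abstract reports internal consistency.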

91. 【2603.09988】Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

Link: https://arxiv.org/abs/2603.09988

Authors: Ajay Pravin Mahale

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Mechanistic interpretability identifies, interpretability identifies internal, identifies internal circuits, internal circuits responsible, Mechanistic interpretability

Notes: 8 pages, 7 figures, 4 tables. MSc thesis work conducted at Hochschule Trier (2026). Code will be released upon publication

Abstract:Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level analysis and natural language explanations by (i) identifying causally important attention heads via activation patching, (ii) generating explanations using both template-based and LLM-based methods, and (iii) evaluating faithfulness using ERASER-style metrics adapted for circuit-level attribution. We evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small (124M parameters), identifying six attention heads accounting for 61.4% of the logit difference. Our circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperform template baselines by 64% on quality metrics. We find no correlation (r = 0.009) between model confidence and explanation faithfulness, and identify three failure categories explaining when explanations diverge from mechanisms.

92. 【2603.09987】Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

Link: https://arxiv.org/abs/2603.09987

Authors: Xinyuan Wang, Kunpeng Liu, Arun Vignesh Malarkkan, Yanjie Fu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: advance downstream predictive, improves feature space, feature space quality, core data-centric, data-centric AI task

Abstract:Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.

93. 【2603.09986】Quantifying Hallucinations in Language Language Models on Medical Textbooks

Link: https://arxiv.org/abs/2603.09986

Authors: Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: natural language processing, unsupported claims, factually incorrect, incorrect and unsupported, problem within natural

Notes: 9 pages, 4 figures

Abstract:Hallucinations, the tendency of large language models to provide responses with factually incorrect and unsupported claims, are a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first to determine the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second to determine the prevalence of hallucinations and clinician preference for model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility; in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($\rho=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $\kappa=0.92$) and ($\tau_b=0.06$ to $0.18$, $\kappa=0.57$ to $0.61$) for experiments 1 and 2, respectively.
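
The clinician agreement reported in this abstract uses quadratic weighted kappa, which penalizes disagreements by the squared distance between ordinal ratings. The sketch below shows the standard computation under the assumption that ratings are integers in `0 .. num_categories - 1`. The function name and the example ratings are hypothetical and not drawn from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted kappa for two raters with ordinal labels 0..K-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or num_categories < 2:
        raise ValueError("need two equal-length, non-empty rating lists and K >= 2")
    k = num_categories
    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical 3-point ratings from two annotators.
print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 2], 3))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.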

94. 【2603.09985】The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Link: https://arxiv.org/abs/2603.09985

Authors: Sudipta Ghosh, Mrityunjoy Panday

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, demonstrated remarkable capabilities, remains poorly understood, Claude Haiku

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
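
Expected Calibration Error, the metric behind the 0.726 and 0.122 figures above, is the sample-weighted average gap between stated confidence and actual accuracy within confidence bins. The following is a minimal sketch of the common equal-width binning variant. The bin count of 10 and the function name are assumptions, since the abstract does not state which binning the authors used.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted probabilities of being correct, one per trial.
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two equal-length, non-empty lists")
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of trials it contains.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: 90% stated confidence, 25% accuracy.
print(expected_calibration_error([0.9] * 4, [True, False, False, False]))
```

A model that is highly confident but rarely right, as the abstract describes for the weakest model, produces a large gap in the high-confidence bins and therefore a large ECE.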

95. 【2603.09984】An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language

Link: https://arxiv.org/abs/2603.09984

Authors: Vuong M. Ngo, Cach N. Dang, Kien V. Nguyen, Mark Roantree

Categories: Computation and Language (cs.CL)

Keywords: allowing free expression, expanded social media, allowing free, global population, digital age

Notes: 10 pages, 7 figures

Abstract:The digital age has expanded social media and online forums, allowing free expression for nearly 45% of the global population. Yet, it has also fueled online harassment, bullying, and harmful behaviors like hate speech and toxic comments across social networks, messaging apps, and gaming communities. Studies show 65% of parents notice hostile online behavior, and one-third of adolescents in mobile games experience bullying. A substantial volume of abusive content is generated and shared daily, not only on the surface web but also within dark web forums. Creators of abusive comments often employ specific words or coded phrases to evade detection and conceal their intentions. To address these challenges, we propose a hybrid deep learning model that integrates BERT, CNN, and LSTM architectures with a ReLU activation function to detect abusive language across multiple online platforms, including YouTube comments, online forum discussions, and dark web posts. The model demonstrates strong performance on a diverse and imbalanced dataset containing 77,620 abusive and 272,214 non-abusive text samples (ratio 1:3.5), achieving approximately 99% across evaluation metrics such as Precision, Recall, Accuracy, F1-score, and AUC. This approach effectively captures semantic, contextual, and sequential patterns in text, enabling robust detection of abusive content even in highly skewed datasets, as encountered in real-world scenarios.

96. 【2603.09983】MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

Link: https://arxiv.org/abs/2603.09983

Authors: Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: models enable scalable, enable scalable performance, face severe memory, severe memory constraints, models enable

Abstract:Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in TPS over the SOTA SD-based baseline, and an average 4.04x speedup over all standard baselines. Code is available at this https URL .

97. 【2603.09982】AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

Link: https://arxiv.org/abs/2603.09982

Authors: Omar Elshehy, Omer Nacar, Abdelbasset Djamai, Muhammed Ragab, Khloud Al Jallad, Mona Abdelazim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Encoder-only transformer models, transformer models remain, models remain widely, recent architectural advances, Encoder-only transformer

Notes: 9 pages, 1 figure. Accepted at AbjadNLP Workshop, EACL 2026

Abstract:Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.

98. 【2603.09981】Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

Link: https://arxiv.org/abs/2603.09981

Authors: Tairan Fu, Javier Conde, Pedro Reviriego, Javier Coronado-Blázquez, Nina Melero, Elena Merino-Gómez

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, Language Processing, Natural Language, task in Natural, Large Language Models

Abstract:Summarization is a core task in Natural Language Processing (NLP). Recent advances in Large Language Models (LLMs) and the introduction of large context windows reaching millions of tokens make it possible to process entire books in a single prompt. At the same time, for well-known books, LLMs can generate summaries based only on internal knowledge acquired during training. This raises several important questions: How do summaries generated from internal memory compare to those derived from the full text? Does prior knowledge influence summaries even when the model is given the book as input? In this work, we conduct an experimental evaluation of book summarization with state-of-the-art LLMs. We compare summaries of well-known books produced using (i) only the internal knowledge of the model and (ii) the full text of the book. The results show that having the full text provides more detailed summaries in general, but some books have better scores for the internal knowledge summaries. This puts into question the capabilities of models to perform summarization of long texts, as information learned during training can outperform summarization of the full text in some cases.

99. 【2603.09980】Explainable LLM Unlearning Through Reasoning

Link: https://arxiv.org/abs/2603.09980

Authors: Junfeng Liao, Qizhou Wang, Shanshan Ye, Xin Yu, Ling Chen, Zhen Fang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: pre-trained large language, large language models, mitigating safety, unlearning, essential for mitigating

Abstract:LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit way by removing undesirable knowledge characterized by specific unlearning datasets. In previous works, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among many others. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted reasoning unlearning (TRU), which leverages reasoning-based unlearning target as guidance. We employ the target using a cross-entropy supervised loss combined with a GA-based loss, enabling the model to learn reasoning ability for precise knowledge removal while preserving unrelated abilities. We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, stemming from the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.

100. 【2603.09979】GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

Link: https://arxiv.org/abs/2603.09979

Authors: Ghazal Kalhor, Yadollah Yaghoobzadeh

Categories: Computation and Language (cs.CL)

Keywords: Iranian cultural practice, Persian poetry plays, role in Iranian, Iranian cultural, Hafez are frequently

Abstract:Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at this https URL.

101. 【2603.09800】MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations

Link: https://arxiv.org/abs/2603.09800

Authors: Abhishikth Mallampalli, Sridhara Dasu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)

Keywords: Compact Muon Solenoid, Muon Solenoid, Compact Muon, Large-scale scientific collaborations, Large-scale scientific

Notes: Accepted at NeurIPS 2025 Machine Learning for the Physical Sciences workshop and Lepton Photon conference 2025 (Computing AI/ML track)

Abstract:Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA's entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype's superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.

102. 【2603.09117】Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Link: https://arxiv.org/abs/2603.09117

Authors: Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Reinforcement Learning, Verifiable Rewards, Learning from Verifiable, significantly enhances large, large language models

Notes: 9 pages, 8 figures

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies have focused on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.

103. 【2603.08899】ConFu: Contemplate the Future for Better Speculative Sampling

Link: https://arxiv.org/abs/2603.08899

Authors: Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: accelerate large language, employing lightweight draft, large language model, lightweight draft models, draft models

Notes: accepted at ICLR 2026 workshop on Latent Implicit Thinking - Going Beyond CoT Reasoning

Abstract:Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8--11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
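
The draft-then-verify loop that this abstract builds on can be sketched in a few lines. The example below is a simplified greedy variant, not ConFu or EAGLE: it accepts drafted tokens while they match the target model's own greedy choice and substitutes the target's token at the first mismatch. The callables `draft_next` and `target_next` are hypothetical stand-ins for real models, and a real system would verify all drafted positions in one batched forward pass instead of calling the target once per token.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """One greedy speculative decoding step; returns the tokens to append.

    draft_next / target_next: functions mapping a token list to the next token.
    """
    # Draft phase: the cheap model proposes num_draft tokens autoregressively.
    drafted = []
    context = list(prefix)
    for _ in range(num_draft):
        token = draft_next(context)
        drafted.append(token)
        context.append(token)

    # Verify phase: keep drafted tokens while the target model agrees.
    accepted = []
    context = list(prefix)
    for token in drafted:
        expected = target_next(context)
        if token != expected:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # Every draft was accepted, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


# Toy models: the target counts 0, 1, 2, 0, 1, 2, ...; the draft drifts after two tokens.
def toy_target(context):
    return len(context) % 3


def toy_draft(context):
    return len(context) % 3 if len(context) < 2 else 0


print(speculative_step([], toy_draft, toy_target))
```

The more drafted tokens survive verification, the more tokens each target pass yields, which is why the abstract treats acceptance rate as the key quantity.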

104. 【2505.17862】Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Link: https://arxiv.org/abs/2505.17862

Authors: Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent Multimodal Large, Multimodal Large Language, Large Language Models, Recent Multimodal, Multimodal Large

Abstract:Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model--modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular diagnostic baseline that composes off-the-shelf unimodal models to serve as a diagnostic baseline and to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.

105. 【2603.10371】Speech Codec Probing from Semantic and Phonetic Perspectives

Link: https://arxiv.org/abs/2603.10371

Authors: Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma, Shrikanth Narayanan

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: large language models, language models, essential for connecting, large language, multimodal systems

Abstract:Speech tokenizers are essential for connecting speech to large language models (LLMs) in multimodal systems. These tokenizers are expected to preserve both semantic and acoustic information for downstream understanding and generation. However, emerging evidence suggests that what is termed "semantic" in speech representations does not align with text-derived semantics: a mismatch that can degrade multimodal LLM performance. In this paper, we systematically analyze the information encoded by several widely used speech tokenizers, disentangling their semantic and phonetic content through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA. Our results show that current tokenizers primarily capture phonetic rather than lexical-semantic structure, and we derive practical implications for the design of next-generation speech tokenization methods.
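
CKA (centered kernel alignment), named above as a cross-modal alignment metric, compares two sets of representations computed for the same examples. Below is a minimal pure-Python sketch of the linear variant, assuming each input is a list of feature vectors with one row per example. The function names and toy data are hypothetical, and the abstract does not say which CKA variant the authors use.

```python
def _center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in matrix]


def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-aligned matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[r][i] * b[r][j] for r in range(len(a)))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same number (>= 2) of examples")
    xc, yc = _center_columns(x), _center_columns(y)
    denominator = (_cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc)) ** 0.5
    if denominator == 0:
        raise ValueError("CKA is undefined for constant representations")
    return _cross_frobenius_sq(yc, xc) / denominator


# Hypothetical 2-d "speech" features against a rescaled copy of themselves.
speech = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
rescaled = [[2.0 * v for v in row] for row in speech]
print(linear_cka(speech, rescaled))
```

The score lies between 0 and 1 and is unchanged by rescaling or rotating either representation, which makes it usable for comparing spaces of different dimensionality.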

106. 【2603.10175】Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

Link: https://arxiv.org/abs/2603.10175

Authors: Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: Explainable speech quality, assessment requires moving, Explainable speech, analyze underlying perceptual, underlying perceptual dimensions

Notes: Submitted to Interspeech 2026

Abstract:Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors the foundational Audio Large Language Model for multidimensional reasoning, detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to heavily enhance accuracy of descriptions and temporal localization of quality issues. With this approach we reach state-of-the-art results of 0.71 mean PCC score on the multidimensional QualiSpeech benchmark and 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model's ability to pinpoint and classify audio artifacts in time.
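
The core idea of Group Relative Policy Optimization (GRPO) is to score each sampled response against the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows only that advantage-normalization step. The function name, the epsilon term, and the use of the population standard deviation are assumptions, and the dimension-specific rewards described in the abstract are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    if not rewards:
        raise ValueError("need at least one reward")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(reward - group_mean) / (group_std + eps) for reward in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group average receive a positive advantage and are reinforced, while those below average are discouraged. A group with identical rewards yields zero advantages and therefore no learning signal.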

Information Retrieval

1. 【2603.11031】Chasing RATs: Tracing Reading for and as Creative Activity

Link: https://arxiv.org/abs/2603.11031

Authors: Sophia Liu, Shm Garanganao Almeda

Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM); Social and Information Networks (cs.SI)

Keywords: Creativity research, Reading Activity Traces, research has privileged, privileged making, interpretive labor

Abstract:Creativity research has privileged making over the interpretive labor that precedes and shapes it. We introduce Reading Activity Traces (RATs), a proposal that treats reading -- broadly defined to include navigating, interpreting, and curating media across interconnected sources -- as creative activity both for future artifacts and as a form of creation in its own right. By tracing trajectories of traversal, association, and reflection as inspectable artifacts, RATs render visible the creative work that algorithmic feeds and AI summarization increasingly compress and automate away. We illustrate this through WikiRAT, a speculative instantiation on Wikipedia, and open new ground for reflective practice, reader modeling, collective sensemaking, and understanding what is lost when human interpretation is automated -- towards designing intelligent tools that preserve it.

2. 【2603.11025】LLMGreenRec: LLM-Based Multi-Agent Recommender System for Sustainable E-Commerce

Link: https://arxiv.org/abs/2603.11025

Authors: Hao N. Nguyen, Hieu M. Nguyen, Son Van Nguyen, Nguyen Thi Hanh

Categories: Multiagent Systems (cs.MA); Information Retrieval (cs.IR)

Keywords: Rising environmental awareness, e-commerce necessitates recommender, Rising environmental, necessitates recommender systems, digital carbon footprints

Notes: Accepted to the Proceedings of the Conference on Digital Economy and Fintech Innovation (DEFI 2025). To appear in IEEE Xplore

Abstract:Rising environmental awareness in e-commerce necessitates recommender systems that not only guide users to sustainable products but also minimize their own digital carbon footprints. Traditional session-based systems, optimized for short-term conversions, often fail to capture nuanced user intents for eco-friendly choices, perpetuating a gap between green intentions and actions. To tackle this, we introduce LLMGreenRec, a novel multi-agent framework that leverages Large Language Models (LLMs) to promote sustainable consumption. Through collaborative analysis of user interactions and iterative prompt refinement, LLMGreenRec's specialized agents deduce green-oriented user intents and prioritize eco-friendly product recommendations. Notably, this intent-driven approach also reduces unnecessary interactions and energy consumption. Extensive experiments on benchmark datasets validate LLMGreenRec's effectiveness in recommending sustainable products, demonstrating a robust solution that fosters a responsible digital economy.

3. 【2603.11008】A Systematic Study of Pseudo-Relevance Feedback with LLMs

Link: https://arxiv.org/abs/2603.11008

Authors: Nour Jedidi, Jimmy Lin

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: large language models, Pseudo-relevance feedback, key design dimensions, feedback, LLM PRF methods

Abstract:Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from, and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impacts PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.

4. 【2603.10891】A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

链接https://arxiv.org/abs/2603.10891

作者:Yichi Zhu,Kan Ling,Xu Liu,Hengrun Zhang,Huiqun Yu,Guisheng Fan

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Medication errors pose, Large Language Models, Medication errors, final safeguard, patient safety

备注: 11 pages, 7 figures. Framework for safe prescription auditing and hybrid knowledge-grounded reasoning

点击查看摘要

Abstract:Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show the promise of using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.

5. 【2603.10876】An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

链接https://arxiv.org/abs/2603.10876

作者:Jennifer D'Souza,Sameer Sadruddin,Maximilian Kähler,Andrea Salfinger,Luca Zaccagna,Francesca Incitti,Lauro Snidaro,Osma Suominen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:Subject indexing, Integrated Authority File, indexing is vital, vital for discovery, discovery but hard

备注: 9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

点击查看摘要

Abstract:Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.

6. 【2603.10784】Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

链接https://arxiv.org/abs/2603.10784

作者:Weihang Huang,Mengna Liu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:computational approaches operate, opaque classifiers offering, figurative language processing, language processing, judged metaphorical

备注

点击查看摘要

Abstract:Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols--MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification--as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
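
The cross-protocol agreement figures above are Cohen's kappa values, which correct raw agreement for the agreement expected by chance. A minimal sketch of the standard two-rater formula follows. The function name and the toy labels are hypothetical and are not data from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels: 1 = metaphorical, 0 = literal.
protocol_x = [1, 1, 0, 0]
protocol_y = [1, 0, 1, 0]
kappa_xy = cohens_kappa(protocol_x, protocol_y)  # agreement at chance level
kappa_xx = cohens_kappa(protocol_x, protocol_x)  # perfect agreement
```

A kappa near zero, as reported for Protocols A and D, therefore means the two protocols agree no more often than their label frequencies alone would predict.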

7. 【2603.10765】RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems

链接https://arxiv.org/abs/2603.10765

作者:Shaobo Li,Yirui Zhou,Yuan Xu,Kevin Chen,Daniel Waddington,Swaminathan Sundararaman,Hubertus Franke,Jian Huang

类目:Performance (cs.PF); Information Retrieval (cs.IR)

关键词:RAG pipelines, system benchmarking, system behaviors, framework for characterizing, present the design

备注: The codebase of RAGPerf is available at [this https URL](https://github.com/platformxlab/RAGPerf)

点击查看摘要

Abstract:We present the design and implementation of a RAG-based AI system benchmarking (RAGPerf) framework for characterizing the system behaviors of RAG pipelines. To facilitate detailed profiling and fine-grained performance analysis, RAGPerf decouples the RAG workflow into several modular components - embedding, indexing, retrieval, reranking, and generation. RAGPerf offers the flexibility for users to configure the core parameters of each component and examine their impact on the end-to-end query performance and quality. RAGPerf has a workload generator to model real-world scenarios by supporting diverse datasets (e.g., text, pdf, code, and audio), different retrieval and update ratios, and query distributions. RAGPerf also supports different embedding models, major vector databases such as LanceDB, Milvus, Qdrant, Chroma, and Elasticsearch, as well as different LLMs for content generation. It automates the collection of performance metrics (i.e., end-to-end query throughput, host/GPU memory footprint, and CPU/GPU utilization) and accuracy metrics (i.e., context recall, query accuracy, and factual consistency). We demonstrate the capabilities of RAGPerf through a comprehensive set of experiments and open source its codebase at GitHub. Our evaluation shows that RAGPerf incurs negligible performance overhead.

8. 【2603.10700】Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval

链接https://arxiv.org/abs/2603.10700

作者:Andrea Volpini,Elie Raad,Beatrice Gamba,David Riccitelli

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Linked Data Platform, Retrieval-Augmented Generation, systems typically treat, structured linked data, typically treat documents

备注: 33 pages, 7 figures, reproducibility appendix, dataset/evaluation framework/enhanced entity page templates released with the paper

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems typically treat documents as flat text, ignoring the structured metadata and linked relationships that knowledge graphs provide. In this paper, we investigate whether structured linked data, specifically this http URL markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic RAG systems. We conduct a controlled experiment across four domains (editorial, legal, travel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and the Google Agent Development Kit (ADK) for agentic reasoning. Our experimental design tests seven conditions: three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) crossed with two retrieval modes (standard RAG and agentic RAG with multi-hop link traversal), plus an Enhanced+ condition that adds rich navigational affordances and entity interlinking. Our results reveal that while JSON-LD markup alone provides only modest improvements, our enhanced entity page format, incorporating this http URL-style agent instructions, breadcrumbs, and neural search capabilities, achieves substantial gains: +29.6% accuracy improvement for standard RAG and +29.8% for the full agentic pipeline. The Enhanced+ variant, with richer navigational affordances, achieves the highest absolute scores (accuracy: 4.85/5, completeness: 4.55/5), though the incremental gain over the base enhanced format is not statistically significant. We release our dataset, evaluation framework, and enhanced entity page templates to support reproducibility.

9. 【2603.10673】Breaking User-Centric Agency: A Tri-Party Framework for Agent-Based Recommendation

链接https://arxiv.org/abs/2603.10673

作者:Yaxin Gong,Chongming Gao,Chenxiao Fan,Wenjie Wang,Fuli Feng,Xiangnan He

类目:Information Retrieval (cs.IR)

关键词:large language models, enabling language-driven interaction, expressive preference modeling, stimulated growing interest, Recent advances

备注

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have stimulated growing interest in agent-based recommender systems, enabling language-driven interaction and reasoning for more expressive preference modeling. However, most existing agentic approaches remain predominantly user-centric, treating items as passive entities and neglecting the interests of other critical stakeholders. This limitation exacerbates exposure concentration and long-tail under-representation, threatening long-term system sustainability. In this work, we identify this fundamental limitation and propose the first Tri-party LLM-agent Recommendation framework (TriRec) that explicitly coordinates user utility, item exposure, and platform-level fairness. The framework employs a two-stage architecture: Stage 1 empowers item agents with personalized self-promotion to improve matching quality and alleviate cold-start barriers, while Stage 2 uses a platform agent for sequential multi-objective re-ranking, balancing user relevance, item utility, and exposure fairness. Experiments on multiple benchmarks show consistent gains in accuracy, fairness, and item-level utility. Moreover, we find that item self-promotion can simultaneously enhance fairness and effectiveness, challenging the conventional trade-off assumption between relevance and fairness. Our code is available at this https URL.

10. 【2603.10625】A Hypergraph-Based Framework for Exploratory Business Intelligence

链接https://arxiv.org/abs/2603.10625

作者:Yunkai Lou,Shunyang Li,Longbin Lai,Jianke Yu,Wenyuan Yu,Ying Zhang

类目:Databases (cs.DB); Information Retrieval (cs.IR)

关键词:Business Intelligence, multi-round exploration paradigm, analysts progressively refine, analysis is evolving, multi-round exploration

备注

点击查看摘要

Abstract:Business Intelligence (BI) analysis is evolving towards Exploratory BI, an iterative, multi-round exploration paradigm where analysts progressively refine their understanding. However, traditional BI systems impose critical limits for Exploratory BI: heavy reliance on expert knowledge, high computational costs, static schemas, and lack of reusability. We present ExBI, a novel system that introduces the hypergraph data model with operators, including Source, Join, and View, to enable dynamic schema evolution and materialized view reuse. Using sampling-based algorithms with provable estimation guarantees, ExBI addresses the computational bottlenecks, while maintaining analytical accuracy. Experiments on LDBC datasets demonstrate that ExBI achieves significant speedups over existing systems: on average 16.21x (up to 146.25x) compared to Neo4j and 46.67x (up to 230.53x) compared to MySQL, while maintaining high accuracy with an average error rate of only 0.27% for COUNT, enabling efficient and accurate large-scale exploratory BI workflows.

11. 【2603.10600】Trajectory-Informed Memory Generation for Self-Improving Agent Systems

链接https://arxiv.org/abs/2603.10600

作者:Gaodan Fang,Vatche Isahagian,K. R. Jayaram,Ritesh Kumar,Vinod Muthusamy,Punleuk Oum,Gegi Thomas

类目:Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)

关键词:LLM-powered agents face, improve future performance, persistent challenge, face a persistent, improve future

备注

点击查看摘要

Abstract:LLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance -- strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions, and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5 pp scenario goal improvement, a 149% relative increase).

12. 【2603.10471】Modeling Stage-wise Evolution of User Interests for News Recommendation

链接https://arxiv.org/abs/2603.10471

作者:Zhiyong Cheng,Yike Jin,Zhijie Zhang,Huilin Chen,Zhangling Duan,Meng Wang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:shifting real-world contexts, trending topics, highly time-sensitive, emerging events, driven by emerging

备注: ACM Web Conference 2026 Accepted

点击查看摘要

Abstract:Personalized news recommendation is highly time-sensitive, as user interests are often driven by emerging events, trending topics, and shifting real-world contexts. These dynamics make it essential to model not only users' long-term preferences, which reflect stable reading habits and high-order collaborative patterns, but also their short-term, context-dependent interests that change rapidly over time. However, most existing approaches rely on a single static interaction graph, which struggles to capture both long-term preference patterns and short-term interest changes as user behavior evolves. To address this challenge, we propose a unified framework that learns user preferences from both global and local temporal perspectives. A global preference modeling component captures long-term collaborative signals from the overall interaction graph, while a local preference modeling component partitions historical interactions into stage-wise temporal subgraphs to represent short-term dynamics. Within this module, an LSTM branch models the progressive evolution of recent interests, and a self-attention branch captures long-range temporal dependencies. Extensive experiments on two large-scale real-world datasets show that our approach consistently outperforms strong baselines and delivers fresher and more relevant recommendations across diverse user behaviors and temporal settings.

13. 【2603.10409】Differentiable Geometric Indexing for End-to-End Generative Retrieval

链接https://arxiv.org/abs/2603.10409

作者:Xujing Wang,Yufeng Chen,Boxuan Zhang,Jie Zhao,Chao Wei,Cai Xu,Ziyu Guan,Wei Zhao,Weiru Zhang,Xiaoyi Zeng

类目:Information Retrieval (cs.IR)

关键词:single probabilistic framework, probabilistic framework, promising paradigm, paradigm to unify, single probabilistic

备注

点击查看摘要

Abstract:Generative Retrieval (GR) has emerged as a promising paradigm to unify indexing and search within a single probabilistic framework. However, existing approaches suffer from two intrinsic conflicts: (1) an Optimization Blockage, where the non-differentiable nature of discrete indexing creates a gradient blockage, decoupling index construction from the downstream retrieval objective; and (2) a Geometric Conflict, where standard unnormalized inner-product objectives induce norm-inflation instability, causing popular "hub" items to geometrically overshadow relevant long-tail items. To systematically resolve these misalignments, we propose Differentiable Geometric Indexing (DGI). First, to bridge the optimization gap, DGI enforces Operational Unification. It employs Soft Teacher Forcing via Gumbel-Softmax to establish a fully differentiable pathway, combined with Symmetric Weight Sharing to effectively align the quantizer's indexing space with the retriever's decoding space. Second, to restore geometric fidelity, DGI introduces Isotropic Geometric Optimization. We replace inner-product logits with scaled cosine similarity on the unit hypersphere to effectively decouple popularity bias from semantic relevance. Extensive experiments on large-scale industry search datasets and an online e-commerce platform demonstrate that DGI outperforms competitive sparse, dense, and generative baselines. Notably, DGI exhibits superior robustness in long-tail scenarios, validating the necessity of harmonizing structural differentiability with geometric isotropy.
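
The geometric point can be seen in a toy sketch: with unnormalized inner-product scores, an item with a large embedding norm can outrank a better-aligned item, whereas scaled cosine similarity depends only on direction. This is an illustration of the general idea, not the paper's implementation. The function names, the `scale` value, and the vectors are assumptions.

```python
import math


def inner_product(q, d):
    """Unnormalized inner-product score."""
    return sum(a * b for a, b in zip(q, d))


def scaled_cosine(q, d, scale=10.0):
    """Cosine similarity multiplied by a temperature-like scale (illustrative)."""
    norm_q = math.sqrt(inner_product(q, q))
    norm_d = math.sqrt(inner_product(d, d))
    return scale * inner_product(q, d) / (norm_q * norm_d)


query = [1.0, 0.0]
hub_item = [3.0, 4.0]    # large norm, only loosely aligned with the query
tail_item = [0.9, 0.1]   # small norm, closely aligned with the query

ip_hub = inner_product(query, hub_item)
ip_tail = inner_product(query, tail_item)
cos_hub = scaled_cosine(query, hub_item)
cos_tail = scaled_cosine(query, tail_item)

# Inner product favors the high-norm hub item; cosine favors the aligned item.
print(ip_hub > ip_tail, cos_tail > cos_hub)
```

Under the inner product the hub item wins because of its norm alone, while after normalization the ranking reflects alignment with the query.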

14. 【2603.10369】Beyond Interleaving: Causal Attention Reformulations for Generative Recommender Systems

链接https://arxiv.org/abs/2603.10369

作者:Hailing Cheng

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Generative Recommender Systems, Recommender Systems, sequence generation task, increasingly model user, model user behavior

备注: 8 pages, 8 figures, submitted to KDD 2026

点击查看摘要

Abstract:Generative Recommender Systems (GR) increasingly model user behavior as a sequence generation task by interleaving item and action tokens. While effective, this formulation introduces significant structural and computational inefficiencies: it doubles sequence length, incurs quadratic overhead, and relies on implicit attention to recover the causal relationship between an item and its associated action. Furthermore, interleaving heterogeneous tokens forces the Transformer to disentangle semantically incompatible signals, leading to increased attention noise and reduced representation this http URL this work, we propose a principled reformulation of generative recommendation that aligns sequence modeling with underlying causal structures and attention theory. We demonstrate that current interleaving mechanisms act as inefficient proxies for similarity-weighted action pooling. To address this, we introduce two novel architectures that eliminate interleaved dependencies to reduce sequence complexity by 50%: Attention-based Late Fusion for Actions (AttnLFA) and Attention-based Mixed Value Pooling (AttnMVP). These models explicitly encode the $i_n \rightarrow a_n$ causal dependency while preserving the expressive power of Transformer-based sequence this http URL evaluate our framework on large-scale product recommendation data from a major social network. Experimental results show that AttnLFA and AttnMVP consistently outperform interleaved baselines, achieving evaluation loss improvements of 0.29% and 0.80%, and significant gains in Normalized Entropy (NE). Crucially, these performance gains are accompanied by training time reductions of 23% and 12%, respectively. Our findings suggest that explicitly modeling item-action causality provides a superior design paradigm for scalable and efficient generative ranking.

15. 【2603.10332】Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers

链接https://arxiv.org/abs/2603.10332

作者:Saron Samuel,Benjamin Van Durme,Eugene Yang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:demonstrated strong abilities, Fair Ranking Track, demonstrated strong, strong abilities, abilities in improving

备注: 17 pages

点击查看摘要

Abstract:While reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate that reasoning neither improves nor harms fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF), remained stable (0.33-0.35) across all models, even as relevance varied substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.

16. 【2603.09800】MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations

链接https://arxiv.org/abs/2603.09800

作者:Abhishikth Mallampalli,Sridhara Dasu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)

关键词:Compact Muon Solenoid, Muon Solenoid, Compact Muon, Large-scale scientific collaborations, Large-scale scientific

备注: Accepted at NeurIPS 2025 Machine Learning for the Physical Sciences workshop and Lepton Photon conference 2025 (Computing AI/ML track)

点击查看摘要

Abstract:Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA's entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype's superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.

计算机视觉

1. 【2603.11048】COMIC: Agentic Sketch Comedy Generation

链接https://arxiv.org/abs/2603.11048

作者:Susung Hong,Brian Curless,Ira Kemelmacher-Shlizerman,Steve Seitz

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)

关键词:Saturday Night Live, Night Live, Saturday Night, produces short comedic, short comedic videos

备注: Project page: [this https URL](https://susunghong.github.io/COMIC/)

点击查看摘要

Abstract:We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.

2. 【2603.11047】LiTo: Surface Light Field Tokenization

链接https://arxiv.org/abs/2603.11047

作者:Jen-Hao Rick Chang,Xiaoming Zhao,Dorian Chan,Oncel Tuzel

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

关键词:surface light field, jointly models object, view-dependent effects, geometry, view-dependent

备注: ICLR 2026; Project page: [this https URL](https://apple.github.io/ml-lito/)

点击查看摘要

Abstract:We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.

3. 【2603.11045】Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

链接https://arxiv.org/abs/2603.11045

作者:Tao Zhong,Yixun Hu,Dongzhe Zheng,Aditya Sood,Christine Allen-Blanchette

类目:Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)

关键词:Neural Field Thermal, Field Thermal Tomography, surface temperature measurements, propose Neural Field, transient surface temperature

备注: 27 pages, 15 figures

点击查看摘要

Abstract:We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at this https URL

4. 【2603.11044】Agentar-Fin-OCR

链接https://arxiv.org/abs/2603.11044

作者:Siyi Qian,Xiongfei Bai,Bingtao Fu,Yichen Lu,Gaoyang Zhang,Xudong Yang,Peng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Heading Hierarchy Reconstruction, Document-level Heading Hierarchy, parsing system tailored, transforming ultra-long financial, ultra-long financial PDFs

备注

点击查看摘要

Abstract:In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.

5. 【2603.11042】V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Link: https://arxiv.org/abs/2603.11042

Authors: Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)

Keywords: Generating music, fine-grained temporal control, lack fine-grained temporal, challenging for existing, temporally aligns

Comments: Project page: [this https URL](https://genjib.github.io/v2m_zero/)

Abstract: Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results in a large crowdsourced subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at this https URL.
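
To make the idea of an event curve concrete, here is a minimal sketch that scores temporal change as one minus the cosine similarity between consecutive embeddings. The abstract only says the curves come from intra-modal similarity of pretrained encoder features, so this particular formula, the function names, and the toy two-dimensional embeddings are all assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def event_curve(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Assumed change score per step: 1 - similarity of consecutive embeddings."""
    return [1.0 - cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]


# Toy stand-ins for per-frame (or per-audio-window) encoder outputs.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(event_curve(frames))  # a single peak where the content changes
```

Because the curve depends only on how much the embeddings change and not on what they encode, the same function can be applied to video features and to music features, which is the property the abstract relies on.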

6. 【2603.11041】DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Link: https://arxiv.org/abs/2603.11041

Authors: Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: driving VLA model, VLA model, paradigm termed Dynamics, CoT paradigm termed, Dynamics

Comments: 18 pages, 10 figures

Abstract:We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

7. 【2603.11024】Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

Link: https://arxiv.org/abs/2603.11024

Authors: Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: computer vision tasks, visual question answering, vision tasks, object detection, art historians

Comments: 12 pages, 12 figures

Abstract:VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.

8. 【2603.10990】Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

Link: https://arxiv.org/abs/2603.10990

Authors: Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: improved visual quality, photography remains challenging, greatly improved visual, real-world photography remains, Recent advances

Comments: Accepted by CVPR 2026

Abstract:Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at this https URL.

9. 【2603.10978】GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Link: https://arxiv.org/abs/2603.10978

Authors: Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision Language Models, Vision Language, visual reasoning tasks, exhibit persistent hallucinations, excluding sentiment

Abstract: Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6pp improvement, while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2-7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.

10. 【2603.10975】VCR: Variance-Driven Channel Recalibration for Robust Low-Light Enhancement

Link: https://arxiv.org/abs/2603.10975

Authors: Zhixin Cheng, Fangwen Zhang, Xiaotian Yin, Baoqun Yin, Haodian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sRGB-based LLIE methods, black noise artifacts, HSV color space, offers insufficient decoupling, introducing significant red

Abstract: Most sRGB-based LLIE methods suffer from entangled luminance and color, while the HSV color space offers insufficient decoupling at the cost of introducing significant red and black noise artifacts. Recently, the HVI color space has been proposed to address these limitations by enhancing color fidelity through chrominance polarization and intensity compression. However, existing methods can suffer from channel-level inconsistency between luminance and chrominance, and misaligned color distributions may lead to unnatural enhancement results. To address these challenges, we propose Variance-Driven Channel Recalibration for Robust Low-Light Enhancement (VCR), a novel framework for low-light image enhancement. VCR consists of two main components: the Channel Adaptive Adjustment (CAA) module, which employs variance-guided feature filtering to enhance the model's focus on regions with high intensity and color distribution, and the Color Distribution Alignment (CDA) module, which enforces distribution alignment in the color feature space. These designs enhance perceptual quality under low-light conditions. Experimental results on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods.

11. 【2603.10967】Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI

Link: https://arxiv.org/abs/2603.10967

Authors: Joan Perramon-Llussà, Amelia Jiménez-Sánchez, Grzegorz Skorupko, Fotis Avgoustidis, Carlos Martín-Isla, Karim Lekadir, Polyxeni Gkontra

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cardiac magnetic resonance, including cardiac magnetic, robust downstream performance, Foundation models, show great promise

Comments: 11 pages, 2 figures. Submitted to MICCAI 2026

Abstract: Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M&Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significant improvements in performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
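
The additive decomposition described above can be sketched with tiny matrices: each client adds a shared low-rank update and a private low-rank update to a frozen weight, and only the shared factors are averaged across sites. This is a toy illustration, not the authors' implementation; the helper names, the plain averaging of the `B` and `A` factors, and the rank-1 example values are all assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix]  # (B, A): B is d x r, A is r x k


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mean(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]


def effective_weight(w0: Matrix, global_adapter: Adapter, local_adapter: Adapter) -> Matrix:
    """Frozen weight plus shared and private low-rank updates."""
    bg, ag = global_adapter
    bl, al = local_adapter
    return add(add(w0, matmul(bg, ag)), matmul(bl, al))


def aggregate_global(client_globals: List[Adapter]) -> Adapter:
    """Server step: average only the shared factors; local adapters never leave a site."""
    return (mean([b for b, _ in client_globals]), mean([a for _, a in client_globals]))


w0 = [[1.0, 0.0], [0.0, 1.0]]
shared = ([[1.0], [0.0]], [[0.0, 2.0]])   # rank-1 global adapter
private = ([[0.0], [1.0]], [[3.0, 0.0]])  # rank-1 local adapter
print(effective_weight(w0, shared, private))
```

Averaging the two factors separately is a simplification chosen for brevity; it is not equivalent to averaging the products `B @ A`, and the abstract does not say which aggregation rule the paper uses.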

12. 【2603.10965】Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

Link: https://arxiv.org/abs/2603.10965

Authors: Jian Sun, Mohammad H. Mahoor

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mild Cognitive Impairment, quality significantly affects, significantly affects video, Video quality, Video quality significantly

Comments: 9 figures, 10 tables

Abstract: Video quality significantly affects video classification. We encountered this problem when classifying Mild Cognitive Impairment: classification worked well on clear videos but worse on blurred ones. This suggested that incorporating Video Quality Assessment (VQA) may improve video classification. This paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3). SSL-V3 leverages a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which commonly occurs in video datasets and makes it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL uses the video quality score as a factor that directly tunes the feature map of the video classification. The score then serves as an intersection point linking VQA and classification, so that the supervised classification task tunes the parameters of the VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in I-CONECT (a healthcare dataset involving facial videos), verifying SSL-V3's effectiveness.

13. 【2603.10963】Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Link: https://arxiv.org/abs/2603.10963

Authors: Konrad Szafer, Marek Kraft, Dominik Belter

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: leveraging extensive representation, extensive representation learning, point cloud data, grown in capability, language or vision

Comments: To appear in the proceedings of ACIVS 2025. An earlier version was presented at the SCI-FM workshop at ICLR 2025

Abstract:Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at this https URL.

14. 【2603.10935】Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors

Link: https://arxiv.org/abs/2603.10935

Authors: Zegu Zhang, Jian Zhang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: approximate posterior degenerates, Variational autoencoders, approximate posterior, posterior degenerates, frequently suffer

Comments: 15 pages, 6 figures

Abstract:Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier -- a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real-world datasets that our method achieves non-collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., $\sigma^{\prime 2} \lambda_{\max}$) and works with arbitrary neural architectures. The code is available at this https URL.

15. 【2603.10933】Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

Link: https://arxiv.org/abs/2603.10933

Authors: Qinxin Wu, Fucheng Niu, Hengchuan Zhu, Yifan Sun, Ye Shen, Xu Li, Han Wu, Leqi Liu, Zhiwen Pan, Zuozhu Liu, Fudong Zhu, Bin Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: volumetric CBCT interpretation, CBCT reporting remains, medical report generation, reporting remains limited, maxillofacial CBCT report-generation

Abstract:Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.

16. 【2603.10929】Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

Link: https://arxiv.org/abs/2603.10929

Authors: Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: enables continual policy, continual policy refinement, lifelong imitation learning, imitation learning framework, lifelong imitation

Abstract:We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: this https URL.

17. 【2603.10928】Novel Architecture of RPA In Oral Cancer Lesion Detection

Link: https://arxiv.org/abs/2603.10928

Authors: Revana Magdy, Joy Naoum, Ali Hamdi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate and early, diagnosis and treatment, oral cancer lesions, lesions is crucial, crucial for effective

Abstract: Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OC-RPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showcasing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection.
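
The per-image timings quoted above can be checked with simple arithmetic. The sketch below compares only OC-RPAv1 with OC-RPAv2; the 60-100x figure is stated relative to standard RPA methods, whose timings the abstract does not give, so it cannot be reproduced from these numbers. The helper name `speedup` is an assumption for illustration.

```python
def speedup(seconds_before: float, seconds_after: float) -> float:
    """Ratio of per-image processing times."""
    return seconds_before / seconds_after


V1_SECONDS_PER_IMAGE = 0.29  # OC-RPAv1, from the abstract
V2_SECONDS_PER_IMAGE = 0.06  # OC-RPAv2, from the abstract
TEST_IMAGES = 31             # test set size, from the abstract

ratio = speedup(V1_SECONDS_PER_IMAGE, V2_SECONDS_PER_IMAGE)
print(f"v1 -> v2 speedup: {ratio:.2f}x")
print(f"v1 total for test set: {TEST_IMAGES * V1_SECONDS_PER_IMAGE:.2f} s")
print(f"v2 total for test set: {TEST_IMAGES * V2_SECONDS_PER_IMAGE:.2f} s")
```

On these figures the second version is a little under five times faster per image than the first.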

18. 【2603.10893】S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Link: https://arxiv.org/abs/2603.10893

Authors: Yuzhou Ji, Qijian Tian, He Zhu, Xiaoqi Jiang, Guangzhi Cao, Lizhuang Ma, Yuan Xie, Xin Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simulation and understanding, Gaussian Splatting, essential medium, Explicit, sparse

Abstract: Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the two most commonly used representations, point clouds and 3D Gaussian Splatting (3DGS), suffer from non-photorealistic rendering and significant degradation under sparse inputs, respectively. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point clouds for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D-consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.

19. 【2603.10872】Bilevel Layer-Positioning LoRA for Real Image Dehazing

Link: https://arxiv.org/abs/2603.10872

Authors: Yan Zhang, Long Ma, Yuxin Feng, Zhe Huang, Fan Zhou, Zhuo Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved notable progress, Learning-based real image, real haze scenes, diverse real haze, real image dehazing

Comments: Accepted by CVPR 2026

Abstract: Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP's cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which both learns the LoRA parameters and automatically searches for the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods on multiple real-world dehazing benchmarks. The code is publicly available at this https URL.

20. 【2603.10863】Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

Link: https://arxiv.org/abs/2603.10863

Authors: Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, Language Models

Abstract:Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at this https URL.
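
The distance-invariance idea can be illustrated with a toy table of relative offsets: same-modality token pairs keep their natural offset, while cross-modality pairs are pinned to a fixed value so the offset no longer grows with context length. This is only a sketch of the concept; the abstract does not specify how the anchor is chosen or how it enters the rotary encoding, so `ANCHOR_OFFSET` and `relative_offsets` are hypothetical names with assumed behavior.

```python
from typing import List

ANCHOR_OFFSET = 1  # hypothetical fixed distance used for every cross-modality pair


def relative_offsets(modalities: List[str], anchor: int = ANCHOR_OFFSET) -> List[List[int]]:
    """offsets[i][j]: offset a position encoding would see for query i and key j."""
    n = len(modalities)
    return [
        [(i - j) if modalities[i] == modalities[j] else anchor for j in range(n)]
        for i in range(n)
    ]


tokens = ["img", "img", "txt", "txt", "txt"]
offsets = relative_offsets(tokens)
print(offsets[4][2])  # text-to-text pair keeps its natural offset
print(offsets[4][0])  # text-to-image pair is pinned to the anchor
```

With a plain sequential scheme the last text token would sit four positions from the first image token, and that gap would keep growing as more text is generated; under the anchored scheme it stays constant.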

21. 【2603.10852】UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

Link: https://arxiv.org/abs/2603.10852

Authors: Yali Zhu, Kang Zhou, Dingbang Wu, Gaofeng Meng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Breast ultrasound diagnosis, ultrasound diagnosis typically, diagnosis typically proceeds, global lesion localization, Breast ultrasound

Abstract:Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.

22. 【2603.10834】On the Reliability of Cue Conflict and Beyond

Link: https://arxiv.org/abs/2603.10834

Authors: Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: neural networks rely, internal decision processes, Understanding how neural, visual cues offers, neural networks

Comments: Shape-Texture Bias, Cue Conflict Benchmark

Abstract: Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

23. 【2603.10833】Evaluating Few-Shot Pill Recognition Under Visual Domain Shift

Link: https://arxiv.org/abs/2603.10833

Authors: W. I. Chu, G. Tarroni, L. Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adverse drug events, enhance medication safety, Adverse drug, preventable harm, medication safety

Comments: 8 pages, 4 figures. Submitted to IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

Abstract:Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.

24. 【2603.10828】BALD-SAM: Disagreement-based Active Prompting in Interactive Segmentation

Link: https://arxiv.org/abs/2603.10828

Authors: Prithwijit Chowdhury, Mohit Prabhushankar, Ghassan AlRegib

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: revolutionized interactive segmentation, Segment, Segment Anything Model, annotators observe model, revolutionized interactive

Comments

Click to view abstract

Abstract:The Segment Anything Model (SAM) has revolutionized interactive segmentation through spatial prompting. While existing work primarily focuses on automating prompts in various settings, real-world annotation workflows involve iterative refinement where annotators observe model outputs and strategically place prompts to resolve ambiguities. Current pipelines typically rely on the annotator's visual assessment of the predicted mask quality. We postulate that a principled approach for automated interactive prompting is to use a model-derived criterion to identify the most informative region for the next prompt. In this work, we establish active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction. We further present BALD-SAM: a principled framework adapting Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection by quantifying epistemic uncertainty. To do so, we freeze the entire model and apply Bayesian uncertainty modeling only to a small learned prediction head, making intractable uncertainty estimation practical for large multi-million parameter foundation models. Across 16 datasets spanning natural, medical, underwater, and seismic domains, BALD-SAM demonstrates strong cross-domain performance, ranking first or second on 14 of 16 benchmarks. We validate these gains through a comprehensive ablation suite covering 3 SAM backbones and 35 Laplace posterior configurations, amounting to 38 distinct ablation settings. Beyond strong average performance, BALD-SAM surpasses human prompting and, in several categories, even oracle prompting, while consistently outperforming one-shot baselines in final segmentation quality, particularly on thin and structurally complex objects.
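
The BALD criterion named in the abstract scores a location by the mutual information between its prediction and the model parameters: the entropy of the mean prediction minus the mean entropy of the individual predictions. Below is a minimal NumPy sketch of that score applied per pixel. It assumes that a stack of foreground-probability maps from several posterior samples is already available; how the paper obtains them (a Laplace posterior over a small head) is not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Elementwise entropy (in nats) of Bernoulli probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def bald_map(prob_samples):
    """BALD score per pixel.

    prob_samples: array of shape (S, H, W) holding foreground probabilities
    from S posterior samples (assumed input, see the lead-in).
    """
    mean_p = prob_samples.mean(axis=0)
    predictive_entropy = binary_entropy(mean_p)                    # H[E_theta p]
    expected_entropy = binary_entropy(prob_samples).mean(axis=0)   # E_theta H[p]
    return predictive_entropy - expected_entropy


def next_prompt_location(prob_samples):
    """Return the (row, col) with the highest disagreement between samples."""
    scores = bald_map(prob_samples)
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return int(row), int(col)


# Toy example: all samples agree everywhere except at pixel (1, 2).
samples = np.full((4, 3, 3), 0.9)
samples[:, 1, 2] = [0.05, 0.95, 0.05, 0.95]
print(next_prompt_location(samples))
```

Pixels where every sample gives the same probability score close to zero even when that probability is uncertain, so the criterion targets disagreement between samples (epistemic uncertainty) and not ambiguity alone.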

25. 【2603.10825】A dataset of medication images with instance segmentation masks for preventing adverse drug events

Link: https://arxiv.org/abs/2603.10825

Authors: W. I. Chu, S. Hirani, G. Tarroni, L. Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: adverse drug events, pose significant risks, reliably identifying pharmaceuticals, drug events, pose significant

Comments: 25 pages, 19 figures. Submitted to Scientific Data (Nature Portfolio)

Click to view abstract

Abstract:Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset's ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.

26. 【2603.10814】HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation

Link: https://arxiv.org/abs/2603.10814

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Songlian Li, Xiaotong Zhao, Jianbing Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, specific artistic domains, general visual capabilities, impressive general visual, Chinese Artistic Domain

Comments: 14 pages

Click to view abstract

Abstract:While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.

27. 【2603.10806】Backdoor Directions in Vision Transformers

Link: https://arxiv.org/abs/2603.10806

Authors: Sengim Karayalcin, Marina Krcek, Pin-Yu Chen, Stjepan Picek

Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Vision Transformers, paper investigates, Transformers, Backdoor, model backdoor behavior

Comments: 31 pages, 16 figures

Click to view abstract

Abstract:This paper investigates how Backdoor Attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific "trigger direction" in the model's activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model's backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.

28. 【2603.10801】PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Link: https://arxiv.org/abs/2603.10801

Authors: Yufei Han, Chu Zhou, Youwei Lyu, Qi Chen, Si Li, Boxin Shi, Yunpeng Jia, Heng Guo, Zhanyu Ma

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital content creation, real-time virtual reality, Gaussian Splatting, reflective surfaces remains, computer vision

Comments: arXiv admin note: substantial text overlap with [arXiv:2509.19726](https://arxiv.org/abs/2509.19726)

Click to view abstract

Abstract:Accurate reconstruction of reflective surfaces remains a fundamental challenge in computer vision, with broad applications in real-time virtual reality and digital content creation. Although 3D Gaussian Splatting (3DGS) enables efficient novel-view rendering with explicit representations, its performance on reflective surfaces still lags behind implicit neural methods, especially in recovering fine geometry and surface normals. To address this gap, we propose PolGS++, a physically-guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. Specifically, we integrate a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components, providing physically grounded reflectance modeling and stronger geometric cues for reflective surface recovery. Furthermore, we introduce a depth-guided visibility mask acquisition mechanism that enables angle-of-polarization (AoP)-based tangent-space consistency constraints in Gaussian Splatting without costly ray-tracing intersections. This physically guided design improves reconstruction quality and efficiency, requiring only about 10 minutes of training. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of our method.

29. 【2603.10785】The Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis

Link: https://arxiv.org/abs/2603.10785

Authors: Zhinan Xiong, Shunqi Yuan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Tangent Kernel, Flow Matching framework, evolving Neural Tangent, Quadratic Form governed, generative fine-tuning

Comments: 43 pages

Click to view abstract

Abstract:In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model's effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA), using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.

30. 【2603.10782】Phase-Interface Instance Segmentation as a Visual Sensor for Laboratory Process Monitoring

Link: https://arxiv.org/abs/2603.10782

Authors: Mingyue Li, Xin Yang, Shilin Yan, Jinye Ran, Morui Zhu, Zirui Peng, Huanqing Peng, Wei Peng, Guanghua Zhang, Shuo Li, Hao Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: experiments remains challenging, optical artifacts degrade, artifacts degrade conventional, chemical experiments remains, Chemical Transparent Glasses

Comments

Click to view abstract

Abstract:Reliable visual monitoring of chemical experiments remains challenging in transparent glassware, where weak phase boundaries and optical artifacts degrade conventional segmentation. We formulate laboratory phenomena as the time evolution of phase interfaces and introduce the Chemical Transparent Glasses dataset 2.0 (CTG 2.0), a vessel-aware benchmark with 3,668 images, 23 glassware categories, and five multiphase interface types for phase-interface instance segmentation. Building on YOLO11m-seg, we propose LGA-RCM-YOLO, which combines Local-Global Attention (LGA) for robust semantic representation and a Rectangular Self-Calibration Module (RCM) for boundary refinement of thin, elongated interfaces. On CTG 2.0, the proposed model achieves 84.4% AP@0.5 and 58.43% AP@0.5-0.95, improving over the YOLO11m baseline by 6.42 and 8.75 AP points, respectively, while maintaining near real-time inference (13.67 FPS, RTX 3060). An auxiliary color-attribute head further labels liquid instances as colored or colorless with 98.71% precision and 98.32% recall. Finally, we demonstrate continuous process monitoring in separatory-funnel phase separation and crystallization, showing that phase-interface instance segmentation can serve as a practical visual sensor for laboratory automation.

31. 【2603.10781】Taking Shortcuts for Categorical VQA Using Super Neurons

Link: https://arxiv.org/abs/2603.10781

Authors: Pierre Musacchio, Jaeyi Jeong, Dahun Kim, Jaesik Park

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Sparse Attention Vectors, Vision Language Models, excellent training-free alternative, Sparse Attention, Vision Language

Comments: 25 pages, 15 tables, 8 figures

Click to view abstract

Abstract:Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model's prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.

32. 【2603.10780】Guiding Diffusion Models with Semantically Degraded Conditions

Link: https://arxiv.org/abs/2603.10780

Authors: Shilong Han, Yuming Zhang, Hongxia Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: guidance signal prone, semantically vacuous null, vacuous null prompt, cornerstone of modern, geometric entanglement

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $\boldsymbol{c}_{\text{deg}}$. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs $\boldsymbol{c}_{\text{deg}}$ without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control. Code is available at this https URL.
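
The substitution described in the abstract can be written out as a short sketch. The first line is the standard CFG extrapolation; the second replaces the null prompt with the degraded condition, as the abstract states. The symbols $f_\theta$ (the model's noise or velocity prediction), $x_t$ (the current latent), $c$ (the full prompt condition) and $w$ (the guidance scale) are notation assumed here, and the paper's exact formulation may differ.

```latex
% Standard classifier-free guidance: extrapolate away from the null prompt.
\hat{f}_{\mathrm{CFG}}(x_t, c)
  = f_\theta(x_t, \varnothing)
  + w \,\bigl( f_\theta(x_t, c) - f_\theta(x_t, \varnothing) \bigr)

% Condition-degradation guidance (sketch): the degraded condition
% replaces the null prompt as the reference point.
\hat{f}_{\mathrm{CDG}}(x_t, c)
  = f_\theta(x_t, \boldsymbol{c}_{\text{deg}})
  + w \,\bigl( f_\theta(x_t, c) - f_\theta(x_t, \boldsymbol{c}_{\text{deg}}) \bigr)
```

In both forms, $w = 1$ recovers the plain conditional prediction $f_\theta(x_t, c)$; only the reference point that the guidance direction is measured against changes.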

33. 【2603.10757】CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Link: https://arxiv.org/abs/2603.10757

Authors: Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental question arises, fail at Science, question arises, fundamental question, Technology

Comments: Accepted by CVPR2026

Click to view abstract

Abstract:When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at this https URL.

34. 【2603.10748】Event-based Photometric Stereo via Rotating Illumination and Per-Pixel Learning

Link: https://arxiv.org/abs/2603.10748

Authors: Hyunwoo Kim, Won-Hoe Kim, Sanghoon Lee, Jianfei Cai, Giljoo Nam, Jae-Sang Hyun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Photometric stereo, event-based photometric stereo, photometric stereo methods, technique for estimating, images captured

Comments

Click to view abstract

Abstract:Photometric stereo is a technique for estimating surface normals using images captured under varying illumination. However, conventional frame-based photometric stereo methods are limited in real-world applications due to their reliance on controlled lighting and susceptibility to ambient illumination. To address these limitations, we propose an event-based photometric stereo system that leverages an event camera, which is effective in scenarios with continuously varying scene radiance and high dynamic range conditions. Our setup employs a single light source moving along a predefined circular trajectory, eliminating the need for multiple synchronized light sources and enabling a more compact and scalable design. We further introduce a lightweight per-pixel multi-layer neural network that directly predicts surface normals from event signals generated by intensity changes as the light source rotates, without system calibration. Experimental results on benchmark datasets and real-world data collected with our data acquisition system demonstrate the effectiveness of our method, achieving a 7.12% reduction in mean angular error compared to existing event-based photometric stereo methods. In addition, our method demonstrates robustness in regions with sparse event activity, strong ambient illumination, and scenes affected by specularities.
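
Mean angular error, the metric quoted in the abstract, is the average angle between predicted and ground-truth surface normals. The sketch below shows one common way to compute it with NumPy. The function name, the optional mask argument and the toy data are assumptions for illustration and are not taken from the paper; it also assumes that no normal has zero length.

```python
import numpy as np


def mean_angular_error_deg(pred, gt, mask=None):
    """Mean angle in degrees between predicted and ground-truth normals.

    pred, gt: arrays of shape (..., 3) with non-zero normal vectors.
    mask: optional boolean array selecting valid pixels (hypothetical helper argument).
    """
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if mask is not None:
        err = err[mask]
    return float(err.mean())


# Toy example: one perfect normal and one that is off by 90 degrees.
gt_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred_normals = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
mae = mean_angular_error_deg(pred_normals, gt_normals)
print(f"mean angular error: {mae:.1f} degrees")
```

Under this reading, the 7.12% figure in the abstract is a relative reduction of this average, not a difference in degrees.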

35. 【2603.10744】Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers

Link: https://arxiv.org/abs/2603.10744

Authors: Wenhao Sun, Ji Li, Zhaoqiang Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: iterative sampling severely, sampling severely hampers, Diffusion Transformers, high computational cost, Transformers have established

Comments: Accepted by CVPR2026

Click to view abstract

Abstract:Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.

36. 【2603.10724】eLasmobranc Dataset: An Image Dataset for Elasmobranch Species Recognition and Biodiversity Monitoring

Link: https://arxiv.org/abs/2603.10724

Authors: Ismael Beviá-Ballesteros, Mario Jerez-Tallón, Nieves Aranda-Garrido, Isabel Abel-Abellán, Irene Antón-Linares, Jorge Azorín-López, Marcelo Saval-Calvo, Andres Fuster-Guilló, Francisca Giménez-Casalduero

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant global declines, experiencing significant global, global declines, classified as threatened, experiencing significant

Comments: 9 pages, 6 figures, 5 tables. A future extended version of this work will be submitted to Scientific Data

Click to view abstract

Abstract:Elasmobranch populations are experiencing significant global declines, and several species are currently classified as threatened. Reliable monitoring and species-level identification are essential to support conservation and spatial planning initiatives such as Important Shark and Ray Areas (ISRAs). However, existing visual datasets are predominantly detection-oriented, underwater-acquired, or limited to coarse-grained categories, restricting their applicability to fine-grained morphological classification. We present the eLasmobranc Dataset, a curated and publicly available image collection from seven ecologically relevant elasmobranch species inhabiting the eastern Spanish Mediterranean coast, a region where two ISRAs have been identified. Images were obtained through dedicated data collection, including field campaigns and collaborations with local fish markets and projects, as well as from open-access public sources. The dataset was constructed predominantly from images acquired outside the aquatic environment under standardized protocols to ensure clear visualization of diagnostic morphological traits. It integrates expert-validated species annotations, structured spatial and temporal metadata, and complementary species-level information. The eLasmobranc Dataset is specifically designed to support supervised species-level classification, population studies, and the development of artificial intelligence systems for biodiversity monitoring. By combining morphological clarity, taxonomic reliability, and public accessibility, the dataset addresses a critical gap in fine-grained elasmobranch identification and promotes reproducible research in conservation-oriented computer vision. The dataset is publicly available at this https URL.

37. 【2603.10722】UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

Link: https://arxiv.org/abs/2603.10722

Authors: Yu Zhang, Zhicheng Zhao, Ze Luo, Chenglong Li, Jin Tang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: unmanned aerial vehicle, wide-area monitoring capabilities, intelligent transportation systems, transportation systems due, aerial vehicle

Comments

Click to view abstract

Abstract:Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at this https URL.

38. 【2603.10703】WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

Link: https://arxiv.org/abs/2603.10703

Authors: Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: existing Large Vision-Language, Large Vision-Language Models, Ensuring accessible pedestrian, complex urban scenes, pedestrian navigation requires

Comments: Accepted by CVPR-2026

Click to view abstract

Abstract:Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website (this https URL).

39. 【2603.10702】UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

Link: https://arxiv.org/abs/2603.10702

Authors: Yaqi Zhao, Wang Lin, Zijian Zhang, Miles Yang, Jingyuan Chen, Wentao Zhang, Zhao Zhong, Liefeng Bo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: discrete visual tokenizers, Current unified multimodal, modality gap, models typically rely, Current unified

Comments

Click to view abstract

Abstract:Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representation. We empirically demonstrate that reducing channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on VAE.

40. 【2603.10695】RandMark: On Random Watermarking of Visual Foundation Models

Link: https://arxiv.org/abs/2603.10695

Authors: Anna Chistyakova, Mikhail Pautov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: computer vision tasks, achieve remarkable performance, downstream computer vision, visual foundation models, diverse datasets

Abstract:Trained on large and diverse datasets, visual foundation models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The high computational cost of data collection and training makes these models valuable assets, which motivates some VFM owners to distribute them alongside a license to protect their intellectual property rights. In this paper, we propose an approach to ownership verification of visual foundation models that leverages a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. The method is based on random watermark embedding, which makes the watermark statistics detectable in functional copies of the watermarked model. Both theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection for non-watermarked models and a low probability of missed detection for watermarked models.

41. 【2603.10694】Bioinspired CNNs for border completion in occluded images

Link: https://arxiv.org/abs/2603.10694

Authors: Catarina P. Coutinho, Aneeqa Merhab, Janko Petkovic, Ferdinando Zanchetta, Rita Fioresi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: convolutional neural network, border completion problem, design convolutional neural, neural network, filters that enhance

Notes: Submitted for Publication

Abstract:We exploit the mathematical modeling of the border completion problem in the visual cortex to design convolutional neural network (CNN) filters that enhance robustness to image occlusions. We evaluate our CNN architecture, BorderNet, on three occluded datasets (MNIST, Fashion-MNIST, and EMNIST) under two types of occlusions: stripes and grids. In all cases, BorderNet demonstrates improved performance, with gains varying depending on the severity of the occlusions and the dataset.

42. 【2603.10688】MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction

Link: https://arxiv.org/abs/2603.10688

Authors: Jonas Merkert, Alexander Blumberg, Jan-Hendrik Pauls, Christoph Stiller

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous vehicles rely, Autonomous vehicles, vehicles rely, information to understand, understand the world

Abstract:Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. To further reduce the need for annotating vast training labels, self-supervised training provides an alternative. This work focuses on improving the latent bird's-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model in a supervised manner on a reduced set of single-traversal labeled data and in a self-supervised manner on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream task's vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.

43. 【2603.10685】A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

Link: https://arxiv.org/abs/2603.10685

Authors: Huayu Zheng, Guangzhao Li, Baixuan Zhao, Siqi Luo, Hantao Jiang, Guangtao Zhai, Xiaohong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unified inpainting framework, unified inpainting, inpainting framework, users to replace, replace any target

Abstract:We propose A$^2$-Edit, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset UniEdit-500K, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the Mixture of Transformer module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.

44. 【2603.10671】An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS

Link: https://arxiv.org/abs/2603.10671

Authors: Qiyue Chen, Yao Li, Jie Tao, Song Chen, Li Li, Dong Liu

Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Intra Pattern Copy, image compression standard, compression standard designed, Pattern Copy, image compression

Abstract:Recently, progress has been made on the Intra Pattern Copy (IPC) tool for JPEG XS, an image compression standard designed for low-latency and low-complexity coding. IPC performs wavelet-domain intra compensation predictions to reduce spatial redundancy in screen content. A key module of IPC is the displacement vector (DV) search, which aims to find the optimal prediction reference offset. However, the DV search process is computationally intensive, posing challenges for practical hardware deployment. In this paper, we propose an efficient pipelined FPGA architecture design for the DV search module to promote the practical deployment of IPC. An optimized memory organization, which leverages the computational characteristics of IPC and its inherent data reuse patterns, is further introduced to enhance performance. Experimental results show that our proposed architecture achieves a throughput of 38.3 Mpixels/s with a power consumption of 277 mW, demonstrating its feasibility for practical hardware implementation in IPC and other predictive coding tools, and providing a promising foundation for ASIC deployment.

45. 【2603.10658】How To Embed Matters: Evaluation of EO Embedding Design Choices

Link: https://arxiv.org/abs/2603.10658

Authors: Luis Gilch, Isabelle Wittmann, Maximilian Nitsche, Johannes Jakubik, Arne Ewald, Thomas Brunschwiler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large Geospatial Foundation, Geospatial Foundation Models, Geospatial Foundation, missions produce petabytes, Earth observation

Abstract:Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
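
To make the aggregation step concrete, here is a minimal sketch of mean pooling token embeddings into one fixed-size vector and of computing the resulting size reduction. The function names and the example shapes (a 12-band 256x256 patch and a 1024-dimensional embedding) are illustrative assumptions, not values taken from the paper or from NeuCo-Bench.

```python
from typing import List


def mean_pool(tokens: List[List[float]]) -> List[float]:
    """Average a sequence of token embeddings into one fixed-size vector."""
    if not tokens:
        raise ValueError("need at least one token embedding")
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def compression_ratio(raw_value_count: int, embedding_dim: int) -> float:
    """Number of raw input values represented per embedding value."""
    return raw_value_count / embedding_dim


# Hypothetical example: two 2-dimensional token embeddings.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])

# Hypothetical shapes: 12 spectral bands, 256x256 pixels, 1024-dim embedding.
ratio = compression_ratio(12 * 256 * 256, 1024)
```

Under these assumed shapes the pooled embedding holds several hundred times fewer values than the raw patch, which is the kind of reduction the abstract refers to.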

46. 【2603.10652】Are Video Reasoning Models Ready to Go Outside?

Link: https://arxiv.org/abs/2603.10652

Authors: Yangfan He, Changgyu Boo, Jaehong Yoon

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: camera motion, real-world deployment, robustness-aware consistency reward, ROVA, vision-language models

Notes: Project Page: [this https URL](https://robust-video-reason.github.io/)

Abstract:In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.

47. 【2603.10648】Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

Link: https://arxiv.org/abs/2603.10648

Authors: Jeonghyeok Do, Yun Chen, Geunhyuk Youk, Munchurl Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: skeleton-based action representation, action representation learning, landscape of skeleton-based, skeleton-based action, action representation

Notes: Please visit our project page at [this https URL](https://kaist-viclab.github.io/SLiM_site/)

Abstract:The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework with decoder-free masked modeling for representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.

48. 【2603.10638】Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.10638

Authors: Hansol Lim, Jongseong Brad Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Physical AI faces, training and deployment, robustness is essential, faces viewpoint shift, Physical

Abstract:Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N=0 to 2000 additional rendered views, with N unique = 500 and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provide embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.
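
The greedy selection described in the abstract can be illustrated with a small sketch. It assumes that each candidate view is summarized by the set of surface cells it observes (a stand-in for geometry gain) and a scalar extrapolation score; the function `select_views`, its scoring rule, and the toy data are hypothetical and are not the authors' CN-Coverage implementation.

```python
from typing import Dict, List, Set


def select_views(
    candidates: Dict[str, Set[int]],
    novelty: Dict[str, float],
    budget: int,
    penalty_weight: float = 1.0,
) -> List[str]:
    """Greedily pick views by newly covered cells minus a novelty penalty."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Score = cells not yet covered, penalized by extrapolation distance.
        scores = {
            name: len(views - covered) - penalty_weight * novelty[name]
            for name, views in remaining.items()
        }
        best = max(sorted(scores), key=scores.get)
        if scores[best] <= 0:
            break  # no remaining view is worth adding
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Toy data (assumed): view "c" sees many new cells but extrapolates heavily.
views = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6, 7, 8}}
extrapolation = {"a": 0.0, "b": 0.5, "c": 3.8}
picked = select_views(views, extrapolation, budget=2)
```

The point of the sketch is the abstract's claim that which views are added matters more than how many: a heavily extrapolated view can lose to a modest one despite covering more new geometry.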

49. 【2603.10613】MUNIChus: Multilingual News Image Captioning Benchmark

Link: https://arxiv.org/abs/2603.10613

Authors: Yuji Chen, Alistair Plum, Hansi Hettiarachchi, Diptesh Kanojia, Saroj Basnet, Marcos Zampieri, Tharindu Ranasinghe

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: image captioning, image captioning models, image, highlighting the relationship, visual elements

Notes: Accepted to LREC 2026 (The Fifteenth biennial Language Resources and Evaluation Conference)

Abstract:The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.

50. 【2603.10604】HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement

Link: https://arxiv.org/abs/2603.10604

Authors: Stefanos Pasios, Nikos Nikolaidis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision algorithms, Generative Adversarial Network, training computer vision, Realism Generative Adversarial, Enhanced Realism Generative

Notes: 8 pages

Abstract:Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require high computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained using paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, it is illustrated that the proposed hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely with paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: this https URL

51. 【2603.10598】Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Link: https://arxiv.org/abs/2603.10598

Authors: Yawen Yang, Feng Li, Shuqi Kong, Yunfeng Diao, Xinjian Gao, Zenglin Shi, Meng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent rapid advancement, AI-generated synthetic images, rapid advancement, advancement of generative, significantly improved

Abstract:The recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetic images makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as threats to media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction: real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernibly distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences of real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness. The code is available at this https URL

52. 【2603.10584】Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion

Link: https://arxiv.org/abs/2603.10584

Authors: Jakub Gregorek, Paraskevas Pegios, Nando Metzger, Konrad Schindler, Theodora Kontogianni, Lazaros Nalpantidis

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: costly test-time optimization, test-time optimization typically, leverages strong diffusion, strong diffusion priors, late-fusion depth completion

Abstract:We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints. Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels. Page: this https URL

53. 【2603.10583】Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

Link: https://arxiv.org/abs/2603.10583

Authors: Hongsong Wang, Renxi Cheng, Chaolei Han, Jie Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: encounter unprecedented challenges, AIGC technologies, advancement of AIGC, unprecedented challenges, rapid advancement

Notes: To appear in CVPR 2026. Code is at [this https URL](https://github.com/hongsong-wang/LIDA)

Abstract:With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by a Low-Bit Fingerprint Generation module, while training involves Unsupervised Pre-Training followed by Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is at this https URL

54. 【2603.10578】R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Link: https://arxiv.org/abs/2603.10578

Authors: Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Keywords: Immersive Computer Graphics, Immersive Computer, Computer Graphics, modern daily life, daily life

Abstract:Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.

55. 【2603.10568】UniStitch: Unifying Semantic and Geometric Features for Image Stitching

Link: https://arxiv.org/abs/2603.10568

Authors: Yuan Mei, Lang Nie, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: solutions leverage semantic, methods estimate warps, recent learning-based solutions, learning-based solutions leverage, Neural Point Transformer

Notes: Code: [this https URL](https://github.com/MmelodYy/UniStitch)

Abstract:Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely evolved separately, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework from multimodal features. To align discrete geometric features (i.e., keypoints) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods by a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.

56. 【2603.10560】PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLMs for PET/CT Report Impression Generation

Link: https://arxiv.org/abs/2603.10560

Authors: Yuchen Liu, Wenbo Zhang, Liling Peng, Yichi Zhang, Yu Fu, Xin Guo, Chao Qu, Yuan Qi, Le Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: summarizing complex findings, PET, nuclear medicine, imaging is pivotal, pivotal in oncology

Abstract:PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.

57. 【2603.10551】P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video

Link: https://arxiv.org/abs/2603.10551

Authors: Longan Wang, Yuang Shi, Wei Tsang Ooi

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: competitive explicit representation, Gaussian splatting, competitive explicit, Gaussian splatting framework, Gaussian

Notes: MMSys 2026; Project Website: see [this https URL](https://longanwang-cs.github.io/PGSVC-webpage/)

Abstract:Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy improves PSNR by up to 1.9 dB for video and up to 2.6 dB for images compared to methods that perform sequential layer-wise training. Project page: this https URL
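
For readers less familiar with the metric behind these numbers, the following sketch computes PSNR from the standard definition and converts a PSNR gain in dB into the equivalent reduction factor in mean squared error. The helper names and the flat-list pixel representation are assumptions made for illustration; nothing here reproduces the paper's evaluation code.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction))
    mse /= len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def mse_reduction_factor(psnr_gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (psnr_gain_db / 10.0)


# Worst case for 8-bit pixels: every pixel is off by the full range.
worst = psnr([0.0, 0.0, 0.0, 0.0], [255.0, 255.0, 255.0, 255.0])

# MSE reduction factors implied by the gains quoted in the abstract.
video_factor = mse_reduction_factor(1.9)
image_factor = mse_reduction_factor(2.6)
```

Because PSNR is logarithmic, the quoted gains of 1.9 dB and 2.6 dB correspond to reducing the mean squared error by factors of roughly 1.5 and 1.8, respectively.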

58. 【2603.10549】Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues

Link: https://arxiv.org/abs/2603.10549

Authors: Mohammed Salah, Eman Ouda, Giuseppe Dell'Avvocato, Fabrizio Sarasini, Ester D'Accardi, Jorge Dias, Davor Svetinovic, Stefano Sfarra, Yusra Abdulrahman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Keywords: Active infrared thermography, carbon fiber-reinforced polymers, high performance carbon, performance carbon fiber-reinforced, Active infrared

Abstract:Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high-performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time-consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors. Instead, it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and the natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs: GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM Adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
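
As a reminder of what the reported intersection-over-union figure measures, here is a minimal sketch of IoU for axis-aligned bounding boxes. It assumes detections are boxes in `(x1, y1, x2, y2)` form; whether the paper scores boxes or masks is not stated in the abstract, and the function name and example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2, y1 < y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth defect box and a prediction covering 70% of it.
ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (0.0, 0.0, 10.0, 7.0)
iou = box_iou(ground_truth, prediction)
```

In this toy case the prediction lies entirely inside the ground-truth box and covers 70 of its 100 area units, so the IoU is 0.7.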

59. 【2603.10541】Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

Link: https://arxiv.org/abs/2603.10541

Authors: Caroline Magg, Maaike A. ter Wee, Johannes G.G. Dobbe, Geert J. Streekstra, Leendert Blankevoort, Clara I. Sánchez, Hoel Kervadec

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: revolutionized medical image, Promptable Foundation Models, natural image segmentation, medical image segmentation, Promptable Foundation

Comments

Abstract:Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: this https URL

60. 【2603.10538】DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime

Link: https://arxiv.org/abs/2603.10538

Authors: Julian Lorenz, Vladyslav Kovganko, Elias Kohout, Mrunmai Phatak, Daniel Kienzle, Rainer Lienhart

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: holds significant promise, robust intermediate step, complex downstream tasks, detailed graph structure, Scene Graph Generation

Comments: Accepted at CVPR 2026

Abstract:Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.

61. 【2603.10526】Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis

Link: https://arxiv.org/abs/2603.10526

Authors: Pei Liu, Xiangxiang Zeng, Tengfei Ma, Yucheng Xing, Xuanbai Ren, Yiping Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Whole-Slide Images, estimating the prognosis, Images, Task Vector, Task Vector Mixup

Comments: Accepted to CVPR 2026

Abstract:Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference. Code is publicly available at this https URL.

62. 【2603.10519】Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

Link: https://arxiv.org/abs/2603.10519

Authors: Xin Huang, Junjie Liang, Qingshan Hou, Peng Cao, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical image synthesis, alleviating data scarcity, Medical image, privacy constraints, image synthesis

Comments: 10 pages, 7 figures. Currently under review

Abstract:Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results in three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at this https URL.

63. 【2603.10517】UHD Image Deblurring via Autoregressive Flow with Ill-conditioned Constraints

Link: https://arxiv.org/abs/2603.10517

Authors: Yucheng Xin, Dawei Zhao, Xiang Chen, Chen Wu, Pu Wang, Dianjie Lu, Guijuan Zhang, Xiuyi Jia, Zhuoran Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: poses significant challenges, UHD image deblurring, deblurring poses significant, practical inference efficiency, balance fine-grained detail

Comments: Submitted to ECCV 2026

Abstract:Ultra-high-definition (UHD) image deblurring poses significant challenges for UHD restoration methods, which must balance fine-grained detail recovery and practical inference efficiency. Although prominent discriminative and generative methods have achieved remarkable results, a trade-off persists between computational cost and the ability to generate fine-grained detail for UHD image deblurring tasks. To further alleviate these issues, we propose a novel autoregressive flow method for UHD image deblurring with an ill-conditioned constraint. Our core idea is to decompose UHD restoration into a progressive, coarse-to-fine process: at each scale, the sharp estimate is formed by upsampling the previous-scale result and adding a current-scale residual, enabling stable, stage-wise refinement from low to high resolution. We further introduce Flow Matching to model residual generation as a conditional vector field and perform few-step ODE sampling with efficient Euler/Heun solvers, enriching details while keeping inference affordable. Since multi-step generation at UHD can be numerically unstable, we propose an ill-conditioning suppression scheme by imposing condition-number regularization on a feature-induced attention matrix, improving convergence and cross-scale consistency. Our method demonstrates promising performance on blurred images at 4K (3840×2160) or higher resolutions.
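The few-step Euler/Heun ODE sampling mentioned above can be illustrated on a toy problem. This is a minimal sketch, not the paper's sampler: the learned conditional vector field is replaced by a hypothetical scalar field `toy_field` with a known exact solution, so that the two solvers can be compared at the same step count.

```python
import math


def euler_step(v, x, t, dt):
    """One explicit Euler step for dx/dt = v(x, t)."""
    return x + dt * v(x, t)


def heun_step(v, x, t, dt):
    """One Heun step: Euler predictor followed by a trapezoidal corrector."""
    k1 = v(x, t)
    k2 = v(x + dt * k1, t + dt)
    return x + 0.5 * dt * (k1 + k2)


def integrate(v, x0, steps, step_fn):
    """Integrate from t = 0 to t = 1 in `steps` equal steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = step_fn(v, x, i * dt, dt)
    return x


def toy_field(x, t):
    # Hypothetical stand-in for a learned vector field; exact solution is exp(-t).
    return -x


exact = math.exp(-1.0)
euler_4 = integrate(toy_field, 1.0, 4, euler_step)
heun_4 = integrate(toy_field, 1.0, 4, heun_step)
print(abs(euler_4 - exact), abs(heun_4 - exact))
```

Heun evaluates the field twice per step, so on this toy problem it trades extra evaluations for a smaller error at the same number of steps.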

64. 【2603.10504】Naïve Exposure of Generative AI Capabilities Undermines Deepfake Detection

Link: https://arxiv.org/abs/2603.10504

Authors: Sunpill Kim, Chanwoo Hwang, Minsu Kim, Jae Hong Seo

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly expose powerful, systems increasingly expose, Generative AI systems, expose powerful reasoning, user-facing chatbot interfaces

Comments

Abstract:Generative AI systems increasingly expose powerful reasoning and image refinement capabilities through user-facing chatbot interfaces. In this work, we show that the naïve exposure of such capabilities fundamentally undermines modern deepfake detectors. Rather than proposing a new image manipulation technique, we study a realistic and already-deployed usage scenario in which an adversary uses only benign, policy-compliant prompts and commercial generative AI systems. We demonstrate that state-of-the-art deepfake detection methods fail under semantic-preserving image refinement. Specifically, we show that generative AI systems articulate explicit authenticity criteria and inadvertently externalize them through unrestricted reasoning, enabling their direct reuse as refinement objectives. As a result, refined images simultaneously evade detection, preserve identity as verified by commercial face recognition APIs, and exhibit substantially higher perceptual quality. Importantly, we find that widely accessible commercial chatbot services pose a significantly greater security risk than open-source models, as their superior realism, semantic controllability, and low-barrier interfaces enable effective evasion by non-expert users. Our findings reveal a structural mismatch between the threat models assumed by current detection frameworks and the actual capabilities of real-world generative AI. While detection baselines are largely shaped by prior benchmarks, deployed systems expose unrestricted authenticity reasoning and refinement despite stringent safety controls in other domains.

65. 【2603.10495】IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

Link: https://arxiv.org/abs/2603.10495

Authors: Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhan, Jiahui Yang, Can Ma, Yu Zhou, Zhenbo Luo, Jian Luan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: In-Image Machine Translation, original visual context, convert text embedded, In-Image Machine, Machine Translation Benchmark

Comments

Abstract:End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.

66. 【2603.10487】Spatial self-supervised Peak Learning and correlation-based Evaluation of peak picking in Mass Spectrometry Imaging

Link: https://arxiv.org/abs/2603.10487

Authors: Philipp Weigand, Nikolas Ebert, Shad A. Mohammed, Denis Abu Sammour, Carsten Hopf, Oliver Wasenmüller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mass spectrometry imaging, enables label-free visualization, preserving meaningful biological, Mass spectrometry, reduce data size

Comments

Abstract:Mass spectrometry imaging (MSI) enables label-free visualization of molecular distributions across tissue samples but generates large and complex datasets that require effective peak picking to reduce data size while preserving meaningful biological information. Existing peak picking approaches perform inconsistently across heterogeneous datasets, and their evaluation is often limited to synthetic data or manually selected ion images that do not fully represent real-world challenges in MSI. To address these limitations, we propose an autoencoder-based spatial self-supervised peak learning neural network that selects spatially structured peaks by learning an attention mask leveraging both spatial and spectral information. We further introduce an evaluation procedure based on expert-annotated segmentation masks, allowing a more representative and spatially grounded assessment of peak picking performance. We evaluate our approach on four diverse public MSI datasets using our proposed evaluation procedure. Our approach consistently outperforms state-of-the-art peak picking methods by selecting spatially structured peaks, thus demonstrating its efficacy. These results highlight the value of our spatial self-supervised network in comparison to contemporary state-of-the-art methods. The evaluation procedure can be readily applied to new MSI datasets, thereby providing a consistent and robust framework for the comparison of spatially structured peak picking methods across different datasets.

67. 【2603.10484】StructDamage: A Large Scale Unified Crack and Surface Defect Dataset for Robust Structural Damage Detection

Link: https://arxiv.org/abs/2603.10484

Authors: Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated detection, infrastructure maintenance, civil engineering, heritage preservation, critical challenge

Comments

Abstract:Automated detection and classification of structural cracks and surface defects is a critical challenge in civil engineering, infrastructure maintenance, and heritage preservation. Recent advances in Computer Vision (CV) and Deep Learning (DL) have significantly improved automatic crack detection. However, these methods rely heavily on large, diverse, and carefully curated datasets that include various crack types across different surface materials. Many existing public crack datasets lack geographic diversity, surface types, scale, and labeling consistency, making it challenging for trained algorithms to generalize effectively in real world conditions. We provide a novel dataset, StructDamage, a curated collection of approximately 78,093 images spanning nine surface types: walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was constructed by systematically aggregating, harmonizing, and reannotating images from 32 publicly available datasets covering concrete structures, asphalt pavements, masonry walls, bridges, and historic buildings. All images are organized in a folder level classification hierarchy suitable for training Convolutional Neural Networks (CNNs) and Vision Transformers. To highlight the practical value of the dataset, we present baseline classification results using fifteen DL architectures from six model families, with twelve achieving macro F1-scores over 0.96. The best performing model DenseNet201 achieves 98.62% accuracy. The proposed dataset provides a comprehensive and versatile resource suitable for classification tasks. With thorough documentation and a standard structure, it is designed to promote reproducible research and support the development and fair evaluation of robust crack damage detection approaches.
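The macro F1-score reported above is the unweighted mean of per-class F1 scores, so rare surface types count as much as common ones. The following is a minimal sketch of that metric; the function name `macro_f1` and the toy labels are hypothetical and are not drawn from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for four images.
truth = ["wall", "wall", "tile", "tile"]
preds = ["wall", "tile", "tile", "tile"]
score = macro_f1(truth, preds)
```

Here the per-class scores are 2/3 for `wall` and 4/5 for `tile`, and the macro average weights them equally.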

68. 【2603.10470】Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

Link: https://arxiv.org/abs/2603.10470

Authors: Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision-language models, unfaithful outputs misaligned, achieve strong performance, frequently generate hallucinations, vision-language models

Comments: CVPR 2026

Abstract:While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations -- unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at this https URL.
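The inference-phase idea of projecting hidden states away from a subspace can be sketched with plain linear algebra. This is an illustration only, not the CIPHER implementation: the hallucination directions are assumed to be already given (the abstract does not state how they are extracted), and the helper names `orthonormalize` and `project_away` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_away(hidden, basis):
    """Remove the component of `hidden` inside span(basis): h - U U^T h."""
    out = list(hidden)
    for b in basis:
        c = dot(hidden, b)  # valid because the basis is orthonormal
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical 3-dimensional hidden state and two subspace directions.
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(directions)
cleaned = project_away([2.0, 3.0, 4.0], basis)
```

After projection, the cleaned vector is orthogonal to every direction in the subspace, while components outside it are left unchanged.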

69. 【2603.10466】UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations

Link: https://arxiv.org/abs/2603.10466

Authors: Dengdi Sun, Jie Chen, Xiao Wang, Jin Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Physics-Informed Neural Networks, incompressible Navier-Stokes equations, Neural Networks, solving incompressible Navier-Stokes, Physics-Informed Neural

Comments

Abstract:Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on this https URL

70. 【2603.10465】MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Link: https://arxiv.org/abs/2603.10465

Authors: Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, Adarsh Kowdle

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Extended Reality, social engagement due, compromising both scene, entangled sound sources, scene awareness

Comments

Abstract:In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.

71. 【2603.10463】Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

Link: https://arxiv.org/abs/2603.10463

Authors: Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires abundant world, abundant world knowledge, complex reasoning abilities, requires abundant, identifying the geographic

Comments

Abstract:Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Though advanced large multimodal models (LMMs) have shown superior aforementioned capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce WanderBench, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose GeoAoT (Action of Thought), a Geolocation framework with Action of Thought, which couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.

72. 【2603.10456】LCAMV: High-Accuracy 3D Reconstruction of Color-Varying Objects Using LCA Correction and Minimum-Variance Fusion in Structured Light

Link: https://arxiv.org/abs/2603.10456

Authors: Wonbeen Oh, Jae-Sang Hyun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lateral chromatic aberration, RGB channels, characteristics across RGB, uneven noise characteristics, lateral chromatic

Comments

Abstract:Accurate 3D reconstruction of colored objects with structured light (SL) is hindered by lateral chromatic aberration (LCA) in optical components and uneven noise characteristics across RGB channels. This paper introduces lateral chromatic aberration correction and minimum-variance fusion (LCAMV), a robust 3D reconstruction method that operates with a single projector-camera pair without additional hardware or acquisition constraints. LCAMV analytically models and pixel-wise compensates LCA in both the projector and camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation. Unlike existing methods that require extra hardware or multiple exposures, LCAMV enables fast acquisition. Experiments on planar and non-planar colored surfaces show that LCAMV outperforms grayscale conversion and conventional channel-weighting, reducing depth error by up to 43.6%. These results establish LCAMV as an effective solution for high-precision 3D reconstruction of nonuniformly colored objects.
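Minimum-variance fusion of several noisy estimates is commonly done by inverse-variance weighting, which the sketch below illustrates for a single pixel. It is not the authors' implementation: the per-channel variances are assumed to be known and the phases already unwrapped, and the function name `fuse_min_variance` and the sample numbers are hypothetical.

```python
def fuse_min_variance(values, variances):
    """Fuse independent estimates of one quantity by inverse-variance weighting.

    For independent, unbiased estimates, weights proportional to 1/variance
    give the linear combination with the smallest variance.
    """
    if not values or len(values) != len(variances):
        raise ValueError("need exactly one variance per value")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, values)) / total
    fused_variance = 1.0 / total
    return fused, fused_variance


# Hypothetical unwrapped phase estimates (radians) from the R, G, B channels.
phases = [1.02, 0.98, 1.10]
variances = [0.04, 0.01, 0.09]
fused_phase, fused_var = fuse_min_variance(phases, variances)
```

The fused variance is never larger than that of the best single channel, which is why a noisy channel can still contribute without degrading the result.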

73. 【2603.10446】SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning

Link: https://arxiv.org/abs/2603.10446

Authors: Jianhe Low, Alexandre Symeonidis-Herzig, Maksym Ivashechkin, Ozge Mercanoglu Sincan, Richard Bowden

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: language avatars remains, Generating natural, linguistically accurate sign, Sign Language Production, accurate sign language

Comments

Abstract:Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation also uniquely unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our adopted reconstruction-based CFM objective also enables high-fidelity synthesis in fewer than ten sampling steps; this allows SignSparK to scale across four distinct sign languages, establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state-of-the-art across diverse SLP tasks and multilingual benchmarks.

74. 【2603.10445】Unlearning the Unpromptable: Prompt-free Instance Unlearning in Diffusion Models

Link: https://arxiv.org/abs/2603.10445

Authors: Kyungryeol Lee, Kyeonghyun Lee, Seongmin Hong, Byung Hyun Lee, Se Young Chun

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Machine unlearning aims, text prompts, Machine unlearning, concept level, aims to remove

Comments: 12 pages

Abstract:Machine unlearning aims to remove specific outputs from trained models, often at the concept level, such as forgetting all occurrences of a particular celebrity or filtering content via text prompts. However, many undesired outputs, such as an individual's face or generations culturally or factually misinterpreted, cannot often be specified by text prompts. We address this underexplored setting of instance unlearning for outputs that are undesired but unpromptable, where the goal is to forget target outputs selectively while preserving the rest. To this end, we introduce an effective surrogate-based unlearning method that leverages image editing, timestep-aware weighting, and gradient surgery to guide trained diffusion models toward forgetting specific outputs. Experiments on conditional (Stable Diffusion 3) and unconditional (DDPM-CelebA) diffusion models demonstrate that our prompt-free method uniquely unlearns unpromptable outputs, such as faces and culturally inaccurate depictions, with preserved integrity, unlike prompt-based and prompt-free baselines. Our proposed method would serve as a practical hotfix for diffusion model providers to ensure privacy protection and ethical compliance.

75. 【2603.10438】AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory

Link: https://arxiv.org/abs/2603.10438

Authors: Lianjie Ma, Yuquan Li, Bingzheng Jiang, Ziming Zhong, Han Ding, Lijun Zhu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: monocular depth estimation, depth estimation offers, foundation model, estimation offers, offers a viable

Comments: 8 pages, 5 figures, 5 tables

Abstract:Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods perform independent per-frame inference, wasting the substantial computational redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system consisting of a foundation model and a lightweight model that amortizes the foundation model's computational cost over time. The foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground, fusing cached memory with current observations through complementary fusion, outputting depth estimates, and autoregressively updating the memory. This enables cross-frame feature reuse with bounded accuracy degradation. At a mere 3.83M parameters, it operates at 237 FPS on an RTX 4090, recovering 77% of the accuracy gap to the foundation model while achieving a 25X parameter reduction. Validated across indoor static, dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161FPS on a Jetson AGX Orin with TensorRT, clearly demonstrating its feasibility for real-time edge deployment.

76. 【2603.10422】World2Act: Latent Action Post-Training via Skill-Compositional World Models

Link: https://arxiv.org/abs/2603.10422

Authors: An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: World Models, promising approach, World, Models, making policies sensitive

Comments: Project page: [this https URL](https://wm2act.github.io/)

Abstract:World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.

77. 【2603.10418】TractoRC: A Unified Probabilistic Learning Framework for Joint Tractography Registration and Clustering

Link: https://arxiv.org/abs/2603.10418

Authors: Yijie Li, Xi Zhu, Junyi Wang, Ye Wu, Lauren J. O'Donnell, Fan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion MRI tractography, Diffusion MRI, MRI tractography enables, MRI tractography, white matter

Comments: 11 pages, 3 figures

Abstract:Diffusion MRI tractography enables in vivo reconstruction of white matter (WM) pathways. Two key tasks in tractography analysis include: 1) tractogram registration that aligns streamlines across individuals, and 2) streamline clustering that groups streamlines into compact fiber bundles. Although both tasks share the goal of capturing geometrically similar structures to characterize consistent WM organization, they are typically performed independently. In this work, we propose TractoRC, a unified probabilistic framework that jointly performs tractogram registration and streamline clustering within a single optimization scheme, enabling the two tasks to leverage complementary information. TractoRC learns a latent embedding space for streamline points, which serves as a shared representation for both tasks. Within this space, both tasks are formulated as probabilistic inference over structural representations: registration learns the distribution of anatomical landmarks as probabilistic keypoints to align tractograms across subjects, and clustering learns streamline structural prototypes that capture geometric similarity to form coherent streamline clusters. To support effective learning of this shared space, we introduce a transformation-equivariant self-supervised strategy to learn geometry-aware and transformation-invariant embeddings. Experiments demonstrate that jointly optimizing registration and clustering significantly improves performance in both tasks over state-of-the-art methods that treat them independently. Code will be made publicly available at this https URL .

78. 【2603.10417】Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising

Link: https://arxiv.org/abs/2603.10417

Authors: Mingjie Ji, Zhan Shi, Kailai Zhou, Zixuan Fu, Xun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: typically extend image-based, extend image-based frameworks, denoising methods typically, methods typically extend, Self-supervised video denoising

Comments: none

Abstract:Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) require noise independence by masking the center pixel; this constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.

79. 【2603.10408】Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

Link: https://arxiv.org/abs/2603.10408

Authors: Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Ying-cong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling precise controllability, high visual quality, achieving high visual, rigorous physical consistency

Comments: [this https URL](https://tianshuo-xu.github.io/Motion-Forcing/)

Abstract:The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce Motion Forcing, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (Point), expanding them into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendering high-fidelity textures (Appearance). Furthermore, to foster robust physical understanding, we employ a Masked Point Recovery strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework's generality.

80. 【2603.10398】Multi-Person Pose Estimation Evaluation Using Optimal Transportation and Improved Pose Matching

Link: https://arxiv.org/abs/2603.10398

Authors: Takato Moriki, Hiromu Taketsugu, Norimichi Ukita

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-Person Pose Estimation, metrics place importance, Pose Estimation, false-positive poses, pose detection confidence

Comments: 8 pages, 10 figures. Accepted at MVA 2025

Abstract:In Multi-Person Pose Estimation, many metrics place importance on ranking of pose detection confidence scores. Current metrics tend to disregard false-positive poses with low confidence, focusing primarily on a larger number of high-confidence poses. Consequently, these metrics may yield high scores even when many false-positive poses with low confidence are detected. For fair evaluation taking into account a tradeoff between true-positive and false-positive poses, this paper proposes Optimal Correction Cost for pose (OCpose), which evaluates detected poses against pose annotations as an optimal transportation. For the fair tradeoff between true-positive and false-positive poses, OCpose equally evaluates all the detected poses regardless of their confidence scores. In OCpose, on the other hand, the confidence score of each pose is utilized to improve the reliability of matching scores between the estimated pose and pose annotations. As a result, OCpose provides a different perspective assessment than other confidence ranking-based metrics.

81. 【2603.10391】Variance-Aware Adaptive Weighting for Diffusion Model Training

Link: https://arxiv.org/abs/2603.10391

Authors: Nanlong Sun, Lei Shi

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remain highly imbalanced, unstable learning behavior, recently achieved remarkable, achieved remarkable success, levels remain highly

Comments: 15 pages, 8 figures, 1 table

Abstract:Diffusion models have recently achieved remarkable success in generative modeling, yet their training dynamics across different noise levels remain highly imbalanced, which can lead to inefficient optimization and unstable learning behavior. In this work, we investigate this imbalance from the perspective of loss variance across log-SNR levels and propose a variance-aware adaptive weighting strategy to address it. The proposed approach dynamically adjusts training weights based on the observed variance distribution, encouraging a more balanced optimization process across noise levels. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate that the proposed method consistently improves generative performance over standard training schemes, achieving lower Fréchet Inception Distance (FID) while also reducing performance variance across random seeds. Additional analysis, including loss-log-SNR visualization, variance heatmaps, and ablation studies, further reveal that the adaptive weighting effectively stabilizes training dynamics. These results highlight the potential of variance-aware training strategies for improving diffusion model optimization.
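The abstract above does not spell out its weighting rule, so the following is only a minimal sketch of one way to derive training weights from loss variance across log-SNR levels. The binning scheme, the `1 / (1 + variance)` rule, and all names (`VarianceAwareWeights`, `lo`, `hi`, `num_bins`) are assumptions made for illustration, not the authors' method.

```python
class VarianceAwareWeights:
    """Tracks per-bin loss variance over log-SNR and derives loss weights.

    Hypothetical helper: bin edges and the weighting rule are assumptions.
    """

    def __init__(self, lo=-10.0, hi=10.0, num_bins=8):
        self.lo, self.hi, self.num_bins = lo, hi, num_bins
        # Welford accumulators per bin: count, running mean, sum of squared deviations.
        self.count = [0] * num_bins
        self.mean = [0.0] * num_bins
        self.m2 = [0.0] * num_bins

    def bin_index(self, log_snr):
        # Map a log-SNR value to a bin, clamping values outside [lo, hi).
        frac = (log_snr - self.lo) / (self.hi - self.lo)
        return min(self.num_bins - 1, max(0, int(frac * self.num_bins)))

    def update(self, log_snr, loss):
        # Online (Welford) update of the mean and variance of the loss in this bin.
        b = self.bin_index(log_snr)
        self.count[b] += 1
        delta = loss - self.mean[b]
        self.mean[b] += delta / self.count[b]
        self.m2[b] += delta * (loss - self.mean[b])

    def variance(self, b):
        # Sample variance; bins with fewer than two observations report 0.
        return self.m2[b] / (self.count[b] - 1) if self.count[b] > 1 else 0.0

    def weights(self):
        # Assumed rule: down-weight high-variance bins, then rescale to mean 1.
        raw = [1.0 / (1.0 + self.variance(b)) for b in range(self.num_bins)]
        mean_raw = sum(raw) / len(raw)
        return [r / mean_raw for r in raw]
```

In a training loop, the per-sample loss would be multiplied by `weights()[bin_index(log_snr)]`, with `update` called on the unweighted losses.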

82. 【2603.10370】GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

Link: https://arxiv.org/abs/2603.10370

Authors: Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: artificial superintelligence requires, superintelligence requires rich, Multimodal Large Language, Advancing towards artificial, Large Language Models

Comments: none

Abstract:Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model's latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.

83. 【2603.10365】Geometric Autoencoder for Diffusion Models

Link: https://arxiv.org/abs/2603.10365

Authors: Hangyu Liu, Jianyong Wang, Yutao Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-resolution visual generation, Vision Foundation Model, Integrating Vision Foundation, visual generation, Foundation Model priors

Comments: Code and models are publicly available at [this https URL](https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models)

Abstract:Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at this https URL.

84. 【2603.10360】One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination

Link: https://arxiv.org/abs/2603.10360

Authors: Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current training-free methods, methods tackle MLLM, tackle MLLM hallucination, Current training-free, suppressing text inertia

Comments: 10 pages

Abstract:Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.

85. 【2603.10354】StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References

Link: https://arxiv.org/abs/2603.10354

Authors: Boyu He (1), Yunfan Ye (2), Chang Liu (1), Weishang Wu (1), Fang Liu (2), Zhiping Cai (1) ((1) College of Computer Science and Technology, National University of Defense Technology; (2) School of Design, Hunan University)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rigid feature associations, associations lacking adaptive, lacking adaptive global-local, causing uncontrollable stylization, feature associations lacking

Comments: 10 pages, 23 figures, Conference on Computer Vision and Pattern Recognition 2026

Abstract:Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.

86. 【2603.10349】EmoStory: Emotion-Aware Story Generation

Link: https://arxiv.org/abs/2603.10349

Authors: Jingyuan Yang, Rucong Chen, Hui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce image sequences, depict coherent narratives, produce image, image sequences, sequences that depict

Comments: none

Abstract:Story generation aims to produce image sequences that depict coherent narratives while maintaining subject consistency across frames. Although existing methods have excelled in producing coherent and expressive stories, they remain largely emotion-neutral, focusing on what subject appears in a story while overlooking how emotions shape narrative interpretation and visual presentation. As stories are intended to engage audiences emotionally, we introduce emotion-aware story generation, a new task that aims to generate subject-consistent visual stories with explicit emotional directions. This task is challenging due to the abstract nature of emotions, which must be grounded in concrete visual elements and consistently expressed across a narrative through visual composition. To address these challenges, we propose EmoStory, a two-stage framework that integrates agent-based story planning and region-aware story generation. The planning stage transforms target emotions into coherent story prompts with emotion agent and writer agent, while the generation stage preserves subject consistency and injects emotion-related elements through region-aware composition. We evaluate EmoStory on a newly constructed dataset covering 25 subjects and 600 emotional stories. Extensive quantitative and qualitative results, along with user studies, show that EmoStory outperforms state-of-the-art story generation methods in emotion accuracy, prompt alignment, and subject consistency.

87. 【2603.10340】Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation

Link: https://arxiv.org/abs/2603.10340

Authors: Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

Keywords: impressive zero-shot generalization, models demonstrate impressive, demonstrate impressive zero-shot, Precision-Reasoning Gap, impressive zero-shot

Comments: 7 pages, 4 figures, 3 tables

Abstract:Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process--combining cross-validation and spatial disambiguation--to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in the clutter.

88. 【2603.10335】Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models

Link: https://arxiv.org/abs/2603.10335

Authors: Yuedong Yang, Xiwen Wei, Mustafa Munir, Radu Marculescu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multi-modality Models, Reasoning Large Multi-modality, Large Multi-modality, Multi-modality Models, Fuel Gauge

Comments: none

Abstract:Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.

89. 【2603.10323】The Orthogonal Vulnerabilities of Generative AI Watermarks: A Comparative Empirical Benchmark of Spatial and Latent Provenance

Link: https://arxiv.org/abs/2603.10323

Authors: Jesse Yu, Nicholas Wei

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesize hyper-realistic media, introduced profound challenges, rapidly proliferates, ability to synthesize, synthesize hyper-realistic

Comments: none

Abstract:As open-weights generative AI rapidly proliferates, the ability to synthesize hyper-realistic media has introduced profound challenges to digital trust. Automated disinformation and AI-generated imagery have made robust digital provenance a critical cybersecurity imperative. Currently, state-of-the-art invisible watermarks operate within one of two primary mathematical manifolds: the spatial domain (post-generation pixel embedding) or the latent domain (pre-generation frequency embedding). While existing literature frequently evaluates these models against isolated, classical distortions, there is a critical lack of rigorous, comparative benchmarking against modern generative AI editing tools. In this study, we empirically evaluate two leading representative paradigms, RivaGAN (Spatial) and Tree-Ring (Latent), utilizing an automated Attack Simulation Engine across 30 intensity intervals of geometric and generative perturbations. We formalize an "Adversarial Evasion Region" (AER) framework to measure cryptographic degradation against semantic visual retention (OpenCLIP 70.0). Our statistical analysis ($n=100$ per interval, $MOE = \pm 3.92\%$) reveals that these domains possess mutually exclusive, mathematically orthogonal vulnerabilities. Spatial watermarks experience severe cryptographic degradation under algorithmic pixel-rewriting (exhibiting a 67.47% AER evasion rate under Img2Img translation), whereas latent watermarks exhibit profound fragility against geometric misalignment (yielding a 43.20% AER evasion rate under static cropping). By proving that single-domain watermarking is fundamentally insufficient against modern adversarial toolsets, this research exposes a systemic vulnerability in current digital provenance standards and establishes the foundational exigence for future multi-domain cryptographic architectures.

90. 【2603.10300】From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

Link: https://arxiv.org/abs/2603.10300

Authors: Ke Zhang, Xiangchen Zhao, Yunjie Tian, Jiayu Zheng, Vishal M. Patel, Di Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional video classification, homogeneous data distributions, Conventional video, acting as effective, effective imitators

Comments: 18 pages, 7 figures

Abstract:Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at this https URL.

91. 【2603.10281】Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework

Link: https://arxiv.org/abs/2603.10281

Authors: Rajesh Shrestha, Xiao Fu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: ADMM remains nontrivial, score-based generative models, directly integrating, remains nontrivial, generative models

Comments: none

Abstract:While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point $\textit{ball convergence}$ using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.
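For readers unfamiliar with plug-and-play ADMM, the generic loop that the abstract above builds on can be sketched as follows. This is a toy illustration under stated assumptions: the data term is a simple quadratic `0.5 * ||x - y||^2`, and the denoiser is a 3-tap moving average standing in for a learned score-based denoiser. It does not implement the paper's AC-DC stages, and the names `pnp_admm`, `moving_average`, and `total_variation` are hypothetical.

```python
def moving_average(v):
    """Toy 3-tap smoother with edge replication; a stand-in for a learned denoiser."""
    n = len(v)
    return [(v[max(i - 1, 0)] + v[i] + v[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def pnp_admm(y, denoiser, rho=1.0, iters=20):
    """Plug-and-play ADMM for the data term 0.5 * ||x - y||^2 (assumed for illustration)."""
    z = list(y)            # auxiliary (denoised) variable
    u = [0.0] * len(y)     # scaled dual variable
    for _ in range(iters):
        # x-update: closed-form minimizer of 0.5*(x - y)^2 + (rho/2)*(x - z + u)^2
        x = [(yi + rho * (zi - ui)) / (1.0 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the prior's proximal step is replaced by the plugged-in denoiser
        z = denoiser([xi + ui for xi, ui in zip(x, u)])
        # dual update: accumulate the constraint residual x - z
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z


def total_variation(v):
    """Sum of absolute differences between neighbors; a crude roughness measure."""
    return sum(abs(b - a) for a, b in zip(v, v[1:]))
```

Calling `pnp_admm([0.0, 1.0, 0.0, 1.0, 0.0, 1.0], moving_average)` returns a signal of the same length with lower total variation than the input. Swapping in a different `denoiser` callable changes only the z-update, which is the step the abstract's three-stage denoiser would occupy.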

92. 【2603.10267】A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

Link: https://arxiv.org/abs/2603.10267

Authors: Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automatic License Plate, Bangla License Plate, License Plate Recognition, License Plate, Bangla license plates

Comments: Accepted at the 2026 IEEE International Conference on AI and Data Analytics (ICAD 2026). Final version will appear in IEEE Xplore

Abstract:An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is phrased as a sequence generation problem with a VisionEncoderDecoder architecture, with a combination of encoder-decoders evaluated. It was demonstrated that the ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and Word Error Rate of 0.1068. The proposed system also shows a consistent performance when tested on an external dataset that has been curated for this study purpose. The dataset offers completely different environment and lighting conditions compared to the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
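The Character Error Rate and Word Error Rate quoted in the abstract above are conventionally computed as an edit distance divided by the reference length. The sketch below shows that standard definition; it is not the authors' evaluation code. It assumes whitespace tokenization for words and counts Unicode code points as characters, which may differ from a grapheme-level count for Bangla script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / max(1, len(ref))


def wer(ref, hyp):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(1, len(ref_words))
```

For a made-up plate string, `cer("DHAKA 1234", "DHAKA 1284")` is one substitution over ten characters, while `wer` on the same pair is one wrong word out of two.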

93. 【2603.10256】ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Link: https://arxiv.org/abs/2603.10256

Authors: Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes

Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Existing video personalization, Existing video, video personalization methods, preserve visual likeness, personalization methods preserve

Comments:

Abstract:Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
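
The two mechanisms named in this abstract can be illustrated with a small sketch. It assumes the standard classifier-free guidance form (extrapolating from the prediction without the reference toward the prediction with it) and integer frame indices for positions; the paper's exact formulation may differ, and both function names are hypothetical.

```python
def temporal_positions(num_ref_frames, num_gen_frames):
    """Give reference tokens negative temporal indices and generated tokens
    non-negative ones, so the two sets never share a position while each
    keeps its internal ordering."""
    ref_positions = list(range(-num_ref_frames, 0))
    gen_positions = list(range(num_gen_frames))
    return ref_positions, gen_positions


def identity_guidance(pred_with_ref, pred_without_ref, scale):
    """Classifier-free-guidance-style contrast (assumed form):
    guided = without + scale * (with - without).
    A scale above 1 amplifies whatever the reference signal contributes."""
    return [wo + scale * (w - wo)
            for w, wo in zip(pred_with_ref, pred_without_ref)]


ref_pos, gen_pos = temporal_positions(3, 4)
print(ref_pos, gen_pos)  # [-3, -2, -1] [0, 1, 2, 3]

# Toy denoiser outputs: only the first component depends on the reference.
print(identity_guidance([1.0, 2.0], [0.0, 2.0], scale=2.0))  # [2.0, 2.0]
```

In the toy call, the component that does not change with the reference is left untouched, while the reference-dependent component is pushed further in the direction the reference suggests.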

94. 【2603.10253】Joint Imaging-ROI Representation Learning via Cross-View Contrastive Alignment for Brain Disorder Classification

Link: https://arxiv.org/abs/2603.10253

Authors: Wei Liang, Lifang He

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: full image volume, global anatomical context, capture global anatomical, constructing ROI-based graphs, modeling the full

Comments:

Abstract:Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. The source code is available at this https URL.
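
A bidirectional contrastive objective of the kind described here is often implemented as a symmetric InfoNCE loss. The sketch below assumes that form, with cosine similarity and a temperature of 0.1; these choices and all names are illustrative and not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy where the positive for query i is key i
    (same subject) and every other key in the batch is a negative."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(queries)


def bidirectional_contrastive_loss(imaging, roi, temperature=0.1):
    """Average of imaging-to-ROI and ROI-to-imaging InfoNCE terms."""
    imaging = [normalize(v) for v in imaging]
    roi = [normalize(v) for v in roi]
    return 0.5 * (info_nce(imaging, roi, temperature)
                  + info_nce(roi, imaging, temperature))


# Toy embeddings for two subjects: aligned views versus swapped views.
img = [[1.0, 0.0], [0.0, 1.0]]
print(bidirectional_contrastive_loss(img, [[1.0, 0.0], [0.0, 1.0]]))  # near 0
print(bidirectional_contrastive_loss(img, [[0.0, 1.0], [1.0, 0.0]]))  # large
```

The loss is small when each subject's imaging and ROI embeddings point the same way and large when they are matched to the wrong subject, which is the behaviour the abstract describes.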

95. 【2603.10237】One Adapter for All: Towards Unified Representation in Step-Imbalanced Class-Incremental Learning

Link: https://arxiv.org/abs/2603.10237

Authors: Xiaoyan Zhang, Jiangpeng He

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: retaining prior knowledge, Class-incremental learning, methods assume balanced, aims to acquire, prior knowledge

Comments: Code is available at [this https URL](https://github.com/xiaoyanzhang1/One-A)

Abstract:Class-incremental learning (CIL) aims to acquire new classes over time while retaining prior knowledge, yet most setups and methods assume balanced task streams. In practice, the number of classes per task often varies significantly. We refer to this as step imbalance, where large tasks that contain more classes dominate learning and small tasks inject unstable updates. Existing CIL methods assume balanced tasks and therefore treat all tasks uniformly, producing imbalanced updates that degrade overall learning performance. To address this challenge, we propose One-A, a unified and imbalance-aware framework that incrementally merges task updates into a single adapter, maintaining constant inference cost. One-A performs asymmetric subspace alignment to preserve dominant subspaces learned from large tasks while constraining low-information updates within them. An information-adaptive weighting balances the contribution between base and new adapters, and a directional gating mechanism selectively fuses updates along each singular direction, maintaining stability in head directions and plasticity in tail ones. Across multiple benchmarks and step-imbalanced streams, One-A achieves competitive accuracy with low inference overhead, showing that a single, asymmetrically fused adapter can remain both adaptive to dynamic task sizes and efficient at deployment.

96. 【2603.10234】Why Does It Look There? Structured Explanations for Image Classification

Link: https://arxiv.org/abs/2603.10234

Authors: Jiarui Li, Zixiang Yin, Samuel J Landry, Zhengming Ding, Ramgopal R. Mettu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remarkable predictive performance, Deep learning models, achieve remarkable predictive, black-box nature limits, nature limits transparency

Comments:

Abstract:Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of "why does it look there" by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate the effectiveness of I2X in revealing the prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we identify uncertain prototypes recognized by I2X and then apply targeted perturbation of samples, which allows fine-tuning to ultimately improve accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.

97. 【2603.10231】OilSAM2: Memory-Augmented SAM2 for Scalable SAR Oil Spill Detection

Link: https://arxiv.org/abs/2603.10231

Authors: Shuaiyu Chen, Ming Yin, Peng Ren, Chunbo Luo, Zeyu Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, imagery remains challenging, severe appearance variability

Comments:

Abstract:Segmenting oil spills from Synthetic Aperture Radar (SAR) imagery remains challenging due to severe appearance variability, scale heterogeneity, and the absence of temporal continuity in real-world monitoring scenarios. While foundation models such as Segment Anything (SAM) enable prompt-driven segmentation, existing SAM-based approaches operate on single images and cannot effectively reuse information across scenes. Memory-augmented variants (e.g., SAM2) further assume temporal coherence, making them prone to semantic drift when applied to unordered SAR image collections. We propose OilSAM2, a memory-augmented segmentation framework tailored for unordered SAR oil spill monitoring. OilSAM2 introduces a hierarchical feature-aware multi-scale memory bank that explicitly models texture, structure, and semantic-level representations, enabling robust cross-image information reuse. To mitigate memory drift, we further propose a structure-semantic consistent memory update strategy that selectively refreshes memory based on semantic discrepancy and structural [...]. Experiments on two public SAR oil spill datasets demonstrate that OilSAM2 achieves state-of-the-art segmentation performance, delivering stable and accurate results under noisy SAR monitoring scenarios. The source code is available at this https URL.

98. 【2603.10220】Robotic Ultrasound Makes CBCT Alive

Link: https://arxiv.org/abs/2603.10220

Authors: Feng Li, Ziyuan Li, Zhongliang Jiang, Nassir Navab, Yuan Bi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Beam Computed Tomography, Cone Beam Computed, Intraoperative Cone Beam, Computed Tomography, Cone Beam

Comments: 10 pages, 4 figures

Abstract:Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combination (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at this https URL.

99. 【2603.10216】An Automated Radiomics Framework for Postoperative Survival Prediction in Colorectal Liver Metastases using Preoperative MRI

Link: https://arxiv.org/abs/2603.10216

Authors: Muhammad Alberb, Jianan Chen, Hossam El-rewaidy, Paul Karanicolas, Arun Seth, Yutaka Amemiya, Anne Martel, Helen Cheung

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remain highly heterogeneous, colorectal liver metastasis, survival prediction, outcomes remain highly, highly heterogeneous

Comments:

Abstract:While colorectal liver metastasis (CRLM) is potentially curable via hepatectomy, patient outcomes remain highly heterogeneous. Postoperative survival prediction is necessary to avoid non-beneficial surgeries and guide personalized therapy. In this study, we present an automated AI-based framework for postoperative CRLM survival prediction using pre- and post-contrast MRI. We performed a retrospective study of 227 CRLM patients who had gadoxetate-enhanced MRI prior to curative-intent hepatectomy between 2013 and 2020. We developed a survival prediction framework comprising an anatomy-aware segmentation pipeline followed by a radiomics pipeline. The segmentation pipeline learns liver, CRLMs, and spleen segmentation from partially-annotated data, leveraging promptable foundation models to generate pseudo-labels. To support this pipeline, we propose SAMONAI, a prompt propagation algorithm that extends Segment Anything Model to 3D point-based segmentation. Predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts per-tumor features and predicts survival using SurvAMINN, an autoencoder-based multiple instance neural network for time-to-event survival prediction. SurvAMINN jointly learns dimensionality reduction and survival prediction from right-censored data, emphasizing high-risk metastases. We compared our framework against established methods and biomarkers using univariate and multivariate Cox regression. Our segmentation pipeline achieves median Dice scores of 0.96 (liver) and 0.93 (spleen), driving a CRLM segmentation Dice score of 0.78 and a detection F1-score of 0.79. Accurate segmentation enables our radiomics pipeline to achieve a survival prediction C-index of 0.69. Our results show the potential of integrating segmentation algorithms with radiomics-based survival analysis to deliver accurate and automated CRLM outcome prediction.
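
The C-index reported in this abstract measures how often a model ranks patients correctly by risk. As a rough illustration, here is a sketch of Harrell's concordance index for right-censored data; the abstract does not say which variant the authors used, and the numbers below are toy values rather than study data.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk score, higher meaning shorter expected survival

    A pair (i, j) is comparable only when patient i had an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event was ranked as riskier
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: four patients, the second one censored at time 4.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risks = [0.8, 0.9, 0.3, 0.4]
print(concordance_index(times, events, risks))  # 0.5
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported 0.69.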

100. 【2603.10212】FusionNet: a frame interpolation network for 4D heart models

Link: https://arxiv.org/abs/2603.10212

Authors: Chujie Chang, Shoko Miyauchi, Ken'ichi Morooka, Ryo Kurazume, Oscar Martinez Mozos

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Cardiac magnetic resonance, diagnose heart disease, standard CMR imaging, visualise cardiac motion, CMR imaging requires

Comments: This is the authors' version. The final authenticated version is available online at [this https URL](https://doi.org/10.1007/978-3-031-47425-5_4). Published in Medical Image Computing and Computer Assisted Intervention - MICCAI 2023 Workshops

Abstract:Cardiac magnetic resonance (CMR) imaging is widely used to visualise cardiac motion and diagnose heart disease. However, standard CMR imaging requires patients to lie still in a confined space inside a loud machine for 40-60 min, which increases patient discomfort. Shorter scan times, in turn, reduce the temporal resolution, the spatial resolution, or both, and thus the diagnostic accuracy of the procedure. Of these, we focus on reduced temporal resolution and propose a neural network called FusionNet to obtain four-dimensional (4D) cardiac motion with high temporal resolution from CMR images captured in a short period of time. The model estimates intermediate 3D heart shapes based on adjacent shapes. In an experimental evaluation, the proposed FusionNet model achieved a Dice coefficient of over 0.897, confirming that it can recover shapes more precisely than existing methods. The code is available at this https URL.
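
The Dice coefficient used to score the recovered heart shapes compares two binary volumes by their overlap. A minimal sketch follows, representing each shape as a set of occupied voxel coordinates; that representation and the function name are assumptions made for illustration only.

```python
def dice_coefficient(voxels_a, voxels_b):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two sets of occupied voxels.

    Two empty shapes are treated as a perfect match (a convention chosen here).
    """
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Toy shapes: three voxels each, two of them shared.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
reference = {(0, 0, 0), (0, 0, 1), (1, 1, 1)}
print(dice_coefficient(predicted, reference))  # 2 * 2 / 6, about 0.667
```

The score ranges from 0 for no overlap to 1 for identical shapes.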

101. 【2603.10210】Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation

Link: https://arxiv.org/abs/2603.10210

Authors: Zitong Wang, Zijun Shen, Haohao Xu, Zhengjie Luo, Weibin Wu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: complex multi-instance scenes, synthesizing complex multi-instance, multi-instance scenes, synthesizing complex, complex multi-instance

Comments:

Abstract:While diffusion models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, using a vision-language model, we extract a differential key $\Delta K$ that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.

102. 【2603.10178】Video-Based Reward Modeling for Computer-Use Agents

Link: https://arxiv.org/abs/2603.10178

Authors: Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Computer-using agents, Execution Video Reward, increasingly capable, execution video, remains difficult

Comments:

Abstract:Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.

103. 【2603.10132】Unbalanced Optimal Transport Dictionary Learning for Unsupervised Hyperspectral Image Clustering

Link: https://arxiv.org/abs/2603.10132

Authors: Joshua Lentz, Nicholas Karris, Alex Cloninger, James M. Murphy

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)

Keywords: Hyperspectral images capture, capture vast amounts, images capture vast, Hyperspectral images, high-dimensional spectral information

Comments: IEEE WHISPERS 2025

Abstract:Hyperspectral images capture vast amounts of high-dimensional spectral information about a scene, making labeling an intensive task that is resistant to out-of-the-box statistical methods. Unsupervised learning of clusters allows for automated segmentation of the scene, enabling a more rapid understanding of the image. Partitioning the spectral information contained within the data via dictionary learning in Wasserstein space has proven an effective method for unsupervised clustering. However, this approach requires balancing the spectral profiles of the data, blurring the classes, and sacrificing robustness to outliers and noise. In this paper, we suggest improving this approach by utilizing unbalanced Wasserstein barycenters to learn a lower-dimensional representation of the underlying data. The deployment of spectral clustering on the learned representation results in an effective approach for the unsupervised learning of labels.

104. 【2603.10128】HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation

Link: https://arxiv.org/abs/2603.10128

Authors: Daichao Zhao, Qiupu Chen, Feng He, Xin Ning, Qiankun Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: percent, autonomous driving, operation of vehicles, crucial task, task in autonomous

Comments: Accepted by CVPR 2026

Abstract:Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead to serious safety-critical failures on the road. To address this issue, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions without requiring re-annotation. Based on this framework, we further construct a benchmark that includes adverse weather and lighting scenarios, containing 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the performance of existing lane detection networks. For example, using the state-of-the-art CLRNet, the overall mF1 score on our benchmark increases by 20.87 percent. The F1@50 score for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75 percent, 8.63 percent, 38.8 percent, 14.96 percent, 26.84 percent, 21.5 percent, and 12.04 percent, respectively. The code and dataset are available at: this https URL.

105. 【2603.10125】4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

Link: https://arxiv.org/abs/2603.10125

Authors: Jin Lyu, Liang An, Pujin Cheng, Yebin Liu, Xiaoying Tang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: equine family, reconstruction, animal welfare, appearance reconstruction, appearance

Comments: Accepted to CVPR 2026

Abstract:4D reconstruction of the equine family (e.g., horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine that disentangles the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on the real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction networks. Project page: this https URL.

106. 【2505.17862】Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Link: https://arxiv.org/abs/2505.17862

Authors: Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent Multimodal Large, Multimodal Large Language, Large Language Models, Recent Multimodal, Multimodal Large

Comments:

Abstract:Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model-modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular baseline that composes off-the-shelf unimodal models, both to serve as a diagnostic reference and to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.

107. 【2603.10845】Human Presence Detection via Wi-Fi Range-Filtered Doppler Spectrum on Commodity Laptops

Link: https://arxiv.org/abs/2603.10845

Authors: Jessica Sanson, Rahul C. Shah, Valerio Frascolla

Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: intelligent power management, Human Presence Detection, enable intelligent power, Human Presence, intelligent power

Comments: 6 pages, Conference

Abstract:Human Presence Detection (HPD) is key to enabling intelligent power management and security features in everyday devices. In this paper we propose the first HPD solution that leverages monostatic Wi-Fi sensing and detects user position using only the built-in Wi-Fi hardware of a device, with no need for external devices, access points, or additional sensors. In contrast, existing HPD solutions for laptops require external dedicated sensors, which add cost and complexity, or rely on camera-based approaches that introduce significant privacy concerns. We introduce the Range-Filtered Doppler Spectrum (RF-DS), a novel Wi-Fi sensing technique for presence estimation that enables both range-selective and temporally windowed detection of user presence. By applying targeted range-area filtering in the Channel Impulse Response (CIR) domain before Doppler analysis, our method focuses processing on task-relevant spatial zones, significantly reducing computational complexity. In addition, the use of temporal windows in the spectrum domain provides greater estimator stability compared to conventional 2D Range-Doppler detectors. Furthermore, we propose an adaptive multi-rate processing framework that dynamically adjusts Channel State Information (CSI) sampling rates, operating at low frame rates (10 Hz) during idle periods and at high rates (100 Hz) only when motion is detected. To our knowledge, this is the first low-complexity solution for occupancy detection using monostatic Wi-Fi sensing on a built-in Wi-Fi network interface controller (NIC) of a commercial off-the-shelf laptop that requires no external network infrastructure or specialized sensors. Our solution can scale across different environments and devices without calibration or retraining.
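
The processing chain outlined in this abstract (CSI to CIR, range filtering, then Doppler analysis) can be sketched on simulated data. Everything below is an illustrative assumption rather than the authors' implementation: the naive DFT, the toy path simulator, the energy-ratio statistic, and all function names are hypothetical. Only the 10 Hz and 100 Hz rates come from the abstract.

```python
import cmath
import math


def dft(values, sign=-1):
    """Naive discrete Fourier transform (sign=+1 gives the unnormalised inverse)."""
    n = len(values)
    return [sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t, v in enumerate(values)) for k in range(n)]


def csi_to_cir(frame):
    """Inverse DFT over subcarriers: frequency-domain CSI to delay-domain CIR."""
    return [v / len(frame) for v in dft(frame, sign=+1)]


def range_filtered_doppler(csi_frames, taps):
    """Keep only the selected CIR delay taps (range bins), then take a DFT
    across frames for each tap and sum the power per Doppler bin."""
    cir_frames = [csi_to_cir(frame) for frame in csi_frames]
    spectrum = [0.0] * len(csi_frames)
    for tap in taps:
        slow_time = [cir[tap] for cir in cir_frames]
        for k, value in enumerate(dft(slow_time)):
            spectrum[k] += abs(value) ** 2
    return spectrum


def motion_energy_ratio(spectrum):
    """Share of energy outside the zero-Doppler bin (static reflections)."""
    total = sum(spectrum)
    return 0.0 if total == 0 else (total - spectrum[0]) / total


def next_sampling_rate_hz(motion_detected):
    """Adaptive multi-rate rule using the rates quoted in the abstract."""
    return 100 if motion_detected else 10


def simulate_csi(num_frames, num_subcarriers, paths):
    """Toy channel: each path is (delay_tap, amplitude, doppler_cycles_per_window)."""
    frames = []
    for m in range(num_frames):
        frame = []
        for k in range(num_subcarriers):
            value = 0j
            for tap, amp, cycles in paths:
                phase = 2 * math.pi * (cycles * m / num_frames
                                       - k * tap / num_subcarriers)
                value += amp * cmath.exp(1j * phase)
            frame.append(value)
        frames.append(frame)
    return frames


# A static wall at tap 1 plus a moving reflector at tap 3.
frames = simulate_csi(16, 8, [(1, 1.0, 0), (3, 0.5, 2)])
near = motion_energy_ratio(range_filtered_doppler(frames, taps=[3]))
far = motion_energy_ratio(range_filtered_doppler(frames, taps=[1]))
print(near > 0.5, far > 0.5)  # True False
print(next_sampling_rate_hz(near > 0.5))  # 100
```

Restricting the analysis to a few delay taps means the Doppler transform runs on far fewer series than a full range-Doppler map, which is one plausible reading of the complexity reduction the abstract claims.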

108. 【2603.10188】ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation

Link: https://arxiv.org/abs/2603.10188

Authors: Sofia Iliopoulou, Dimitris Ampeliotis, Athanassios Skodras

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: substantially outperform traditional, jointly learning compact, learning compact latent, Recent progress, outperform traditional codecs

Comments: 16 pages, 12 figures

Abstract:Recent progress in learning-based image compression has demonstrated that end-to-end optimization can substantially outperform traditional codecs by jointly learning compact latent representations and probabilistic entropy models. However, many existing approaches achieve high rate-distortion efficiency at the expense of increased computational cost and limited parallelism. This paper presents ARCHE (Autoregressive Residual Compression with Hyperprior and Excitation), an end-to-end learned image compression framework that balances modeling accuracy and computational efficiency. The proposed architecture unifies hierarchical, spatial, and channel-based priors within a single probabilistic framework, capturing both global and local dependencies in the latent representation of the image, while employing adaptive feature recalibration and residual refinement to enhance latent representation quality. Without relying on recurrent or transformer-based components, ARCHE attains state-of-the-art rate-distortion efficiency: on the Kodak benchmark dataset, it reduces the BD-Rate by approximately 48% relative to the commonly used benchmark model of Ballé et al., 30% relative to the channel-wise autoregressive model of Minnen and Singh, and 5% relative to the VVC Intra codec. The framework maintains computational efficiency with 95M parameters and a running time of 222 ms per image. Visual comparisons confirm sharper textures and improved color fidelity, particularly at lower bit rates, demonstrating that accurate entropy modeling can be achieved through efficient convolutional designs suitable for practical deployment.