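This blog post presents the latest daily list of papers retrieved from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
573 papers were updated today, including:
- Natural language processing: 97 papers
- Information retrieval: 22 papers
- Computer vision: 135 papers
Natural Language Processing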
1. [2603.24586] Comparing Developer and LLM Biases in Code Evaluation
Link: https://arxiv.org/abs/2603.24586
Authors: Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, Ameet Talwalkar, Wayne Chi, Valerie Chen
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: capture partial context, realistic interactive settings, ambiguous intent, interactive settings, settings that capture
Abstract:As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.
2. [2603.24580] Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Link: https://arxiv.org/abs/2603.24580
Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: achieving sufficient reliability, expert usage remains, usage remains challenging, dense legal language, overlapping regulatory frameworks
Abstract:Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
3. [2603.24579] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Link: https://arxiv.org/abs/2603.24579
Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Categories: Computation and Language (cs.CL)
Keywords: large language models, Retrieval-Augmented Generation, undermining their reliability, real-world applications, remains a critical
Abstract:Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at this https URL.
4. [2603.24549] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English
Link: https://arxiv.org/abs/2603.24549
Authors: Dana Serditova, Kevin Tang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Keywords: Automatic Speech Recognition, performance remains uneven, mainstream accents represented, Automatic Speech, current speech recognition
Comments: 54 pages, 11 figures
Abstract:Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.
ACM classes: I.2; I.2.7; I.5; J.4; J.5
5. [2603.24543] Analysing the Safety Pitfalls of Steering Vectors
Link: https://arxiv.org/abs/2603.24543
Authors: Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Contrastive Activation Addition, weight updates, powerful tool, tool to shape, shape LLM behavior
Abstract:Activation steering has emerged as a powerful tool to shape LLM behavior without the need for weight updates. While its inherent brittleness and unreliability are well-documented, its safety implications remain underexplored. In this work, we present a systematic safety audit of steering vectors obtained with Contrastive Activation Addition (CAA), a widely used steering approach, under a unified evaluation protocol. Using JailbreakBench as benchmark, we show that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions can drastically increase (up to 57%) or decrease (up to 50%) its attack success rate (ASR), depending on the targeted behavior. We attribute this phenomenon to the overlap between the steering vectors and the latent directions of refusal behavior. Thus, we offer a traceable explanation for this discovery. Together, our findings reveal the previously unobserved origin of this safety gap in LLMs, highlighting a trade-off between controllability and safety.
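To make the mechanism in this abstract concrete, here is a minimal sketch of how a CAA-style steering vector is typically formed (the mean activation difference between two contrastive prompt sets) and then added to an activation at inference time. The function names, the toy two-dimensional activations, and the use of cosine similarity as the measure of "overlap" with a refusal direction are illustrative assumptions, not details taken from the paper.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def caa_steering_vector(positive_acts, negative_acts):
    """Mean-difference vector between the two contrastive activation sets."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(activation, steering, alpha):
    """Add the scaled steering vector to one activation; no weights change."""
    return [a + alpha * s for a, s in zip(activation, steering)]


def cosine_overlap(u, v):
    """Cosine similarity, used here as an assumed proxy for 'overlap'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy activations (hypothetical numbers, not real model states).
steer = caa_steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = apply_steering([0.0, 0.0], steer, alpha=2.0)
```

Under this reading, a steering vector with a large overlap with a hypothetical refusal direction would also shift refusal behavior whenever it is applied, which is one way to picture the trade-off between controllability and safety that the abstract describes.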
6. [2603.24536] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation
Link: https://arxiv.org/abs/2603.24536
Authors: Soufiane Jhilal, Martina Galletti
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: Reading comprehension presents, Reading comprehension, reading support, requiring intensive, comprehension presents
Abstract:Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.
7. [2603.24535] Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding
Link: https://arxiv.org/abs/2603.24535
Authors: Conrad Borchers, Jiayi Zhang, Ashish Gurung
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Adaptive scaffolding enhances, field lacks robust, scaffolding enhances learning, Adaptive scaffolding, lacks robust methods
Comments: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.
8. [2603.24481] Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
Link: https://arxiv.org/abs/2603.24481
Authors: John Ray B. Martinez
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Miscalibrated confidence scores, Miscalibrated confidence, obstacle to deploying, Specialist Confidence Score, S-Score Weighted Fusion
Comments: 17 pages, 6 figures. Preprint under review
Abstract:Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
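The headline metric in this abstract is ECE (expected calibration error). As a rough illustration of what that number measures, the sketch below computes ECE with equal-width confidence bins, which is a common formulation. The bin count, the function name, and the toy inputs are assumptions, and the paper may use a different binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# A model that always reports 100% confidence but is right half the time.
overconfident = expected_calibration_error(
    [1.0, 1.0, 1.0, 1.0], [True, False, True, False]
)
```

In the toy case, the gap between stated confidence and observed accuracy is 0.5, the kind of "always overconfident" behavior the abstract says offers no useful signal for deferral.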
9. [2603.24472] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Link: https://arxiv.org/abs/2603.24472
Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: effective post-training paradigm, Self-distillation has emerged, paradigm for LLMs, effective post-training, post-training paradigm
Abstract:Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
10. [2603.24470] Counting Without Numbers \ Finding Without Words
Link: https://arxiv.org/abs/2603.24470
Authors: Badri Narayana Patro
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Keywords: million pets enter, pets enter shelters, million pets, enter shelters, pets enter
Abstract:Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.
11. [2603.24465] Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving
Link: https://arxiv.org/abs/2603.24465
Authors: Ruichen Qiu, Yichuan Cao, Junqi Liu, Dakai Guo, Xiao-Shan Gao, Lihong Zhi, Ruyong Feng
Categories: Computation and Language (cs.CL)
Keywords: large language models, Recent advances, automated theorem proving, advances in large, large language
Abstract:Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to progressively longer contexts which progressively degrades the model's ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.
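For readers unfamiliar with Lean, the following small Lean 4 sketch shows what a `sorry` placeholder looks like. The theorem names and the statement are hypothetical, chosen only for illustration; they do not come from the paper. The first theorem is a proof skeleton whose overall structure checks while two subgoals remain open. The second shows one of those subgoals lifted into a self-contained statement that could be attempted independently, which is roughly the kind of isolation the abstract describes.

```lean
-- Proof skeleton: the final step is verified, two subgoals are left open.
theorem sum_sq_nonneg (a b : Int) : 0 ≤ a * a + b * b := by
  have ha : 0 ≤ a * a := by
    sorry -- unresolved subgoal 1
  have hb : 0 ≤ b * b := by
    sorry -- unresolved subgoal 2
  exact Int.add_nonneg ha hb

-- The same open subgoal, extracted into its own clean context.
theorem sq_nonneg_extracted (a : Int) : 0 ≤ a * a := by
  sorry -- to be filled by a separate proof attempt
```

Lean accepts both declarations with a warning that `sorry` was used, so the surrounding structure can be checked before every subgoal is closed.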
12. [2603.24432] What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification
Link: https://arxiv.org/abs/2603.24432
Authors: Massa Baali, Sarthak Bisht, Rita Singh, Bhiksha Raj
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Keywords: large scale remains, fixed-margin losses treat, large scale, scale remains, remains an open
Abstract:Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.
13. [2603.24422] OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Link: https://arxiv.org/abs/2603.24422
Authors: Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Generative Retrieval, generative search framework, promising paradigm, paradigm for modern, modern search systems
Comments: Key codes are available at [this https URL](https://github.com/benchen4395/onesearch-family). Feel free to contact benchen4395@gmail.com
Abstract:Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
14. [2603.24413] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation
Link: https://arxiv.org/abs/2603.24413
Authors: Manoj Balaji Jagadeeshan, Atul Singh, Nallani Chakravartula Sahith, Amrith Krishna, Pawan Goyal
Categories: Computation and Language (cs.CL)
Keywords: strict prosodic rules, Sanskrit typically requires, prosodic rules, generation in Sanskrit, semantically coherent
Abstract:Poetry generation in Sanskrit typically requires the verse to be semantically coherent and adhere to strict prosodic rules. In Sanskrit prosody, every line of a verse is typically a fixed-length sequence of syllables adhering to prescribed binary patterns of syllable weights. We observe that instead of treating a verse as a monolithic sequence, segmenting it into grouped lines leads to a significant improvement in semantic coherence of 10% with comparable metrical adherence. Specifically, PINGALA, our proposed decoding approach, is designed to encourage every line to have well-formed words, and our token selection biases the model towards it by preferring longer tokens. Writing in Sanskrit follows phonemic orthography, hence using a phonetically aware transliteration scheme, SLP1, increased the metrical alignment by 46% with comparable semantic similarity, for an instruction fine-tuned large language model like Phi-4. We also introduce a new approach for reference-free evaluation using cross-encoders which achieved better alignment with true poetry instances.
15. [2603.24389] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
Link: https://arxiv.org/abs/2603.24389
Authors: Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: critical scalability challenge, High-quality teacher-child interaction, traditional expert-based assessment, expert-based assessment faces, High-quality teacher-child
Comments: Accepted to AIED 2026. Project page: [this https URL](https://qingyonghu.github.io/Interaction2Eval/)
Abstract:High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China's, serving 36 million children across 250,000+ kindergartens, the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) We develop Interaction2Eval, a specialized LLM-based framework addressing domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning), achieving up to 88% agreement; (3) Deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education, one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.
16. [2603.24375] Towards Reward Modeling for AI Tutors in Math Mistake Remediation
Link: https://arxiv.org/abs/2603.24375
Authors: Kseniia Petukhova, Ekaterina Kochmar
Categories: Computation and Language (cs.CL)
Keywords: standard NLG metrics, tutors remains challenging, standard NLG, NLG metrics, responses identify mistakes
Abstract:Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not determine whether responses identify mistakes, scaffold reasoning, or avoid revealing the answers. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.
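As background for the Bradley-Terry preference models and the pairwise accuracy figures mentioned in this abstract, the sketch below shows the standard Bradley-Terry formulation: the probability that one response is preferred is a sigmoid of the score difference, and training minimizes the negative log-likelihood of observed preferences. The scalar scores are hypothetical stand-ins for reward-model outputs, and the function names are illustrative; the paper's weighted-sum ranking construction is not reproduced here.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_preference_probability(score_a, score_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return math.exp(log_sigmoid(score_a - score_b))


def bt_pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the observed preference."""
    return -log_sigmoid(score_chosen - score_rejected)


def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) score pairs ranked in the right order."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    hits = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return hits / len(score_pairs)


# Hypothetical reward scores for four (chosen, rejected) response pairs.
toy_pairs = [(1.0, 0.0), (0.2, 0.5), (2.0, 1.0), (0.1, -0.3)]
toy_accuracy = pairwise_accuracy(toy_pairs)
```

With these toy scores, three of the four pairs are ordered correctly, giving a pairwise accuracy of 0.75; the abstract's 0.69 and 0.74 are the same kind of statistic computed on a human preference test set.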
17. [2603.24372] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning
Link: https://arxiv.org/abs/2603.24372
Authors: Arsen Shebzukhov
Categories: Computation and Language (cs.CL)
Keywords: AI-assisted mathematical research, accelerate AI-assisted mathematical, automatically translating natural, language mathematical texts, formal proof language
Comments: 10 pages, 10 figures, pages 10-27 appendix
Abstract:Autoformalization - automatically translating natural language mathematical texts into formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL' loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
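The cycle consistency score described in this abstract (NL to Lean4 to NL', compared by cosine similarity of sentence embeddings) can be sketched as follows. This is only an illustration: `toy_embed` is a word-count stand-in for a real sentence-embedding model, and `formalize` and `informalize` are hypothetical callables representing the two translation directions. None of these names come from the paper.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model: word counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cycle_consistency(nl, formalize, informalize, embed=toy_embed):
    """Score the loop NL -> Lean4 -> NL' by the similarity of NL and NL'."""
    lean_statement = formalize(nl)          # NL -> Lean4 (assumed callable)
    nl_back = informalize(lean_statement)   # Lean4 -> NL' (assumed callable)
    return cosine_similarity(embed(nl), embed(nl_back))


# A perfect round trip scores about 1; an unrelated back-translation scores 0.
identity = lambda s: s
perfect = cycle_consistency("the sum of two even numbers is even", identity, identity)
```

Because the score depends only on the round trip, it can serve as a reward signal without reference formalizations, which appears to be why the abstract pairs it with GRPO.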
18. [2603.24329] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Link: https://arxiv.org/abs/2603.24329
Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal LLMs, LLMs are increasingly, increasingly deployed, deployed as perceptual, perceptual backbones
Abstract:Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
19. 【2603.24307】Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation
Link: https://arxiv.org/abs/2603.24307
Authors: N J Karthika,Keerthana Suryanarayanan,Jahanvi Purohit,Ganesh Ramakrishnan,Jitin Singla,Anil Kumar Gourishetty
Categories: Computation and Language (cs.CL)
Keywords: release Samasāmayik, meticulously curated, parallel sentences, large-scale Hindi-Sanskrit corpus, Samasāmayik
Comments:
Abstract:We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.
20. 【2603.24258】Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
Link: https://arxiv.org/abs/2603.24258
Authors: He Huang
Categories: Computation and Language (cs.CL)
Keywords: Ancient Egyptian, study word-level semantic, word-level semantic alignment, study word-level, word-level semantic
Comments: Accepted to LREC 2026
Abstract:We study word-level semantic alignment across four historical stages of Ancient Egyptian. These stages differ in script and orthography, and parallel data are scarce. We jointly train a compact encoder-decoder model with a shared byte-level tokenizer on all four stages, combining masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging under a task-aware loss with fixed weights and uncertainty-based scaling. To reduce surface divergence we add Latin transliteration and IPA reconstruction as auxiliary views. We integrate these views through KL-based consistency and through embedding-level fusion. We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets. Translation yields the strongest gains. IPA with KL consistency improves cross-branch alignment, while early fusion demonstrates limited efficacy. Although the overall alignment remains limited, the findings provide a reproducible baseline and practical guidance for modeling historical languages under real constraints. They also show how normalization and task design shape what counts as alignment in typologically distant settings.
21. 【2603.24246】Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution
Link: https://arxiv.org/abs/2603.24246
Authors: Julia Matela,Frank Krüger
Categories: Computation and Language (cs.CL)
Keywords: Cross-Document Coreference Resolution, Shared Task, Task for Cross-Document, Cross-Document Coreference, Coreference Resolution
Comments:
Abstract:This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, a Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large-scale setting of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively.
22. 【2603.24242】Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition
Link: https://arxiv.org/abs/2603.24242
Authors: Aleix Sant,Jordi Luque,Carlos Escolano
Categories: Computation and Language (cs.CL)
Keywords: environments presents significant, Large Language Models, presents significant challenges, significant challenges stemming, language resource availability
Comments: 12 pages, 4 figures, 5 tables
Abstract:Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.
23. 【2603.24231】Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection
Link: https://arxiv.org/abs/2603.24231
Authors: Bowen Zhang
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Keywords: Stance detection, formulated as classifying, convention inherited, inherited from debate, debate analysis
Comments:
Abstract:Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral -- a convention inherited from debate analysis and applied without modification to social media since SemEval-2016. But attitudes toward complex targets are not unitary: a person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators weight different dimensions -- producing disagreement that reflects not confusion but different compression choices. We call this the projection problem, and show that its cost is conditional: when a text's dimensions align, any weighting yields the same label and three-way annotation works well; when dimensions conflict, label agreement collapses while agreement on individual dimensions remains intact. A pilot study on SemEval-2016 Task 6 confirms this crossover: on dimension-consistent texts, label agreement (Krippendorff's $\alpha = 0.307$) exceeds dimensional agreement ($\alpha = 0.082$); on dimension-conflicting texts, the pattern reverses -- label $\alpha$ drops to $0.085$ while dimensional $\alpha$ rises to $0.334$, with Policy reaching $0.572$. The projection problem is real -- but it activates precisely where it matters most.
24. 【2603.24222】Variation is the Norm: Embracing Sociolinguistics in NLP
Link: https://arxiv.org/abs/2603.24222
Authors: Anne-Marie Lutgen,Alistair Plum,Verena Blaschke,Barbara Plank,Christoph Purschke
Categories: Computation and Language (cs.CL)
Keywords: Natural Language Processing, Natural Language, Language Processing, Processing, variation
Comments: Accepted at LREC 2026
Abstract:In Natural Language Processing (NLP), variation is typically seen as noise and "normalised away" before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.
25. 【2603.24150】A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings
Link: https://arxiv.org/abs/2603.24150
Authors: Rami Luisto
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: contextually relevant properties, word pairs, contextually relevant, relevant properties, antonymic word pairs
Comments: Code available at [this https URL](https://github.com/ramiluisto/CuriousSwirl.git)
Abstract:Antonyms, or opposites, are sometimes defined as word pairs that have all of the same contextually relevant properties but one. Seeing how transformer models seem to encode concepts as directions, this raises the question of whether one can detect "antonymity" in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites: synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious "swirl" that appears across embedding models in a somewhat specific projection configuration.
MSC classes: 68T50 (Primary); 62H30, 68T09 (Secondary)
26. 【2603.24132】MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare
Link: https://arxiv.org/abs/2603.24132
Authors: Shubham Kumar Nigam,Suparnojit Sarkar,Piyush Patel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: professionals is limited, potential to assist, assist users, users in preliminary, settings where access
Comments:
Abstract:Conversational artificial intelligence has the potential to assist users in preliminary medical consultations, particularly in settings where access to healthcare professionals is limited. However, many existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. In this work, we introduce MedAidDialog, a multilingual multi-turn medical dialogue dataset designed to simulate realistic physician--patient consultations. The dataset extends the MDDial corpus by generating synthetic consultations using large language models and further expands them into a parallel multilingual corpus covering seven languages: English, Hindi, Telugu, Tamil, Bengali, Marathi, and Arabic. Building on this dataset, we develop MedAidLM, a conversational medical model trained using parameter-efficient fine-tuning on quantized small language models, enabling deployment without high-end computational infrastructure. Our framework additionally incorporates optional patient pre-context information (e.g., age, gender, allergies) to personalize the consultation process. Experimental results demonstrate that the proposed system can effectively perform symptom elicitation through multi-turn dialogue and generate diagnostic recommendations. We further conduct medical expert evaluation to assess the plausibility and coherence of the generated consultations.
27. 【2603.24125】Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study
Link: https://arxiv.org/abs/2603.24125
Authors: Nour Bouchouchi,Thibault Laugel,Xavier Renard,Christophe Marsala,Marie-Jeanne Lesot,Marcin Detyniecki
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, learn social regularities, Language Models, learn social
Comments:
Abstract:During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.
28. 【2603.24124】The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
Link: https://arxiv.org/abs/2603.24124
Authors: Mingyi Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: single semantic cluster, produce a single, single semantic, semantic cluster, questions produce
Comments: 23 pages, 3 figures, 10 tables, 22 experiments across 5 benchmarks. Code: [this https URL](https://github.com/DigitLion/ucbd-experiment)
Abstract:RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows a 1.0% single-cluster rate (SCR) vs. 28.5% for the instruct model (p < 10^{-6}). A training stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding -- response homogenization -- is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| = 0.12) enable 57% cost savings.
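The selective-prediction figure quoted in this abstract (accuracy at 50% coverage) follows a standard recipe: rank examples by an uncertainty score, answer only the most confident fraction, and measure accuracy on that subset. A minimal sketch follows, assuming a single scalar uncertainty per example. It does not reproduce the paper's UCBD cascade, and the function name is hypothetical.

```python
from typing import Sequence


def selective_accuracy(
    uncertainty: Sequence[float],
    correct: Sequence[bool],
    coverage: float,
) -> float:
    """Accuracy on the `coverage` fraction of examples with the lowest uncertainty."""
    if len(uncertainty) != len(correct) or not uncertainty:
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Keep at least one example so the ratio is always defined.
    n_keep = max(1, int(coverage * len(uncertainty)))
    # Most confident (lowest uncertainty) examples first.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    kept = order[:n_keep]
    return sum(1 for i in kept if correct[i]) / n_keep


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    scores = [0.1, 0.9, 0.2, 0.8]
    labels = [True, False, True, True]
    print(selective_accuracy(scores, labels, coverage=0.5))
    print(selective_accuracy(scores, labels, coverage=1.0))
```

A useful uncertainty signal makes accuracy rise as coverage shrinks. A signal with AUROC near 0.5, as the abstract reports for sampling-based methods on homogenized questions, would leave it roughly flat.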
29. 【2603.24080】LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale
Link: https://arxiv.org/abs/2603.24080
Authors: Muhammed Saeed,Simon Razniewski
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Keywords: MMLU suggest flagship, suggest flagship language, MMLU suggest, language models approach, flagship language models
Comments:
Abstract:Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. LLMpedia generates encyclopedic articles entirely from parametric memory, producing ~1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7% -- more than 15 percentage points below the benchmark-based picture, consistent with the availability bias of fixed-question evaluation. Beyond Wikipedia, frontier subjects verifiable only through curated web evidence fall further to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, every prompt, artifact, and evaluation verdict is publicly released, making LLMpedia the first fully open parametric encyclopedia -- bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are at this https URL.
30. 【2603.24073】ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing
Link: https://arxiv.org/abs/2603.24073
Authors: Yu-Chen Kang,Yu-Chien Tang,An-Zi Yen
Categories: Computation and Language (cs.CL)
Keywords: Knowledge Tracing, modeling student knowledge, support personalized learning, critical technique, technique for modeling
Comments: Accepted by LREC 2026
Abstract:Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.
31. 【2603.24051】FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval
Link: https://arxiv.org/abs/2603.24051
Authors: Caishuang Huang,Yang Qiao,Rongyu Zhang,Junjie Ye,Pu Lu,Wenxi Wu,Meng Zhou,Xiku Du,Tao Gui,Qi Zhang,Xuanjing Huang
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, vital for Large, massive investment targets, Language Models
Comments:
Abstract:Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce FinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.
32. 【2603.24044】MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
Link: https://arxiv.org/abs/2603.24044
Authors: Andrea Manzoni
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Standard LoRA fine-tuning, Standard LoRA, highly skewed, rarely activated, handles most tokens
Comments: 17 pages, 6 figures, 10 tables
Abstract:Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.
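The selection step this abstract outlines (profile routing counts on a calibration set, then keep the top-k most-routed experts per layer) can be sketched in a few lines. The data layout (a dict mapping layer index to per-expert token counts) and the function name are assumptions made for illustration. Attaching LoRA adapters to the chosen experts is outside the sketch.

```python
import math
from typing import Dict, List


def select_hot_experts(
    routing_counts: Dict[int, List[int]],
    fraction: float = 0.25,
) -> Dict[int, List[int]]:
    """Pick the most-routed experts in each layer.

    routing_counts[layer][e] is the number of calibration tokens routed to
    expert e in that layer (hypothetical layout). Returns, per layer, the
    sorted indices of the top `fraction` of experts by count.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    selected: Dict[int, List[int]] = {}
    for layer, counts in routing_counts.items():
        # Always keep at least one expert per layer.
        k = max(1, math.ceil(fraction * len(counts)))
        ranked = sorted(range(len(counts)), key=lambda e: counts[e], reverse=True)
        selected[layer] = sorted(ranked[:k])
    return selected


if __name__ == "__main__":
    # Illustrative counts only: 8 experts per layer, skewed routing.
    counts = {
        0: [5, 100, 3, 40, 1, 0, 2, 7],
        1: [9, 1, 1, 50, 2, 3, 80, 4],
    }
    print(select_hot_experts(counts, fraction=0.25))
```

The returned indices would then decide which expert modules receive adapters. The abstract reports that a uniform top-k per layer did about as well as a greedy per-layer budget, which is why the sketch uses the same fraction for every layer.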
33. 【2603.24034】From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
Link: https://arxiv.org/abs/2603.24034
Authors: Xiaoyong Guo,Nanjie Li,Zijie Zeng,Kai Wang,Hao Huang,Haihua Xu,Wei Shi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: contextual exposure bias, Contextual automatic speech, term contextual exposure, automatic speech recognition, Direct Preference Optimization
Comments:
Abstract:Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves it to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% to 5.63%), indicating improved robustness to misleading context. Our code and models are published on this https URL.
34. 【2603.24023】Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale
Link: https://arxiv.org/abs/2603.24023
Authors: Chinmay Soni,Shivam Chourasia,Gaurav Kumar,Hitesh Kapoor
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: significant industry challenge, per-token API costs, prohibitive per-token API, Applying large, proprietary API-based language
Comments: 8 pages, 6 figures. Published in the Proceedings of the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), 2026
Abstract:Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India's largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.
35. 【2603.24012】CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation
Link: https://arxiv.org/abs/2603.24012
Authors: Wassim Swaileh,Mohammed-En-Nadhir Zighem,Hichem Telli,Salah Eddine Bekhouche,Abdellah Zakaria Sellam,Fadi Dornaika,Dimitrios Kotzinos
Categories: Computation and Language (cs.CL)
Keywords: consistent final distribution, Ilm al-Mawarith, Islamic inheritance, eligible heirs, resolution of blocking
Comments:
Abstract:Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.
36. 【2603.24004】Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning
Link: https://arxiv.org/abs/2603.24004
Authors: Kun-Yang Yu,Zhi Zhou,Shi-Yu Tian,Xiao-Wen Yang,Zi-Yi Jia,Ming Yang,Zi-Jian Cheng,Lan-Zhe Guo,Yu-Feng Li
Categories: Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, demonstrated remarkable reasoning
Comments: 20 pages, 6 figures
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and codes are available at this https URL.
37. 【2603.23998】Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping
链接:https://arxiv.org/abs/2603.23998
作者:Yao Chen,Yilong Chen,Yinqi Yang,Junyuan Shang,Zhenyu Zhang,Zefeng Zhang,Shuaiyi Nie,Shuohuan Wang,Yu Sun,Hua Wu,HaiFeng Wang,Tingwen Liu
类目:Computation and Language (cs.CL)
关键词:Existing approaches, Transformers predominantly rely, extending computation, recursive execution, predominantly rely
备注:
点击查看摘要
Abstract:Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16-20% to only 1-3% relative to a standard Transformer backbone.
38. 【2603.23989】CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web QA via Concept-oriented Context Reconstruction
链接:https://arxiv.org/abs/2603.23989
作者:Kaize Shi,Xueyao Sun,Qika Lin,Firoj Alam,Qing Li,Xiaohui Tao,Guandong Xu
类目:Computation and Language (cs.CL)
关键词:shown promising results, Retrieval-augmented generation, shown promising, promising results, results in enhancing
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has shown promising results in enhancing QA by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web QA benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.
39. 【2603.23972】Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
链接:https://arxiv.org/abs/2603.23972
作者:Somaya Eltanbouly,Samer Rashwani
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Quran and Hadith, achieved remarkable progress, Large language models, Large language, achieved remarkable
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: this https URL.
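The agreement figure in this abstract (kappa = 0.87) is reported without naming the statistic. Assuming it is Cohen's kappa, which the abstract does not confirm, the sketch below shows how such a judge-versus-human agreement score is typically computed. The label lists are invented for illustration and are unrelated to the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from an LLM judge and a human annotator.
judge = ["correct", "correct", "wrong", "correct", "wrong", "wrong", "correct", "correct"]
human = ["correct", "correct", "wrong", "wrong", "wrong", "wrong", "correct", "correct"]

kappa = cohen_kappa(judge, human)
print(round(kappa, 3))
```

The raters agree on 7 of 8 items here, yet kappa is lower than the raw agreement rate because half of that agreement would be expected by chance.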
40. 【2603.23971】The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More
链接:https://arxiv.org/abs/2603.23971
作者:Lingjiao Chen,Chi Zhang,Yeye He,Ion Stoica,Matei Zaharia,James Zou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:consumers increasingly choose, Developers and consumers, choose reasoning language, increasingly choose reasoning, consumers increasingly
备注:
点击查看摘要
Abstract:Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
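The reversal described in this abstract follows from simple arithmetic: total cost depends on how many tokens a model emits as well as on the per-token price. The sketch below is a toy illustration under stated assumptions. The prices and token counts are made up, and thinking tokens are assumed to be billed at the output rate, which may not hold for every provider.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one query (all values here are hypothetical)."""
    prompt_tokens: int
    thinking_tokens: int
    answer_tokens: int


def query_cost(usage, input_price_per_m, output_price_per_m):
    """Dollar cost of one query, given prices per million tokens.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_output = usage.thinking_tokens + usage.answer_tokens
    total = usage.prompt_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# The "cheap" model lists a 4x lower price but uses 10x the thinking tokens.
cheap = query_cost(Usage(1_000, 40_000, 500), input_price_per_m=0.5, output_price_per_m=3.0)
pricey = query_cost(Usage(1_000, 4_000, 500), input_price_per_m=2.0, output_price_per_m=12.0)

price_reversal = cheap > pricey
print(f"cheap=${cheap:.3f} pricey=${pricey:.3f} reversal={price_reversal}")
```

With these invented numbers the lower-priced model costs more than twice as much per query, which is why per-request cost monitoring is more informative than the price list alone.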
41. 【2603.23951】From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents
链接:https://arxiv.org/abs/2603.23951
作者:Sirui Xia,Yikai Zhang,Aili Chen,Siye Wu,Siyu Yuan,Yanghua Xiao
类目:Computation and Language (cs.CL)
关键词:costly manual process, manual process requiring, process requiring repeated, requiring repeated mechanism-level, repeated mechanism-level modification
备注:
点击查看摘要
Abstract:Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.
42. 【2603.23949】Argument Mining as a Text-to-Text Generation Task
链接:https://arxiv.org/abs/2603.23949
作者:Masayuki Kawarada,Tsutomu Hirao,Wataru Uchida,Masaaki Nagata
类目:Computation and Language (cs.CL)
关键词:Argument Mining, aims to uncover, argumentative structures, Argument, Mining
备注:
点击查看摘要
Abstract:Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures. Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).
43. 【2603.23938】OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models
链接:https://arxiv.org/abs/2603.23938
作者:Seunghee Kim,Bumkyu Park,Kyudan Jung,Joosung Lee,Soyoon Kim,Jeonghoon Kim,Taeuk Kim,Hwiyeol Jo
类目:Computation and Language (cs.CL)
关键词:assess multimodal understanding, textual outputs, leaving it unclear, speak their answers, omni-modal models assess
备注:
点击查看摘要
Abstract:Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.
44. 【2603.23937】Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development
链接:https://arxiv.org/abs/2603.23937
作者:Zongliang Ji,Ziyang Zhang,Xincheng Tan,Matthew Thompson,Anna Goldenberg,Carl Yang,Rahul G. Krishnan,Fan Zhang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:primary care settings, fast-paced primary care, Evidence-based medicine, central to high-quality, remains difficult
备注: 9 pages. To appear in Proceedings of Machine Learning Research (PMLR), Machine Learning for Health (ML4H) Symposium 2025
点击查看摘要
Abstract:Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.
45. 【2603.23933】ORACLE: Orchestrate NPC Daily Activities using Contrastive Learning with Transformer-CVAE
链接:https://arxiv.org/abs/2603.23933
作者:Seong-Eun Hong,JuYeong Hwang,RyunHa Lee,HyeongYeop Kang
类目:Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:augment user immersion, Non-player characters, integration of Non-player, cognitive engagement, digital environments
备注: 17 pages, 7 figures. Accepted to CVM 2026
点击查看摘要
Abstract:The integration of Non-player characters (NPCs) within digital environments has been increasingly recognized for its potential to augment user immersion and cognitive engagement. The sophisticated orchestration of their daily activities, reflecting the nuances of human daily routines, contributes significantly to the realism of digital environments. Nevertheless, conventional approaches often produce monotonous repetition, falling short of capturing the intricacies of real human activity plans. In response to this, we introduce ORACLE, a novel generative model for the synthesis of realistic indoor daily activity plans, ensuring NPCs' authentic presence in digital habitats. Exploiting the CASAS smart home dataset's 24-hour indoor activity sequences, ORACLE addresses challenges in the dataset, including its imbalanced sequential data, the scarcity of training samples, and the absence of pre-trained models encapsulating human daily activity patterns. ORACLE's training leverages the sequential data processing prowess of Transformers, the generative controllability of Conditional Variational Autoencoders (CVAE), and the discriminative refinement of contrastive learning. Our experimental results validate the superiority of generating NPC activity plans and the efficacy of our design strategies over existing methods.
46. 【2603.23911】Self-Distillation for Multi-Token Prediction
链接:https://arxiv.org/abs/2603.23911
作者:Guoliang Zhao,Ruobing Xie,An Wang,Shuaipeng Li,Huaibing Xie,Xingwu Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Language Models, Large Language, MTP, critical bottleneck
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and further significant inference speedup to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.
47. 【2603.23848】BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
链接:https://arxiv.org/abs/2603.23848
作者:Praveen Kumar Myakala,Manan Agrawal,Rahul Manche
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:long-running conversational agents, memory treats user, treats user information, major benchmark evaluating, conversational agents
备注:
点击查看摘要
Abstract:LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That's the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).
48. 【2603.23844】Language Model Planners do not Scale, but do Formalizers?
链接:https://arxiv.org/abs/2603.23844
作者:Owen Jiang,Cassie Huang,Ashish Sabharwal,Li Zhang
类目:Computation and Language (cs.CL)
关键词:Recent work shows, Recent work, work shows overwhelming, shows overwhelming evidence, LLM formalizers
备注:
点击查看摘要
Abstract:Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily when solving planning problems too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve its robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.
49. 【2603.23841】PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
链接:https://arxiv.org/abs/2603.23841
作者:Rohan Khetan,Ashna Khetan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, sources of information, impact their objectivity, Language Models
备注: 13 pages, 8 tables, 3 figures
点击查看摘要
Abstract:While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots' deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.
50. 【2603.23840】VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
链接:https://arxiv.org/abs/2603.23840
作者:Yuhao Chen,Yi Xu,Xinyun Ding,Xiang Fang,Shuochen Liu,Luxi Lin,Qingyu Zhang,Ya Li,Quan Liu,Tong Xu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:intelligent in-vehicle experiences, growing demand, demand for intelligent, evolving from simple, simple assistants
备注:
点击查看摘要
Abstract:With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions. This evolution requires agents to continuously model multi-user preferences and make reliable decisions in the face of inter-user preference conflicts and changing habits over time. However, existing benchmarks are largely limited to single-user, static question-answer settings, failing to capture the temporal evolution of preferences and the multi-user, tool-interactive nature of real vehicle environments. To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment. The benchmark evaluates tool use and memory by comparing the post-action environment state with a predefined target state, enabling objective and reproducible evaluation without LLM-based or human scoring. VehicleMemBench includes 23 tool modules, and each sample contains over 80 historical memory events. Experiments show that powerful models perform well on direct instruction tasks but struggle in scenarios involving memory evolution, particularly when user preferences change dynamically. Even advanced memory systems struggle to handle domain-specific memory requirements in this environment. These findings highlight the need for more robust and specialized memory management mechanisms to support long-term adaptive decision-making in real-world in-vehicle systems. To facilitate future research, we release the data and code.
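The evaluation idea in this abstract, comparing the post-action environment state with a predefined target state, can be sketched in a few lines. This is only an illustration of state-based scoring in general and not the benchmark's actual harness. The field names and values below are hypothetical.

```python
def state_matches(post_action_state, target_state):
    """Check an environment state against a target state.

    Returns (passed, mismatches), where mismatches maps each failing key
    to a (found, expected) pair. Keys absent from the target are ignored.
    """
    mismatches = {
        key: (post_action_state.get(key), expected)
        for key, expected in target_state.items()
        if post_action_state.get(key) != expected
    }
    return (not mismatches), mismatches


# Hypothetical in-vehicle state after the agent has called its tools.
target = {"driver_seat_heat": 2, "cabin_temp_c": 21, "playlist": "jazz"}
after_agent = {
    "driver_seat_heat": 2,
    "cabin_temp_c": 23,  # agent applied an outdated preference
    "playlist": "jazz",
    "sunroof": "closed",  # not part of the target, so not scored
}

passed, diff = state_matches(after_agent, target)
print(passed, diff)
```

Because the check is a deterministic comparison, the same agent trajectory always receives the same score, which is the property the abstract highlights over LLM-based or human scoring.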
51. 【2603.23822】How Vulnerable Are Edge LLMs?
链接:https://arxiv.org/abs/2603.23822
作者:Ao Ding,Hongzong Li,Zi Liang,Zhanpeng Shi,Shuxin Zhuang,Shiqin Tang,Rong Feng,Ping Lu
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large language models, implications remain unclear, Large language, security implications remain, remain unclear
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed on edge devices under strict computation and quantization constraints, yet their security implications remain unclear. We study query-based knowledge extraction from quantized edge-deployed LLMs under realistic query budgets and show that, although quantization introduces noise, it does not remove the underlying semantic knowledge, allowing substantial behavioral recovery through carefully designed queries. To systematically analyze this risk, we propose CLIQ (Clustered Instruction Querying), a structured query construction framework that improves semantic coverage while reducing redundancy. Experiments on quantized Qwen models (INT8/INT4) demonstrate that CLIQ consistently outperforms original queries across BERTScore, BLEU, and ROUGE, enabling more efficient extraction under limited budgets. These results indicate that quantization alone does not provide effective protection against query-based extraction, highlighting a previously underexplored security risk in edge-deployed LLMs.
52. 【2603.23821】Perturbation: A simple and efficient adversarial tracer for representation learning in language models
链接:https://arxiv.org/abs/2603.23821
作者:Joshua Rozner,Cory Shain
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:neural language models, deep neural language, language models, studied for decades, theoretical reasons
备注:
点击查看摘要
Abstract:Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al. 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.
53. 【2603.23797】Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations
链接:https://arxiv.org/abs/2603.23797
作者:Margaret Cychosz,Adriana Weisleder
类目:Computation and Language (cs.CL)
关键词:language development milestones, reach major language, major language development, development milestones, speech
备注:
点击查看摘要
Abstract:Children in many parts of the world hear relatively little speech directed to them, yet still reach major language development milestones. What differs about the speech input that infants learn from when directed input is rare? Using longform, infant-centered audio recordings taken in rural Bolivia and the urban U.S., we examined temporal patterns of infants' speech input and their pre-linguistic vocal behavior. We find that child-directed speech in Bolivia, though less frequent, was just as temporally clustered as speech input in the U.S., arriving in concentrated bursts rather than spread across the day. In both communities, infants were most likely to produce speech-like vocalizations during periods of speech directed to them, with the probability of infants' speech-like vocalizations during target child-directed speech nearly double that during silence. In Bolivia, infants' speech-like vocalizations were also more likely to occur during bouts of directed speech from older children than from adults. Together, these findings suggest that the developmental impact of child-directed speech may depend not only on quantity, but on temporal concentration and source, with older children serving as an important source of input in some communities, including where adult speech to infants is less frequent.
54. 【2603.23750】IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge
链接:https://arxiv.org/abs/2603.23750
作者:Ali Abdelaal,Mohammed Nader Al Haffar,Mahmoud Fawzi,Walid Magdy
类目:Computation and Language (cs.CL)
关键词:Large language models, core Islamic disciplines, Large language, Islamic knowledge, increasingly consulted
备注: Leaderboard link: https://huggingface.co/spaces/islamicmmlu/leaderboard
点击查看摘要
Abstract:Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track is formed of multiple types of questions to examine LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, where their averaged accuracy across the three tracks varied from 39.8% to 93.8% (by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.
55. 【2603.23714】LLMs Do Not Grade Essays Like Humans
链接:https://arxiv.org/abs/2603.23714
作者:Jerin George Mathew,Sumayya Taher,Anindita Kundu,Denilson Barbosa
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, Large language, GPT and Llama, recently been proposed, proposed as tools
备注:
点击查看摘要
Abstract:Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring.
56. 【2603.23701】The Diminishing Returns of Early-Exit Decoding in Modern LLMs
Link: https://arxiv.org/abs/2603.23701
Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Model, Large Language, Language Model, sufficiently confident, latency and cost
Comments:
Abstract:In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.
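To make the early-exit idea in this abstract concrete, here is a minimal sketch, assuming the per-layer logits are already available as plain lists. The function name `early_exit_layer`, the 0.9 confidence threshold, and the toy logits are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_layer(per_layer_logits, threshold=0.9):
    """Return (layer_index, token_id) for the first layer whose top
    probability reaches `threshold`; fall back to the final layer."""
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            return layer, top
    # No layer was confident enough: use the last layer's prediction.
    last = softmax(per_layer_logits[-1])
    return len(per_layer_logits) - 1, max(range(len(last)), key=last.__getitem__)

# Hypothetical logits for a 3-token vocabulary at 4 layers (illustrative only).
logits_by_layer = [
    [0.1, 0.2, 0.0],
    [1.0, 0.5, 0.0],
    [4.0, 0.5, 0.0],
    [6.0, 0.5, 0.0],
]
layer, token = early_exit_layer(logits_by_layer)
```

With these toy values the third layer (index 2) is the first to clear the threshold, so the last layer would be skipped. A model with less layer redundancy would reach that confidence later, which is the effect the abstract describes.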
57. 【2603.23678】PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation
Link: https://arxiv.org/abs/2603.23678
Authors: Manjushree B. Aithal (Ph.D.), Alexander Kotz, James Mitchell (Ph.D.)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, offer transformative solutions, data privacy constraints, strict data privacy
Comments: 10 pages, 2 figures, under review at AMIA Symposium
Abstract:Large Language Models (LLMs) offer transformative solutions across many domains, but healthcare integration is hindered by strict data privacy constraints. Clinical narratives are dense with ambiguous acronyms, and misinterpreting these abbreviations can precipitate severe outcomes such as life-threatening medication errors. While cloud-dependent LLMs excel at Acronym Disambiguation, transmitting Protected Health Information to external servers violates privacy frameworks. To bridge this gap, this study pioneers the evaluation of small-parameter models deployed entirely on-device to ensure privacy preservation. We introduce a privacy-preserving cascaded pipeline leveraging general-purpose local models to detect clinical acronyms, routing them to domain-specific biomedical models for context-relevant expansions. Results reveal that while general instruction-following models achieve high detection accuracy (~0.988), their expansion capabilities plummet (~0.655). Our cascaded approach utilizes domain-specific medical models to increase expansion accuracy to ~0.81. This novel work demonstrates that privacy-preserving, on-device (2B-10B) models deliver high-fidelity clinical acronym disambiguation support.
58. 【2603.23659】Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges
Link: https://arxiv.org/abs/2603.23659
Authors: Weilun Xu, Alexander Rusnak, Frederic Kaplan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: single acceptability dimension, large language models, language models make, make ethical judgments, internal representations distinguish
Comments:
Abstract:When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B--72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns -- e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.
59. 【2603.23654】Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages
Link: https://arxiv.org/abs/2603.23654
Authors: Badr M. Abdullah, Israel Abebe Azime, Atnafu Lambebo Tonja, Jesujoba O. Alabi, Abel Mulat Alemu, Eyob G. Hagos, Bontu Fufa Balcha, Mulubrhan A. Nerea, Debela Desalegn Yadeta, Dagnachew Mekonnen Marilign, Amanuel Temesgen Fentahun, Tadesse Kebede, Israel D. Gebru, Michael Melese Woldeyohannis, Walelign Tewabe Sewunetie, Bernd Möbius, Dietrich Klakow
Categories: Computation and Language (cs.CL)
Keywords: automatic speech recognition, CTC-based automatic speech, multilingual CTC-based automatic, models jointly trained, Ethiopian languages
Comments: Preprint (under review)
Abstract:We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. These languages belong to the Semitic, Cushitic, and Omotic branches of the Afroasiatic family, and remain severely underrepresented in speech technology despite being spoken by the vast majority of Ethiopia's population. We train our models on the recently released WAXAL corpus using several pre-trained speech encoders and evaluate against strong multilingual baselines, including OmniASR. Our best model achieves an average WER of 30.48% on the WAXAL test set, outperforming the best OmniASR model with substantially fewer parameters. We further provide a comprehensive analysis of gender bias, the contribution of vowel length and consonant gemination to ASR errors, and the training dynamics of multilingual CTC models. Our models and codebase are publicly available to the research community.
60. 【2603.23646】Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks
Link: https://arxiv.org/abs/2603.23646
Authors: Fatih Uenal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: academic legal reasoning, Swiss regulatory compliance, applied Swiss regulatory, benchmarked large language, Swiss regulatory
Comments: 21 pages, 5 figures, 7 tables. Code and data: [this https URL](https://github.com/FUenal/swiss-bench)
Abstract:While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory QA, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.
61. 【2603.23626】A Theory of LLM Information Susceptibility
Link: https://arxiv.org/abs/2603.23626
Authors: Zhuo-Yang Song, Hua Xing Zhu
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO)
Keywords: remain poorly understood, LLM-mediated improvement remain, improvement remain poorly, Large language models, poorly understood
Comments: 16 pages, 9 figures
Abstract:Large language models (LLMs) are increasingly deployed as optimization modules in agentic systems, yet the fundamental limits of such LLM-mediated improvement remain poorly understood. Here we propose a theory of LLM information susceptibility, centred on the hypothesis that when computational resources are sufficiently large, the intervention of a fixed LLM does not increase the performance susceptibility of a strategy set with respect to budget. We develop a multi-variable utility-function framework that generalizes this hypothesis to architectures with multiple co-varying budget channels, and discuss the conditions under which co-scaling can exceed the susceptibility bound. We validate the theory empirically across structurally diverse domains and model scales spanning an order of magnitude, and show that nested, co-scaling architectures open response channels unavailable to fixed configurations. These results clarify when LLM intervention helps and when it does not, demonstrating that tools from statistical physics can provide predictive constraints for the design of AI systems. If the susceptibility hypothesis holds generally, the theory suggests that nested architectures may be a necessary structural condition for open-ended agentic self-improvement.
62. 【2603.23625】Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework
Link: https://arxiv.org/abs/2603.23625
Authors: Zeinab Dehghani, Rameez Raja Kureshi, Koorosh Aslansefat, Faezeh Alsadat Abedi, Dhavalkumar Thakker, Lisa Greaves, Bhupesh Kumar Mishra, Baseer Ahmad, Tanaya Maslekar
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: reduce administrative workload, Artificial intelligence, Home Smart Speaker, Smart Speaker designed, Care Home Smart
Comments:
Abstract:Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety-focused evaluation framework is presented that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety-critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with zero missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.
63. 【2603.23624】Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths
Link: https://arxiv.org/abs/2603.23624
Authors: Amani Maina-Kilaas, Roger Levy
Categories: Computation and Language (cs.CL)
Keywords: longer ambiguous regions, structural commitments strengthen, ambiguous regions, strengthen over time, disambiguation difficulty increases
Comments: 8 pages, 5 figures
Abstract:Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing -- or an artifact of wrap-up processes or methodological confounds -- remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items -- the cleaner test of real-time processing -- show reverse trends consistent with neural model predictions.
64. 【2603.23611】LLMORPH: Automated Metamorphic Testing of Large Language Models
Link: https://arxiv.org/abs/2603.23611
Authors: Steven Cho, Stefano Ruberto, Valerio Terragni
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, reliability of Large, verifying output correctness, output correctness remains
Comments: Accepted for publication in the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025). This arXiv version is the authors' accepted manuscript. DOI: [https://doi.org/10.1109/ASE63991.2025.00385](https://doi.org/10.1109/ASE63991.2025.00385) Code: [this http URL](http://github.com/steven-b-cho/llmorph)
Abstract:Automated testing is essential for evaluating and improving the reliability of Large Language Models (LLMs), yet the lack of automated oracles for verifying output correctness remains a key challenge. We present LLMORPH, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages Metamorphic Testing (MT) to uncover faulty behaviors without relying on human-labeled data. MT uses Metamorphic Relations (MRs) to generate follow-up inputs from source test input, enabling detection of inconsistencies in model outputs without the need of expensive labelled data. LLMORPH is aimed at researchers and developers who want to evaluate the robustness of LLM-based NLP systems. In this paper, we detail the design, implementation, and practical usage of LLMORPH, demonstrating how it can be easily extended to any LLM, NLP task, and set of MRs. In our evaluation, we applied 36 MRs across four NLP benchmarks, testing three state-of-the-art LLMs: GPT-4, LLAMA3, and HERMES 2. This produced over 561,000 test executions. Results demonstrate LLMORPH's effectiveness in automatically exposing inconsistencies.
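The metamorphic-testing loop described in this abstract can be sketched in a few lines. This is an illustration only, not the LLMORPH API: `toy_sentiment` is a hypothetical stand-in for an LLM call (made deliberately case-sensitive so the test has something to find), and `mr_uppercase` is one assumed metamorphic relation.

```python
def toy_sentiment(text):
    """Stand-in for an LLM call (hypothetical); deliberately case-sensitive."""
    return "positive" if "good" in text else "negative"

def mr_uppercase(text):
    """Metamorphic relation: upper-casing the input should not change the label."""
    return text.upper()

def run_metamorphic_test(model, transform, inputs):
    """Return the source inputs whose follow-up output differs from the source output."""
    violations = []
    for source in inputs:
        follow_up = transform(source)
        # No ground-truth label is needed: only consistency between the two outputs.
        if model(source) != model(follow_up):
            violations.append(source)
    return violations

sources = ["a good movie", "a dull movie"]
violations = run_metamorphic_test(toy_sentiment, mr_uppercase, sources)
```

Here the first input is flagged because the toy model changes its answer after upper-casing, while the second passes. No human-labeled data is involved, which is the property the abstract highlights.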
65. 【2603.23577】The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations
Link: https://arxiv.org/abs/2603.23577
Authors: Long Zhang, Dai-jun Lin, Wei-neng Chen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Large language models, continuous semantic spaces, Large language, logical reasoning demands, generalize smoothly
Comments:
Abstract:Large language models (LLMs) generalize smoothly across continuous semantic spaces, yet strict logical reasoning demands the formation of discrete decision boundaries. Prevailing theories relying on linear isometric projections fail to resolve this fundamental tension. In this work, we argue that task context operates as a non-isometric dynamical operator that enforces a necessary "topological distortion." By applying Gram-Schmidt decomposition to residual-stream activations, we reveal a dual-modulation mechanism driving this process: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to forge logical boundaries. We validate this geometric evolution across a gradient of tasks, from simple mapping to complex primality testing. Crucially, targeted specific vector ablation establishes a strict causal binding between this topology and model function: algebraically erasing the divergence component collapses parity classification accuracy from 100% to chance levels (38.57%). Furthermore, we uncover a three-phase layer-wise geometric dynamic and demonstrate that under social pressure prompts, models fail to generate sufficient divergence. This results in a "manifold entanglement" that geometrically explains sycophancy and hallucination. Ultimately, our findings revise the linear-isometric presumption, demonstrating that the emergence of discrete logic in LLMs is purchased at an irreducible cost of topological deformation.
66. 【2603.23539】PLDR-LLMs Reason At Self-Organized Criticality
Link: https://arxiv.org/abs/2603.23539
Authors: Burc Gokden
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
Keywords: self-organized criticality exhibit, PLDR-LLM deductive outputs, deductive outputs, criticality exhibit reasoning, PLDR-LLMs pretrained
Comments:
Abstract:We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality are similar to second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady state behaviour suggests that deductive outputs learn representations equivalent to scaling functions, universality classes and renormalization groups from the training dataset, leading to generalization and reasoning capabilities in the process. We can then define an order parameter from the global statistics of the model's deductive output parameters at inference. The reasoning capabilities of a PLDR-LLM are better when its order parameter is close to zero at criticality. This observation is supported by the benchmark scores of the models trained at near-criticality and sub-criticality. Our results provide a self-contained explanation of how reasoning manifests in large language models, and the ability to reason can be quantified solely from global model parameter values of the deductive outputs at steady state, without any need for evaluation of curated benchmark datasets through inductive output for reasoning and comprehension.
67. 【2603.23534】Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings
Link: https://arxiv.org/abs/2603.23534
Authors: Abass Oguntade
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: social media text, Polarization Shared Task, addresses polarization detection, Polarization Shared, Shared Task
Comments:
Abstract:This paper describes my submission to the Polarization Shared Task at SemEval-2025, which addresses polarization detection and classification in social media text. I develop Transformer-based systems for English and Swahili across three subtasks: binary polarization detection, multi-label target type classification, and multi-label manifestation identification. The approach leverages multilingual and African language-specialized models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large), class-weighted loss functions, iterative stratified data splitting, and per-label threshold tuning to handle severe class imbalance. The best configuration, mDeBERTa-v3-base, achieves 0.8032 macro-F1 on validation for binary detection, with competitive performance on multi-label tasks (up to 0.556 macro-F1). Error analysis reveals persistent challenges with implicit polarization, code-switching, and distinguishing heated political discourse from genuine polarization.
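Per-label threshold tuning, one of the imbalance-handling steps named in this abstract, can be sketched as a grid search over validation probabilities. This is a minimal sketch under assumptions: the grid, the helper names `f1` and `tune_thresholds`, and the toy data are all hypothetical and not drawn from the paper.

```python
def f1(y_true, y_pred):
    """Binary F1 score for two equal-length lists of 0/1 values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_thresholds(y_true, y_prob, grid=None):
    """Pick, for each label column, the threshold on `grid` that maximizes
    F1 on the given validation rows (ties go to the lowest threshold)."""
    grid = grid or [i / 20 for i in range(1, 20)]
    thresholds = []
    for j in range(len(y_true[0])):
        truth = [row[j] for row in y_true]
        probs = [row[j] for row in y_prob]
        best = max(grid, key=lambda t: f1(truth, [int(p >= t) for p in probs]))
        thresholds.append(best)
    return thresholds

# Toy validation set with two labels; label 0 is rare and under-confident.
val_true = [[1, 1], [0, 1], [0, 0], [0, 1]]
val_prob = [[0.30, 0.9], [0.10, 0.6], [0.20, 0.4], [0.05, 0.7]]
thresholds = tune_thresholds(val_true, val_prob)
```

In this toy data a fixed 0.5 cutoff would never predict the rare label, whereas a tuned cutoff near 0.25 recovers it. In practice the thresholds should be chosen on validation data and then frozen before scoring the test set.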
68. 【2603.23533】MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG
Link: https://arxiv.org/abs/2603.23533
Authors: Bhavik Mangla
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: RAG pipelines typically, requires multiple LLM, ignores document structure, multiple LLM calls, pipelines typically rely
Comments: 13 pages, 4 figures, 7 tables, 2 algorithms. Code: [this https URL](https://github.com/bhavik-mangla/MDKeyChunker)
Abstract:RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
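For readers unfamiliar with the two retrieval metrics quoted in this abstract, the sketch below shows one common way to compute Recall@k and MRR. The exact definitions used in the paper may differ, and the chunk IDs and queries are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of a query's relevant items that appear in its top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    Returns (mean Recall@k, mean reciprocal rank)."""
    n = len(runs)
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return mean_recall, mrr

# Two hypothetical queries with made-up chunk IDs.
runs = [
    (["c3", "c1", "c7"], {"c1"}),   # relevant chunk at rank 2
    (["c2", "c9"], {"c2"}),         # relevant chunk at rank 1
]
mean_recall, mrr = evaluate(runs)
```

Both relevant chunks are retrieved within the top five, so Recall@5 is perfect, but the first query finds its chunk only at rank 2, which lowers MRR. This is how a configuration can report Recall@5=1.000 alongside an MRR below 1.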
69. 【2603.23532】Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs
Link: https://arxiv.org/abs/2603.23532
Authors: Satya Sri Rajiteswari Nimmagadda, Ethan Young, Niladri Sengupta, Ananya Jana, Aniruddha Maiti
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: paper investigates, investigates whether structured, structured representations, representations can preserve, preserve the meaning
Comments: Accepted to the 21st International Conference on Semantic Computing (IEEE ICSC 2026)
Abstract:This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the original text. Comparing the original and reconstructed sentences using semantic and lexical similarity, we show that hierarchical formats are capable of effectively retaining the information in scientific texts.
70. 【2603.23531】Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction
Link: https://arxiv.org/abs/2603.23531
Authors: Özgür Togay, Florian Kunneman, Javier Garcia-Bernardo, Anastasia Giachanou
Categories: Computation and Language (cs.CL)
Keywords: Political polarization emerges, polarization emerges, Large Language Models, Political, Political polarization
Comments:
Abstract:Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.
71. 【2603.23530】Did You Forget What I Asked? Prospective Memory Failures in Large Language Models
Link: https://arxiv.org/abs/2603.23530
Authors: Avni Mittal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large language models, simultaneously perform demanding, Large language, perform demanding tasks, fail to satisfy
Comments:
Abstract:Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
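The "deterministic programmatic checkers" mentioned in this abstract can be as simple as string predicates. The sketch below illustrates one terminal constraint and one avoidance constraint. The specific constraints (an `END` marker, a forbidden word) and all function names are hypothetical examples, not the paper's actual checkers.

```python
import re

def ends_with_marker(response, marker="END"):
    """Terminal constraint: the response must end with `marker`."""
    return response.rstrip().endswith(marker)

def avoids_word(response, forbidden="very"):
    """Avoidance constraint: `forbidden` must not appear as a whole word."""
    pattern = rf"\b{re.escape(forbidden)}\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is None

def compliance_rate(responses, checker):
    """Share of responses that satisfy a given checker."""
    return sum(1 for r in responses if checker(r)) / len(responses)

# Made-up model responses for illustration.
responses = ["The answer is 42.\nEND", "The answer is 42."]
rate = compliance_rate(responses, ends_with_marker)
```

Because each checker is a pure function of the response text, compliance can be measured reproducibly across thousands of prompts without an LLM-as-judge component.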
72. 【2603.23529】Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language
Link: https://arxiv.org/abs/2603.23529
Authors: Reuben Chagas Fernandes, Gaurang S. Patkar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, low-resource linguistic contexts, consistently under perform, perform in low-resource
Comments:
Abstract:Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines.
73. 【2603.23528】The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression
Link: https://arxiv.org/abs/2603.23528
Authors: Warren Johnson
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, global carbon emissions, solve climate challenges, proliferation of Large
Comments: 16 pages, 5 figures, 5 tables. Includes data/code availability, ethics statement, and competing interests
Abstract:The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r in {1.0, 0.7, 0.5, 0.3}). Energy is estimated with a token-based proxy calibrated against local direct measurements, and quality is tracked with benchmark pass rates. Compression produced substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy behavior. DeepSeek exhibited output expansion under compression (21 to 798 tokens at r=0.3), corresponding to energy increases up to +2,140%, while GPT-4o-mini showed mixed effects including a reduction at r=0.5. These results indicate that input-token reduction alone is not a reliable energy optimization strategy in production inference. For the evaluated settings, model selection and output-length control provided more consistent energy-quality tradeoffs than prompt compression.
74. 【2603.23527】Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression
Link: https://arxiv.org/abs/2603.23527
Authors: Warren Johnson
Categories: Computation and Language (cs.CL)
Keywords: total inference cost, real deployment impact, deployment impact depends, input-token reduction, inference cost
Comments: 19 pages. Includes figures and tables. Companion code/data repository and direct NVML calibration dataset are cited in manuscript
Abstract:Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results show a strong benchmark effect: under r=0.3, DeepSeek exhibits severe output expansion on MBPP (56x, Psi approx 0.15) but substantially lower expansion on HumanEval (5x, Psi approx 0.72), while GPT-4o-mini is comparatively stable across benchmarks. This reconciles the apparent discrepancy between previously reported extreme explosion and lower replication effects by identifying prompt structure, not provider identity alone, as the primary moderator. We introduce the Compression Robustness Index (CRI) for cross-benchmark evaluation and show that single-benchmark assessments can produce misleading conclusions about compression safety and efficiency. To contextualize energy claims, we incorporate companion direct NVML measurements from rented RunPod GPUs and show that token savings can overstate joule savings. These findings motivate benchmark-diverse testing and structure-aware compression policies for reliable, energy-conscious LLM deployment.
75. 【2603.23526】Plato's Cave: A Human-Centered Research Verification System
Link: https://arxiv.org/abs/2603.23526
Authors: Matheus Kunzler Maldaner,Raul Valle,Junsung Kim,Tonuka Sultan,Pranav Bhargava,Matthew Maloni,John Courtney,Hoang Nguyen,Aamogh Sawant,Kristian O'Connor,Stephen Wormald,Damon L. Woodard
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Keywords: assess writing quality, identify unverifiable claims, growing publication rate, fact-check information, assess writing
Comments: 15 pages, 4 figures
Click to view abstract
Abstract:The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato's Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the paper's argumentative structure. We report the system implementation and results on a collected dataset of 104 research papers.
76. 【2603.23525】Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial
Link: https://arxiv.org/abs/2603.23525
Authors: Warren Johnson,Charles Lee
Categories: Computation and Language (cs.CL)
Keywords: successful Claude Sonnet, prompt compression depend, times higher, typically priced, priced several times
Comments: 28 pages, 9 tables, 1 CONSORT figure; pre-registered randomized controlled trial on production orchestration prompts
Click to view abstract
Abstract:The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input+output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy-tailed uncertainty. Recency-weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost-similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that "compress more" is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.
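To make the cost argument in this abstract concrete, here is a minimal sketch of total inference cost as input cost plus output cost. Only the idea that output tokens are priced several times higher than input tokens comes from the abstract; the prices, token counts, expansion factor, and the names `total_cost` and `relative_change` are hypothetical and chosen purely for illustration.

```python
def total_cost(input_tokens, output_tokens, price_in, price_out):
    """Total inference cost = input cost + output cost (prices are per token)."""
    return input_tokens * price_in + output_tokens * price_out


def relative_change(new, old):
    """Fractional change of `new` relative to `old` (negative means savings)."""
    return (new - old) / old


# Hypothetical prices: an output token costs 5x an input token (assumption).
PRICE_IN = 1.0
PRICE_OUT = 5.0

# Hypothetical uncompressed run: 1,000 input tokens, 400 output tokens.
baseline = total_cost(1000, 400, PRICE_IN, PRICE_OUT)

# Retention r=0.5 with unchanged output length (assumption).
moderate = total_cost(500, 400, PRICE_IN, PRICE_OUT)

# Retention r=0.2, but the response grows from 400 to 580 tokens (assumption).
aggressive = total_cost(200, 580, PRICE_IN, PRICE_OUT)

print(f"moderate:   {relative_change(moderate, baseline):+.1%}")
print(f"aggressive: {relative_change(aggressive, baseline):+.1%}")
```

Under these made-up numbers, halving the prompt lowers total cost, while the more aggressive setting removes 80% of the input and still ends up more expensive than the baseline, because the extra output tokens are billed at the higher rate.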
77. 【2603.23524】Navigating the Concept Space of Language Models
Link: https://arxiv.org/abs/2603.23524
Authors: Wilson E. Marcílio-Jr,Danilo M. Eler
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language model, language model activations, model activations output, activations output thousands, Sparse autoencoders
Comments:
Click to view abstract
Abstract:Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on interested concepts, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.
78. 【2603.23523】Do 3D Large Language Models Really Understand 3D Spatial Relationships?
Link: https://arxiv.org/abs/2603.23523
Authors: Xianzheng Ma,Tao Sun,Shuai Chen,Yash Bhalgat,Jindong Gu,Angel X Chang,Iro Armeni,Iro Laina,Songyou Peng,Victor Adrian Prisacariu
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: claim to understand, Recent, Large-Language Models, Abstract, worlds
Comments: ICLR 2026
Click to view abstract
Abstract:Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect if the model exploits textual shortcuts rather than engages in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides model to rely more on 3D visual clues, substantially enhancing 3D-LLMs performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: this https URL.
79. 【2603.23522】Qworld: Question-Specific Evaluation Criteria for LLMs
Link: https://arxiv.org/abs/2603.23522
Authors: Shanghua Gao,Yuchang Su,Pengwei Sui,Curtis Ginder,Marinka Zitnik
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Evaluating large language, large language models, response quality depends, Evaluating large, language models
Comments:
Click to view abstract
Abstract:Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
80. 【2603.23521】Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages
Link: https://arxiv.org/abs/2603.23521
Authors: Shaharukh Khan,Ali Faraz,Abhinav Ravi,Mohd Nauman,Mohd Sarfraz,Akshat Patidar,Raja Kolla,Chandra Khatri,Shubham Agarwal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal research, single-image reasoning, research has predominantly, predominantly focused, focused on single-image
Comments: Accepted at "CVPR 2025: Workshop Vision Language Models For All"
Click to view abstract
Abstract:Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
81. 【2603.23520】From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM
Link: https://arxiv.org/abs/2603.23520
Authors: Chanyong Luo,Jirui Dai,Zhendong Wang,Kui Chen,Jiaxi Yang,Bingjie Lu,Jing Wang,Jiaxin Hao,Bing Li,Ruiyang He,Yiyu Qiao,Chenkai Zhang,Kaiyu Wang,Zhi Liu,Zeyu Zheng,Yan Li,Xiaohong Gu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: empirical discipline refined, high-variance reality, empirical discipline, discipline refined, refined through long-term
Comments:
Click to view abstract
Abstract:Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians' knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large language models to systematically learn and transfer distinguished physicians' diagnostic-and-therapeutic philosophy and case-dependent adaptation rules in a standardized way. Built on Tianyi, Med-Shicheng consists of five stages. We target five National Masters of Chinese Medicine or distinguished TCM physicians, curate multi-source materials, and train a single model to internalize all five knowledge systems across seven tasks, including etiology-pathogenesis analysis, syndrome diagnosis, treatment principle selection, prescription generation, prescription explanation, symptom evolution with regimen adjustment, and clinical advice. Implemented on Qwen2.5-1.5B-Base, Med-Shicheng runs on resource-constrained GPUs while achieving performance comparable to DeepSeek-R1 and GPT-5. We also examine the reliability of LLM-as-a-judge versus physician evaluation: automated judging tracks overall trends but shows bias on fine-grained individualized distinctions, highlighting the need for physician involvement when ground truth is unavailable and for domain-adapted judge models.
82. 【2603.23519】MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
Link: https://arxiv.org/abs/2603.23519
Authors: Lin Yang,Yuancheng Yang,Xu Wang,Changkun Liu,Haihua Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, demonstrated impressive capabilities, Language Models, demonstrated impressive
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical-related benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds), covering 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at this https URL
83. 【2603.23518】Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents
Link: https://arxiv.org/abs/2603.23518
Authors: Peijun Qing,Puneet Mathur,Nedim Lipka,Varun Manjunatha,Ryan Rossi,Franck Dernoncourt,Saeed Hassanpour,Soroush Vosoughi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: recognizing semantic similarities, General-purpose embedding models, General-purpose embedding, embedding models excel, excel at recognizing
Comments:
Click to view abstract
Abstract:General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and then infer the corresponding latent groupings. To evaluate this paradigm, we introduce ReasonCluster, a comprehensive benchmark comprising 28 diverse tasks spanning daily dialogue, legal cases, and financial reports. Experiments across diverse datasets and clustering scenarios show that our approach consistently outperforms strong embedding-based methods and LRM baselines, demonstrating that explicit reasoning fosters more faithful and interpretable instruction-based clustering.
84. 【2603.23517】Beyond Accuracy: Introducing a Symbolic-Mechanistic Approach to Interpretable Evaluation
Link: https://arxiv.org/abs/2603.23517
Authors: Reza Habibi,Darian Lee,Magy Seif El-Nasr
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Keywords: reliably distinguish genuine, distinguish genuine generalization, brittle heuristics, small-data regimes, Accuracy-based evaluation
Comments:
Click to view abstract
Abstract:Accuracy-based evaluation cannot reliably distinguish genuine generalization from shortcuts like memorization, leakage, or brittle heuristics, especially in small-data regimes. In this position paper, we argue for mechanism-aware evaluation that combines task-relevant symbolic rules with mechanistic interpretability, yielding algorithmic pass/fail scores that show exactly where models generalize versus exploit patterns. We demonstrate this on NL-to-SQL by training two identical architectures under different conditions: one without schema information (forcing memorization), one with schema (enabling grounding). Standard evaluation shows the memorization model achieves 94% field-name accuracy on unseen data, falsely suggesting competence. Our symbolic-mechanistic evaluation reveals this model violates core schema generalization rules, a failure invisible to accuracy metrics.
85. 【2603.23516】MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
Link: https://arxiv.org/abs/2603.23516
Authors: Yu Chen,Runkai Chen,Sheng Yi,Xinda Zhao,Xiaohong Li,Jianjin Zhang,Jun Sun,Chuanrui Hu,Yunyun Han,Lidong Bing,Yafeng Deng,Tianqiao Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: memory, Long-term memory, Memory Sparse Attention
Comments:
Click to view abstract
Abstract:Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
86. 【2603.23515】Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data
Link: https://arxiv.org/abs/2603.23515
Authors: John Cook,Michael Wyatt,Peng Wei,Iris Chin,Santosh Gupta,Van Zyl Van Vuuren,Richie Siburian,Amanda Spicer,Kristen Viviano,Alda Cami,Raunaq Malhotra,Zhewei Yao,Jeff Rasley,Gaurav Kaushik
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: revenue cycle processes, reduces clinician burnout, Improving the accuracy, coding reduces clinician, supports revenue cycle
Comments: 20 pages, 6 figures
Click to view abstract
Abstract:Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.
87. 【2603.23514】DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models
Link: https://arxiv.org/abs/2603.23514
Authors: Alexander Sheppert
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, answering general questions, Large Language, competent when answering, answering general
Comments:
Click to view abstract
Abstract:Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without requiring pre-constructed test sets or domain-specific expertise. DepthCharge results are relative to the evaluator model used for answer checking, making the framework a tool for comparative evaluation rather than absolute accuracy certification. Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, and Quantum Computing) with five frontier models demonstrates that DepthCharge reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations, and model rankings vary substantially by domain, with no single model dominating all areas. Cost-performance analysis further reveals that expensive models do not always achieve deeper knowledge, suggesting that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.
88. 【2603.23513】Berta: an open-source, modular tool for AI-enabled clinical documentation
Link: https://arxiv.org/abs/2603.23513
Authors: Samridhi Vaid,Mike Weldon,Jesse Dunn,Sacha Davis,Kevin Lonergan,Henry Li,Jeffrey Franc,Mohamed Abdalla,Daniel C. Baumgart,Jake Hayward,J Ross Mitchell
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: limiting organizational control, Alberta Health Services, quality improvement, Data Cloud infrastructure, operate as opaque
Comments:
Click to view abstract
Abstract:Commercial AI scribes cost $99-600 per physician per month, operate as opaque systems, and do not return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. We developed Berta, an open-source modular scribe platform for AI-enabled clinical documentation, and deployed a customized implementation within Alberta Health Services (AHS) integrated with their existing Snowflake AI Data Cloud infrastructure. The system combines automatic speech recognition with large language models while retaining all clinical data within the secure AHS environment. During eight months (November 2024 to July 2025), 198 emergency physicians used the system in 105 urban and rural facilities, generating 22148 clinical sessions and more than 2800 hours of audio. Usage grew from 680 to 5530 monthly sessions. Operating costs averaged less than $30 per physician per month, a 70-95% reduction compared to commercial alternatives. AHS has since approved expansion to 850 physicians. This is the first provincial-scale deployment of an AI scribe integrated with existing health system infrastructure. By releasing Berta as open source, we provide a reproducible, cost-effective alternative that health systems can adapt to their own secure environments, supporting data sovereignty and informed evaluation of AI documentation technology.
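The 70-95% reduction quoted in this abstract can be checked against the other figures it gives. The short sketch below treats the stated "less than $30" as an upper bound of 30 and compares it with both ends of the 99-600 commercial price range; the function name `percent_reduction` is an illustrative assumption.

```python
def percent_reduction(new_cost, old_cost):
    """Percentage by which new_cost is lower than old_cost."""
    return (old_cost - new_cost) / old_cost * 100


# Figures quoted in the abstract (USD per physician per month).
BERTA_COST_UPPER = 30                       # "less than $30"
COMMERCIAL_LOW, COMMERCIAL_HIGH = 99, 600   # commercial scribe price range

# Reduction relative to the cheapest and the most expensive commercial tier.
low = percent_reduction(BERTA_COST_UPPER, COMMERCIAL_LOW)
high = percent_reduction(BERTA_COST_UPPER, COMMERCIAL_HIGH)

print(f"{low:.1f}% to {high:.1f}%")
```

The result is roughly 70% at the low end and 95% at the high end, which matches the range reported in the abstract.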
89. 【2603.23512】S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering
Link: https://arxiv.org/abs/2603.23512
Authors: Rong Fu,Yemin Wang,Tianxiang Xu,Yongtai Liu,Weizhi Tang,Wangyu Wu,Xiaowen Ma,Simon Fong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: shortest-path Retrieval-Augmented Generation, Retrieval-Augmented Generation framework, Generation framework designed, semantic-aware shortest-path Retrieval-Augmented, large knowledge graphs
Comments:
Click to view abstract
Abstract:We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted k-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a contrastive path encoder and lightweight verifier, and injecting a compact soft mixture of selected path latents into a language model via cross-attention. The system runs inside an iterative Neural-Socratic Graph Dialogue loop in which concise diagnostic messages produced by the language model are mapped to targeted graph edits or seed expansions, enabling adaptive retrieval when the model expresses uncertainty. This combination yields a retrieval mechanism that is both token-efficient and topology-aware while preserving interpretable path-level traces for diagnostics and intervention. We validate S-Path-RAG on standard multi-hop KGQA benchmarks and through ablations and diagnostic analyses. The results demonstrate consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines. We further analyze trade-offs between semantic weighting, verifier filtering, and iterative updates, and report practical recommendations for deployment under constrained compute and token budgets.
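For readers unfamiliar with bounded-length path enumeration, the sketch below shows one generic way to list the k cheapest simple paths between two nodes under a hop limit, using best-first search over a weighted graph. It is not the S-Path-RAG method: the abstract describes a hybrid strategy with learned semantic weights, beam search, and constrained random walks, none of which is reproduced here. The function name, the toy graph, and its edge weights are all hypothetical.

```python
import heapq


def k_cheapest_bounded_paths(graph, source, target, k, max_hops):
    """Return up to k lowest-cost simple paths from source to target that use
    at most max_hops edges. Assumes non-negative edge weights."""
    # Each heap entry is (cost so far, path as a list of nodes).
    heap = [(0.0, [source])]
    results = []
    while heap and len(results) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            results.append((cost, path))
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted, do not extend this path
        for neighbor, weight in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                heapq.heappush(heap, (cost + weight, path + [neighbor]))
    return results


# Hypothetical toy graph: node -> list of (neighbor, edge weight).
TOY_GRAPH = {
    "Q": [("A", 1.0), ("B", 2.0)],
    "A": [("T", 2.5), ("B", 0.5)],
    "B": [("T", 1.0)],
}

for cost, path in k_cheapest_bounded_paths(TOY_GRAPH, "Q", "T", k=2, max_hops=3):
    print(cost, " -> ".join(path))
```

Because partial paths are expanded in order of accumulated cost and weights are non-negative, complete paths are emitted in non-decreasing cost order. Tightening `max_hops` excludes the cheap three-hop route in the toy graph and changes which paths are returned.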
90. 【2603.23511】DISCO: Document Intelligence Suite for COmparative Evaluation
Link: https://arxiv.org/abs/2603.23511
Authors: Kenza Benkirane,Dan Goldwater,Martin Asenov,Aneiss Ghodsi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: intelligence requires accurate, Document intelligence requires, requires accurate text, accurate text extraction, Document Intelligence Suite
Comments: Accepted at the ICLR 2026 Workshop on Multimodal Intelligence (MMIntelligence). 10 pages, 7 figures
Click to view abstract
Abstract:Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.
91. 【2603.23510】Visuospatial Perspective Taking in Multimodal Language Models
Link: https://arxiv.org/abs/2603.23510
Authors: Jonathan Prunty,Seraphina Zhang,Patrick Quinn,Jianxun Lian,Xing Xie,Lucy Cheke
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: multimodal language models, language models, multimodal language, crucial to evaluate, Rotating Figure Task
Comments:
Click to view abstract
Abstract:As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
92. 【2603.23509】Internal Safety Collapse in Frontier Large Language Models
Link: https://arxiv.org/abs/2603.23509
Authors: Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: Internal Safety Collapse, continuously generate harmful, Safety Collapse, critical failure mode, large language models
Comments: 15 pages of the main text, qualitative examples of jailbreaks may be harmful in nature
Abstract:This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: this https URL
93. 【2603.23508】Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
Link: https://arxiv.org/abs/2603.23508
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: complex source materials, document-centric assistants, source materials, increasingly deployed, deployed in enterprise
Comments:
Abstract:Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated passages. We present the design of a real-time verification component integrated into a production RAG pipeline that enables full-document grounding under latency constraints. The system processes documents up to 32K tokens and employs adaptive inference strategies to balance response time and verification coverage across workloads. We describe the architectural decisions, operational trade-offs, and evaluation methodology used to deploy the verifier, and show that full-context verification substantially improves detection of unsupported responses compared with truncated validation. Our experience highlights when long-context verification is necessary, why chunk-based checking often fails in real documents, and how latency budgets shape model design. These findings provide practical guidance for practitioners building reliable large-scale retrieval-augmented applications. (Model, benchmark, and code: this https URL)
94. 【2603.23507】Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes
Link: https://arxiv.org/abs/2603.23507
Authors: Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Diffusion Language Models, Masked Diffusion Language, Masked Diffusion, Language Models, Deletion-Insertion Diffusion language
Comments: Accepted at ICLR 2026
Abstract:While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) MASK tokens inherent to the paradigm, and 2) PAD tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by: 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) an intrinsic self-correction mechanism during generation due to insertion that dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over baselines of MDLMs and existing insertion-based LMs, in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
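The abstract mentions that the training objectives involve subsequence counting problems solved by dynamic programming. As a rough illustration only, here is the sequential textbook DP that counts how many ways one token sequence embeds in another as a subsequence. The abstract does not say which counting problem DID uses or how it is parallelized, so the function name and the exact problem below are assumptions, not the authors' algorithm.

```python
def count_subsequence_embeddings(source, target):
    """Count the ways `target` occurs as a (not necessarily contiguous)
    subsequence of `source`. Hypothetical helper for illustration."""
    # ways[j] = number of ways to form target[:j] from the tokens seen so far
    ways = [0] * (len(target) + 1)
    ways[0] = 1  # the empty target embeds in exactly one way
    for token in source:
        # Go right to left so each source token is used at most once per embedding
        for j in range(len(target), 0, -1):
            if target[j - 1] == token:
                ways[j] += ways[j - 1]
    return ways[-1]


print(count_subsequence_embeddings("rabbbit", "rabbit"))  # 3
```

The loop runs in O(len(source) * len(target)) time and O(len(target)) memory; it works on strings or on lists of token ids.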
95. 【2603.23506】Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
Link: https://arxiv.org/abs/2603.23506
Authors: Tianpeng Zheng, Zhehan Jiang, Jiayi Liu, Shicong Feng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: sound evaluation methods, psychometrically sound evaluation, large language models, proliferation of large, large language
Comments: 37 pages, 6 figures
Abstract:The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for efficient assessment of standardized medical knowledge in LLMs. The study comprises a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations and an empirical evaluation of 38 LLMs using a human-calibrated medical item bank. Each model completed both the full item bank and an adaptive test that dynamically selected items based on real-time ability estimates and terminated upon reaching a predefined reliability threshold (standard error = 0.3). Results show that CAT-derived proficiency estimates achieved a near-perfect correlation with full-bank estimates (r = 0.988) while using only 1.3 percent of the items. Evaluation time was reduced from several hours to minutes per model, with substantial reductions in token usage and computational cost, while preserving inter-model performance rankings. This work establishes a psychometric framework for rapid, low-cost benchmarking of foundational medical knowledge in LLMs. The proposed adaptive methodology is intended as a standardized pre-screening and continuous monitoring tool and is not a substitute for real-world clinical validation or safety-oriented prospective studies.
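To make the adaptive-testing loop described above concrete, here is a minimal sketch. Only the stopping threshold (standard error of 0.3) comes from the abstract. The two-parameter logistic (2PL) model, maximum-information item selection, grid-based EAP ability estimate, and all function names are assumptions chosen for illustration; the paper may use a different configuration.

```python
import math


def prob_correct(theta, a, b):
    """2PL probability of a correct answer (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Item information at ability theta under the 2PL model."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def eap_estimate(responses, grid):
    """Posterior mean and SD of ability under a standard normal prior.
    responses: list of (a, b, correct) tuples."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for a, b, correct in responses:
            p = prob_correct(theta, a, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


def run_adaptive_test(item_bank, answer_fn, se_threshold=0.3):
    """item_bank: list of (a, b); answer_fn(index) -> bool (did the model answer correctly?).
    Returns (ability estimate, standard error, number of items administered)."""
    grid = [i / 10.0 for i in range(-40, 41)]
    remaining = dict(enumerate(item_bank))
    responses = []
    theta, se = 0.0, 1.0  # start from the prior mean and SD
    while remaining and se > se_threshold:
        # Administer the unused item that is most informative at the current estimate
        idx = max(remaining, key=lambda i: fisher_information(theta, *remaining[i]))
        a, b = remaining.pop(idx)
        responses.append((a, b, answer_fn(idx)))
        theta, se = eap_estimate(responses, grid)
    return theta, se, len(responses)
```

In use, `answer_fn` would wrap a call to the LLM under evaluation and grade its response to item `idx`; the test ends either when the standard error drops to the threshold or when the bank is exhausted.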
96. 【2410.02064】Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Link: https://arxiv.org/abs/2410.02064
Authors: Christopher Ackerman, Nina Panickssery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: model, reported that LLMs, LLMs can recognize, vector, chat model
Comments: 10 pages, 13 figs, 2 tables, accepted as conference paper to ICLR 2025
Abstract:It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
97. 【2603.24298】SpinGQE: A Generative Quantum Eigensolver for Spin Hamiltonians
Link: https://arxiv.org/abs/2603.24298
Authors: Alexander Holden, Moinul Hossain Rahat, Nii Osae Osae Dade
Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL)
Keywords: condensed matter physics, spanning quantum chemistry, applications spanning quantum, ground state search, state search problem
Comments:
Abstract: The ground state search problem is central to quantum computing, with applications spanning quantum chemistry, condensed matter physics, and optimization. The Variational Quantum Eigensolver (VQE) has shown promise for small systems but faces significant limitations. These include barren plateaus, restricted ansatz expressivity, and reliance on domain-specific structure. We present SpinGQE, an extension of the Generative Quantum Eigensolver (GQE) framework to spin Hamiltonians. Our approach reframes circuit design as a generative modeling task. We employ a transformer-based decoder to learn distributions over quantum circuits that produce low-energy states. Training is guided by a weighted mean-squared error loss between model logits and circuit energies evaluated at each gate subsequence. We validate our method on the four-qubit Heisenberg model, demonstrating successful convergence to near-ground states. Through systematic hyperparameter exploration, we identify optimal configurations: smaller model architectures (12 layers, 8 attention heads), longer sequence lengths (12 gates), and carefully chosen operator pools yield the most reliable convergence. Our results show that generative approaches can effectively navigate complex energy landscapes without relying on problem-specific symmetries or structure. This provides a scalable alternative to traditional variational methods for general quantum systems. An open-source implementation is available at this https URL.
Information Retrieval
1. 【2603.24580】Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Link: https://arxiv.org/abs/2603.24580
Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: achieving sufficient reliability, expert usage remains, usage remains challenging, dense legal language, overlapping regulatory frameworks
Comments:
Abstract:Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
2. 【2603.24556】Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents
Link: https://arxiv.org/abs/2603.24556
Authors: Samuel Taiwo, Mohd Amaluddin Yusoff
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, constraints of Large, Language Models
Comments: Presented at CCSEIT 2026. This version matches the published proceedings
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.
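For readers unfamiliar with the first of the four strategies, a fixed-size sliding window can be sketched as follows. This is a generic illustration, not the paper's implementation: splitting on whitespace, the default window of 200 words, and the 50-word overlap are all assumptions, and the abstract does not state the sizes the authors used.

```python
def sliding_window_chunks(text, window_size=200, overlap=50):
    """Split text into fixed-size word windows that overlap by `overlap` words.
    Hypothetical helper; sizes are illustrative defaults."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    words = text.split()
    step = window_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the window has reached the end, to avoid a redundant tail chunk
        if start + window_size >= len(words):
            break
    return chunks


print(sliding_window_chunks("a b c d e f g h i j", window_size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

Because the window ignores headings, tables, and sentence boundaries, it is the natural baseline against which structure-aware chunking is compared.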
3. 【2603.24480】Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories
Link: https://arxiv.org/abs/2603.24480
Authors: Kawtar Zaher, Olivier Buisson, Alexis Joly
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: large unlabeled collections, Real-world fine-grained visual, minimal supervision, requires discovering, unlabeled collections
Comments:
Abstract:Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.
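One possible reading of "prioritizes near-boundary samples while favoring likely positives" is sketched below: candidates the classifier scores as positive are ranked by closeness to the decision boundary, and predicted negatives are used only if the batch is not yet full. The abstract does not give the exact PF-MA formula, so the 0.5 threshold, the fallback rule, and the function name are assumptions made for illustration.

```python
def positive_first_most_ambiguous(scores, batch_size):
    """scores: dict mapping candidate id -> estimated P(relevant).
    Returns up to `batch_size` ids to send for annotation.
    Hypothetical interpretation of the criterion, not the authors' code."""
    # Predicted positives, most ambiguous (closest to 0.5) first
    positives = sorted(
        (i for i, p in scores.items() if p >= 0.5),
        key=lambda i: scores[i] - 0.5,
    )
    # Predicted negatives, also most ambiguous first, used only as a fallback
    negatives = sorted(
        (i for i, p in scores.items() if p < 0.5),
        key=lambda i: 0.5 - scores[i],
    )
    return (positives + negatives)[:batch_size]


scores = {"a": 0.95, "b": 0.55, "c": 0.45, "d": 0.6, "e": 0.1}
print(positive_first_most_ambiguous(scores, 3))  # ['b', 'd', 'a']
```

A plain most-ambiguous criterion would pick `b`, `c`, and `d` here; the positive-first ordering swaps the likely negative `c` for the confident positive `a`, which is the asymmetry the abstract argues for in imbalanced settings.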
4. 【2603.24422】OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Link: https://arxiv.org/abs/2603.24422
Authors: Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Generative Retrieval, generative search framework, promising paradigm, paradigm for modern, modern search systems
Comments: Key codes are available at [this https URL](https://github.com/benchen4395/onesearch-family). Feel free to contact benchen4395@gmail.com
Abstract: Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
5. 【2603.24396】Exploring How Fair Model Representations Relate to Fair Recommendations
Link: https://arxiv.org/abs/2603.24396
Authors: Bjørnar Vassøy, Benjamin Kille, Helge Langseth
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: recent recommender system, recommender system research, system research targets, research targets mitigating, targets mitigating demographic
Comments: 17 pages
Abstract: One of the many fairness definitions pursued in recent recommender system research targets mitigating demographic information encoded in model representations. Models optimized for this definition are typically evaluated on how well demographic attributes can be classified given model representations, with the (implicit) assumption that this measure accurately reflects recommendation parity, i.e., how similar recommendations given to different users are. We challenge this assumption by comparing the amount of demographic information encoded in representations with various measures of how the recommendations differ. We propose two new approaches for measuring how well demographic information can be classified given ranked recommendations. Our results from extensive testing of multiple models on one real and multiple synthetically generated datasets indicate that optimizing for fair representations positively affects recommendation parity, but also that evaluation at the representation level is not a good proxy for measuring this effect when comparing models. We also provide extensive insight into how recommendation-level fairness metrics behave for various models by evaluating their performances on numerous generated datasets with different properties.
6. 【2603.24326】Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Link: https://arxiv.org/abs/2603.24326
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: resolution significantly impacts, significantly impacts performance, image resolution significantly, fine-grained task, Region Focus Module
Comments: Accepted by CVPR2026
Abstract: Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy among visual regions in document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at this https URL.
7. 【2603.24226】UniScale: Synergistic Entire Space Data and Model Scaling for Search Ranking
Link: https://arxiv.org/abs/2603.24226
Authors: Liren Yu, Caiyuan Li, Feiyi Dong, Tao Zhang, Zhixuan Zhang, Dan Ou, Haihong Tang, Bo Zheng
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Recent advances, advances in Large, scaling law research
Comments:
Abstract: Recent advances in Large Language Models (LLMs) have inspired a surge of scaling law research in industrial search, advertising, and recommendation systems. However, existing approaches focus mainly on architectural improvements, overlooking the critical synergy between data and architecture design. We observe that scaling model parameters alone exhibits diminishing returns, i.e., the marginal gain in performance steadily declines as model size increases, and that the performance degradation caused by complex heterogeneous data distributions is often irrecoverable through model design alone. In this paper, to address these limitations, we propose UniScale, a novel co-design framework that jointly optimizes data and architecture to unlock the full potential of model scaling, which includes two core parts: (1) ES$^3$ (Entire-Space Sample System), a high-quality data scaling system that expands the training signal beyond conventional sampling strategies from both intra-domain request contexts with global supervised signal constructed by hierarchical label attribution and cross-domain samples aligning with the essence of user decision under similar content exposure environment in search domain; and (2) HHSFT (Heterogeneous Hierarchical Sample Fusion Transformer), a novel architecture designed to effectively model the complex heterogeneous distribution of scaled data and to harness the entire space user behavior data with Heterogeneous Hierarchical Feature Interaction and Entire Space User Interest Fusion, thereby surpassing the performance ceiling of structure-only model tuning. Extensive experiments on a large-scale real-world e-commerce search platform demonstrate that UniScale achieves significant improvements through the synergistic co-design of data and architecture and exhibits clear scaling trends, delivering substantial gains in key business metrics.
8. 【2603.24218】Who Benefits from RAG? The Role of Exposure, Utility and Attribution Bias
Link: https://arxiv.org/abs/2603.24218
Authors: Mahdi Dehghan, Graham McDonald
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, achieved substantial improvements, query group fairness
Comments:
Abstract:Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have achieved substantial improvements in accuracy by grounding their responses in external documents that are relevant to the user's query. However, relatively little work has investigated the impact of RAG in terms of fairness. Particularly, it is not yet known if queries that are associated with certain groups within a fairness category systematically receive higher accuracy, or accuracy improvements in RAG systems compared to LLM-only, a phenomenon we refer to as query group fairness. In this work, we conduct extensive experiments to investigate the impact of three key factors on query group fairness in RAG, namely: Group exposure, i.e., the proportion of documents from each group appearing in the retrieved set, determined by the retriever; Group utility, i.e., the degree to which documents from each group contribute to improving answer accuracy, capturing retriever-generator interactions; and Group attribution, i.e., the extent to which the generator relies on documents from each group when producing responses. We examine group-level average accuracy and accuracy improvements disparities across four fairness categories using three datasets derived from the TREC 2022 Fair Ranking Track for two tasks: article generation and title generation. Our findings show that RAG systems suffer from the query group fairness problem and amplify disparities in terms of average accuracy across queries from different groups, compared to an LLM-only setting. Moreover, group utility, exposure, and attribution can exhibit strong positive or negative correlations with average accuracy or accuracy improvements of queries from that group, highlighting their important role in fair RAG. Our data and code are publicly available from Github.
9. 【2603.24216】Where Do Your Citations Come From? Citation-Constellation: A Free, Open-Source, No-Code, and Auditable Tool for Citation Network Decomposition with Complementary BARON and HEROCON Scores
Link: https://arxiv.org/abs/2603.24216
Authors: Mahbub Ul Alam
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: scholarly influence propagates, Standard citation metrics, Equilibrated Research Outreach, Research Outreach Network, Standard citation
Comments: Citation-Constellation No-Code Tool Link: [this https URL](https://citation-constellation.serve.scilifelab.se)
Abstract:Standard citation metrics treat all citations as equal, obscuring the social and structural pathways through which scholarly influence propagates. I introduce Citation-Constellation, a freely available no-code tool for citation network analysis with two complementary bibliometric scores that decompose a researcher's citation profile by network proximity between citing and cited authors. BARON (Boundary-Anchored Research Outreach Network score) is a strict binary metric counting only citations from outside the detected collaborative network. HEROCON (Holistic Equilibrated Research Outreach CONstellation score) applies graduated weights assigning partial credit to in-group citations based on relationship proximity. The gap between scores serves as a diagnostic of inner-circle dependence. An extended abstract with full details appears in the paper. The tool implements this through a phased architecture: (1) self-citation analysis, (2) co-authorship graph traversal, (3) temporal institutional affiliation matching via ROR, and (4) AI-agent-driven venue governance extraction using a local LLM. Phases 1-3 are fully operational; Phase 4 is under development. Key design choices include ORCID-validated author identity resolution, an UNKNOWN classification for citations with insufficient metadata, and comprehensive audit trails documenting every classification decision. A no-code web interface enables researchers to compute scores without programming, installation, or registration. I present these scores as structural diagnostics, not quality indicators. BARON and HEROCON describe where in the social graph citations originate. They should not be used for hiring, promotion, or funding decisions. HEROCON weights are experimental and require empirical calibration.
10. 【2603.24204】SumRank: Aligning Summarization Models for Long-Document Listwise Reranking
Link: https://arxiv.org/abs/2603.24204
Authors: Jincheng Feng, Wenhan Liu, Zhicheng Dou
Subjects: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, passage reranking task, demonstrated superior performance, Language Models
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have demonstrated superior performance in the listwise passage reranking task. However, directly applying them to rank long-form documents introduces both effectiveness and efficiency issues due to the substantially increased context length. To address this challenge, we propose SumRank, a pointwise summarization model aligned with downstream listwise reranking, which compresses long-form documents into concise rank-aligned summaries before the final listwise reranking stage. To obtain SumRank, we introduce a three-stage training pipeline comprising cold-start Supervised Fine-Tuning (SFT), specialized RL data construction, and rank-driven alignment via Reinforcement Learning. This paradigm aligns SumRank with downstream ranking objectives to preserve relevance signals. We conduct extensive experiments on five benchmark datasets from the TREC Deep Learning tracks (TREC DL 19-23). Results show that our lightweight SumRank model achieves state-of-the-art (SOTA) ranking performance while significantly improving efficiency by reducing both summarization overhead and reranking complexity.
11. 【2603.24136】Sequence-aware Large Language Models for Explainable Recommendation
Link: https://arxiv.org/abs/2603.24136
Authors: Gangyi Zhang, Runzhe Teng, Chongming Gao
Subjects: Information Retrieval (cs.IR)
Keywords: Large Language Models, generating natural language, Large Language, Language Models, shown strong potential
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have shown strong potential in generating natural language explanations for recommender systems. However, existing methods often overlook the sequential dynamics of user behavior and rely on evaluation metrics misaligned with practical utility. We propose SELLER (SEquence-aware LLM-based framework for Explainable Recommendation), which integrates explanation generation with utility-aware evaluation. SELLER combines a dual-path encoder, which captures both user behavior and item semantics, with a Mixture-of-Experts adapter to align these signals with LLMs. A unified evaluation framework assesses explanations via both textual quality and their effect on recommendation outcomes. Experiments on public benchmarks show that SELLER consistently outperforms prior methods in explanation quality and real-world utility.
12. 【2603.24118】S4CMDR: a metadata repository for electronic health records
Link: https://arxiv.org/abs/2603.24118
Authors: Jiawei Zhao(1), Md Shamim Ahmed(1), Nicolai Dinh Khang Truong(1), Verena Schuster(2), Rudolf Mayer(2), Richard Röttger(1) ((1) University of Southern Denmark, Department for Mathematics and Computer Science, Denmark, (2) SBA Research, Austria)
Subjects: Information Retrieval (cs.IR)
Keywords: Electronic health records, Electronic health, compatible EHR records, Background, machine learning
Comments: 16 pages, 7 figures
Click to view abstract
Abstract:Background: Electronic health records (EHRs) enable machine learning for diagnosis, prognosis, and clinical decision support. However, EHR standards vary by country and hospital, often making records incompatible. This limits large-scale and cross-clinical machine learning. To address such complexity, a metadata repository cataloguing available data elements, their value domains, and their compatibility is an essential tool. This allows researchers to leverage relevant data for tasks such as identifying undiagnosed rare disease patients. Results: Within the Screen4Care project, we developed S4CMDR, an open-source metadata repository built on ISO 11179-3, based on a middle-out metadata standardisation approach. It automates cataloguing to reduce errors and enable the discovery of compatible feature sets across data registries. S4CMDR supports on-premise Linux deployment and cloud hosting, with state-of-the-art user authentication and an accessible interface. Conclusions: S4CMDR is a clinical metadata repository for registering and discovering compatible EHR records. Novel contributions include a microservice architecture, a middle-out standardisation approach, and a user-friendly interface for error-free data registration and visualisation of metadata compatibility. We validate S4CMDR through case studies involving rare disease patients. We invite clinical data holders to populate S4CMDR using their metadata to validate its generalisability and support further development.
13. 【2603.24054】Hierarchical Spatial-Temporal Graph-Enhanced Model for Map-Matching
Link: https://arxiv.org/abs/2603.24054
Authors: Anjun Gao, Zhenglin Wan, Pingfu Chao, Shunyu Yao
Subjects: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: integration of GNSS, GNSS data, portable devices, devices has led, generation of vast
Comments:
Click to view abstract
Abstract:The integration of GNSS data into portable devices has led to the generation of vast amounts of trajectory data, which is crucial for applications such as map-matching. To overcome the limitations of rule-based methods, recent work has applied deep learning to trajectory-related tasks. However, existing models still face challenges such as the difficulty of large-scale data labeling, ineffective modeling of spatial-temporal relationships, and discrepancies between training and test data distributions. To tackle these challenges, we propose HSTGMatch, a novel model designed to enhance map-matching performance. Our approach involves a two-stage process: hierarchical self-supervised learning and spatial-temporal supervised learning. We introduce a hierarchical trajectory representation, leveraging both grid cells and geographic tuples to capture moving patterns effectively. The model constructs an Adaptive Trajectory Adjacency Graph to dynamically capture spatial relationships, optimizing GATs for improved efficiency. Furthermore, we incorporate a Spatial-Temporal Factor to extract relevant features and employ a decay coefficient to address variations in trajectory length. Our extensive experiments demonstrate the model's superior performance, module effectiveness, and robustness, providing a promising solution for overcoming the existing limitations in map-matching applications. The source code of HSTGMatch is publicly available on GitHub at this https URL.
14. 【2603.23972】Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Link: https://arxiv.org/abs/2603.23972
Authors: Somaya Eltanbouly, Samer Rashwani
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Quran and Hadith, achieved remarkable progress, Large language models, Large language, achieved remarkable
Comments:
Click to view abstract
Abstract:Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: this https URL.
15. 【2603.23849】VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models
Link: https://arxiv.org/abs/2603.23849
Authors: Blessy Antony, Amartya Dutta, Sneha Aggarwal, Vasu Gatne, Ozan Gökdemir, Samantha Grimes, Adam Lauring, Brian R. Wasik, Anuj Karpatne, T. M. Murali
Subjects: Information Retrieval (cs.IR)
Keywords: train machine learning, machine learning, models impedes, artificial intelligence, SIE
Comments: Under review at ACM KDD 2026 (AI for Sciences Track)
Click to view abstract
Abstract:The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence (AI) for science research. Scientific information extraction (SIE) from the literature using LLMs is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and focus on extracting information from short and well-formatted text. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we used a domain that has been virtually ignored in SIE, namely virology, to address these research gaps. We design a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. We develop a new, multi-step retrieval augmented generation (RAG) framework called VILLA for SIE. In parallel, we curate a novel dataset of 629 mutations in ten influenza A virus proteins obtained from 239 scientific publications to serve as ground truth for the mutation extraction task. Finally, we demonstrate VILLA's superior performance using a novel and comprehensive evaluation and comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE.
16. 【2603.23710】An In-Depth Study of Filter-Agnostic Vector Search on a PostgreSQL Database System: [Experiments and Analysis]
Link: https://arxiv.org/abs/2603.23710
Authors: Duo Lu, Helena Caminal, Manos Chatzakis, Yannis Papakonstantinou, Yannis Chronis, Vaibhav Jain, Fatma Özcan
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Filtered Vector Search, supporting semantic search, Filtered Vector, Vector Search, semantic search
Comments: 26 pages, 13 figures, to be published at SIGMOD 2026
Click to view abstract
Abstract:Filtered Vector Search (FVS) is critical for supporting semantic search and GenAI applications in modern database systems. However, existing research most often evaluates algorithms in specialized libraries, making optimistic assumptions that do not align with enterprise-grade database systems. Our work challenges this premise by demonstrating that in a production-grade database system, commonly made assumptions do not hold, leading to performance characteristics and algorithmic trade-offs that are fundamentally different from those observed in isolated library settings. This paper presents the first in-depth analysis of filter-agnostic FVS algorithms within a production PostgreSQL-compatible system. We systematically evaluate post-filtering and inline-filtering strategies across a wide range of selectivities and correlations. Our central finding is that the optimal algorithm is not dictated by the cost of distance computations alone, but that system-level overheads that come from both distance computations and filter operations (like page accesses and data retrieval) play a significant role. We demonstrate that graph-based approaches (such as NaviX/ACORN) can incur prohibitive numbers of filter checks and system-level overheads, compared with clustering-based indexes such as ScaNN, often canceling out their theoretical benefits in real-world database environments. Ultimately, our findings provide the database community with crucial insights and practical guidelines, demonstrating that the optimal choice for a filter-agnostic FVS algorithm is not absolute, but rather a system-aware decision contingent on the interplay between workload characteristics and the underlying costs of data access in a real-world database architecture.
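The two strategies compared in this abstract can be illustrated with a small brute-force sketch. It uses an exhaustive scan rather than a real index such as ScaNN or a graph index, so it only shows the order of operations, not the page-access or filter-check costs the paper measures. The row layout, the function names, and the `overfetch` parameter are assumptions made for this example.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter_search(rows, query, predicate, k, overfetch=4):
    """Rank by distance first, then drop candidates that fail the filter.

    Only the k * overfetch nearest rows are ever filter-checked, so the
    result can hold fewer than k ids when the filter is highly selective.
    """
    candidates = sorted(rows, key=lambda r: l2(r["vec"], query))[: k * overfetch]
    return [r["id"] for r in candidates if predicate(r)][:k]


def inline_filter_search(rows, query, predicate, k):
    """Apply the filter during the scan, so only passing rows are ranked.

    Always finds k matches if they exist, but runs one filter check per
    visited row.
    """
    passing = [r for r in rows if predicate(r)]
    nearest = sorted(passing, key=lambda r: l2(r["vec"], query))[:k]
    return [r["id"] for r in nearest]


# Ten 1-D points; only the two farthest from the query pass the filter.
rows = [{"id": i, "vec": [float(i)], "tag": "a" if i >= 8 else "b"} for i in range(10)]
is_a = lambda r: r["tag"] == "a"

print(post_filter_search(rows, [0.0], is_a, k=2, overfetch=2))
print(inline_filter_search(rows, [0.0], is_a, k=2))
```

With a selective filter that is negatively correlated with distance, the post-filter variant returns nothing because none of its four nearest candidates pass, while the inline variant pays for ten filter checks to find both matches.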
Related DOI: https://doi.org/10.1145/3802011
17. 【2603.23554】Mixture of Demonstrations for Textual Graph Understanding and Question Answering
Link: https://arxiv.org/abs/2603.23554
Authors: Yukun Wu, Lihui Liu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: graph-based retrieval-augmented generation, large language models, enhancing large language, domain-specific question answering, Textual graph-based retrieval-augmented
Comments:
Click to view abstract
Abstract:Textual graph-based retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) in domain-specific question answering. While existing approaches primarily focus on zero-shot GraphRAG, selecting high-quality demonstrations is crucial for improving reasoning and answer accuracy. Furthermore, recent studies have shown that retrieved subgraphs often contain irrelevant information, which can degrade reasoning performance. In this paper, we propose MixDemo, a novel GraphRAG framework enhanced with a Mixture-of-Experts (MoE) mechanism for selecting the most informative demonstrations under diverse question contexts. To further reduce noise in the retrieved subgraphs, we introduce a query-specific graph encoder that selectively attends to information most relevant to the query. Extensive experiments across multiple textual graph benchmarks show that MixDemo significantly outperforms existing methods.
18. 【2603.23533】MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG
Link: https://arxiv.org/abs/2603.23533
Authors: Bhavik Mangla
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: RAG pipelines typically, requires multiple LLM, ignores document structure, multiple LLM calls, pipelines typically rely
Comments: 13 pages, 4 figures, 7 tables, 2 algorithms. Code: [this https URL](https://github.com/bhavik-mangla/MDKeyChunker)
Click to view abstract
Abstract:RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
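Stage (3) of the pipeline described above, merging chunks that share a semantic key via bin-packing, can be sketched as follows. This is a guess at the general shape, not the MDKeyChunker implementation: the first-fit packing strategy, the `max_tokens` budget, the whitespace token count, and the chunk dictionary layout are all assumptions made for this example.

```python
from collections import defaultdict


def merge_by_key(chunks, max_tokens=8):
    """Group chunks by semantic key, then first-fit pack each group.

    Assumptions: each chunk is a dict with "key" and "text", and token
    counts are approximated by whitespace splitting.
    """
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["key"]].append(chunk["text"])

    merged = []
    for key, texts in groups.items():
        bins = []  # each bin is [token_count, [texts]]
        for text in texts:
            size = len(text.split())
            for b in bins:
                if b[0] + size <= max_tokens:  # first bin with room wins
                    b[0] += size
                    b[1].append(text)
                    break
            else:
                bins.append([size, [text]])  # no bin had room; open a new one
        merged.extend({"key": key, "text": "\n\n".join(t)} for _, t in bins)
    return merged


chunks = [
    {"key": "install", "text": "pip install foo"},
    {"key": "usage", "text": "call foo.run() to start"},
    {"key": "install", "text": "requires Python 3.10"},
    {"key": "install", "text": "one two three four five six"},
]
merged = merge_by_key(chunks, max_tokens=8)
for m in merged:
    print(m["key"], repr(m["text"]))
```

The two short "install" chunks end up in one merged chunk even though a "usage" chunk sat between them in the source, which is the co-location effect the abstract describes; the longer third chunk exceeds the budget and opens a second bin.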
19. 【2603.23516】MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
Link: https://arxiv.org/abs/2603.23516
Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: memory, Long-term memory, Memory Sparse Attention
Comments:
Click to view abstract
Abstract:Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
Submission history: [v1] Fri, 6 Mar 2026 02:29:54 UTC (383 KB), from Runkai Chen
20. 【2603.23512】S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering
Link: https://arxiv.org/abs/2603.23512
Authors: Rong Fu, Yemin Wang, Tianxiang Xu, Yongtai Liu, Weizhi Tang, Wangyu Wu, Xiaowen Ma, Simon Fong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: shortest-path Retrieval-Augmented Generation, Retrieval-Augmented Generation framework, Generation framework designed, semantic-aware shortest-path Retrieval-Augmented, large knowledge graphs
Comments:
Click to view abstract
Abstract:We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted $k$-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a contrastive path encoder and lightweight verifier, and injecting a compact soft mixture of selected path latents into a language model via cross-attention. The system runs inside an iterative Neural-Socratic Graph Dialogue loop in which concise diagnostic messages produced by the language model are mapped to targeted graph edits or seed expansions, enabling adaptive retrieval when the model expresses uncertainty. This combination yields a retrieval mechanism that is both token-efficient and topology-aware while preserving interpretable path-level traces for diagnostics and intervention. We validate S-Path-RAG on standard multi-hop KGQA benchmarks and through ablations and diagnostic analyses. The results demonstrate consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines. We further analyze trade-offs between semantic weighting, verifier filtering, and iterative updates, and report practical recommendations for deployment under constrained compute and token budgets.
21. 【2603.23508】Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
Link: https://arxiv.org/abs/2603.23508
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: complex source materials, document-centric assistants, source materials, increasingly deployed, deployed in enterprise
Comments:
Click to view abstract
Abstract:Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated passages. We present the design of a real-time verification component integrated into a production RAG pipeline that enables full-document grounding under latency constraints. The system processes documents up to 32K tokens and employs adaptive inference strategies to balance response time and verification coverage across workloads. We describe the architectural decisions, operational trade-offs, and evaluation methodology used to deploy the verifier, and show that full-context verification substantially improves detection of unsupported responses compared with truncated validation. Our experience highlights when long-context verification is necessary, why chunk-based checking often fails in real documents, and how latency budgets shape model design. These findings provide practical guidance for practitioners building reliable large-scale retrieval-augmented applications. (Model, benchmark, and code: this https URL)
22. 【2603.22779】KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao
Link: https://arxiv.org/abs/2603.22779
Authors: Zhi Sun, Wenming Zhang, Yi Wei, Liren Yu, Zhixuan Zhang, Dan Ou, Haihong Tang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, personalized search systems, profound semantic knowledge
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) are equipped with profound semantic knowledge, making them a natural choice for injecting semantic generalization into personalized search systems. However, in practice we find that directly fine-tuning LLMs on industrial personalized tasks (e.g. next item prediction) often yields suboptimal results. We attribute this bottleneck to a critical Knowledge-Action Gap: the inherent conflict between preserving pre-trained semantic knowledge and aligning with specific personalized actions by discriminative objectives. Empirically, action-only training objectives induce Semantic Collapse, such as attention "sinks". This degradation severely cripples the LLM's generalization, so it fails to bring improvements to personalized search systems. We propose KARMA (Knowledge-Action Regularized Multimodal Alignment), a unified framework that treats semantic reconstruction as a train-only regularizer. KARMA optimizes a next-interest embedding for retrieval (Action) while enforcing semantic decodability (Knowledge) through two complementary objectives: (i) history-conditioned semantic generation, which anchors optimization to the LLM's native next-token distribution, and (ii) embedding-conditioned semantic reconstruction, which constrains the interest embedding to remain semantically recoverable. On the Taobao search system, KARMA mitigates semantic collapse (attention-sink analysis) and improves both action metrics and semantic fidelity. In ablations, semantic decodability yields up to +22.5 HR@200. With KARMA, we achieve +0.25 CTR AUC in ranking, +1.86 HR in pre-ranking and +2.51 HR in recalling. Deployed online with low inference overhead at the ranking stage, KARMA drives a +0.5% increase in Item Click.
Computer Vision
1. 【2603.24584】TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models
Link: https://arxiv.org/abs/2603.24584
Authors: Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: mapping language instructions, shown strong progress, robotic actions, mapping language, language instructions
Comments:
Click to view abstract
Abstract:Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
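The contrast step described in this abstract resembles the standard classifier-free guidance combination, which can be sketched in a few lines. The exact formula TAG uses is not given in the abstract, so the update rule below (erased prediction plus a scaled difference), the `scale` parameter, and the representation of actions as plain lists of floats are assumptions borrowed from CFG for illustration.

```python
def tag_guidance(action_original, action_erased, scale=1.5):
    """CFG-style combination of two policy predictions (assumed form).

    action_original: prediction given the full observation
    action_erased:   prediction given the object-erased observation
    scale:           hypothetical guidance strength; 1.0 returns the
                     original prediction, larger values amplify the part
                     of the action that depends on the target object
    """
    return [e + scale * (o - e) for o, e in zip(action_original, action_erased)]


# Toy 2-D action: only the first coordinate depends on the target object.
original = [1.0, 2.0]
erased = [0.5, 2.0]
print(tag_guidance(original, erased, scale=1.0))
print(tag_guidance(original, erased, scale=2.0))
```

Coordinates where both predictions agree pass through unchanged, so the steering signal only acts on the components that the erased object actually influences.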
2. 【2603.24581】Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving
Link: https://arxiv.org/abs/2603.24581
Authors: Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu han, Yifeng Pan, Dongbin Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: autonomous driving framework, achieves strong trajectory, strong trajectory planning, dynamics-informed latent world, latent world representations
Comments:
Click to view abstract
Abstract:We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
3. 【2603.24578】Vision-Language Models vs Human: Perceptual Image Quality Assessment
Link: https://arxiv.org/abs/2603.24578
Authors: Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: encourage automated approaches, limited scalability encourage, scalability encourage automated, image quality assessment, Psychophysical experiments remain
Comments:
Click to view abstract
Abstract:Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness ($\rho$ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating VLMs are more reliable when stimulus differences are clearly expressed.
4. 【2603.24577】EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
Link: https://arxiv.org/abs/2603.24577
Authors: Falong Fan,Yi Xie,Arnis Lektauers,Bo Liu,Jerzy Rozenblit
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: surgical robotic perception, deformable soft tissues, robotic perception, deformable soft, Accurate
Comments:
Click to view abstract
Abstract:Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
5. 【2603.24576】Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
Link: https://arxiv.org/abs/2603.24576
Authors: Xinying Guo,Chenxi Jiang,Hyun Bin Kim,Ying Sun,Yang Xiao,Yuhang Han,Jianfei Yang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: making action selection, make decision-time observations, action selection non-Markovian, observations perceptually aliased, decision-time observations perceptually
Comments: Code is available at [this https URL](https://github.com/gxyes/MARS_Chameleon)
Click to view abstract
Abstract:Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
6. 【2603.24575】VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
Link: https://arxiv.org/abs/2603.24575
Authors: Qijia He,Xunmei Liu,Hammaad Memon,Ziang Li,Zixian Ma,Jaemin Cho,Jason Ren,Daniel S Weld,Ranjay Krishna
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Scalable Vector Graphics, offering precise resolution, flexible semantic editability, precise resolution independence, Scalable Vector
Comments:
Click to view abstract
Abstract:Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
7. 【2603.24571】Towards Training-Free Scene Text Editing
Link: https://arxiv.org/abs/2603.24571
Authors: Yubo Li,Xugong Qin,Peng Zhang,Hailun Lin,Gangyan Zeng,Kexin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: maintaining visual realism, Scene text editing, text editing seeks, Flow Manifold Steering, modify textual content
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at this https URL
8. 【2603.24570】Anti-I2V: Safeguarding your photos from malicious image-to-video generation
Link: https://arxiv.org/abs/2603.24570
Authors: Duc Vu,Anh Nguyen,Chi Tran,Anh Tran
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: specific person photo, poses threats, text prompts, threats of misuse, creation of fake
Comments: Accepted to CVPR 2026 (Main Conference)
Click to view abstract
Abstract:Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L^*a^*b^*$ and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
9. 【2603.24569】POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Link: https://arxiv.org/abs/2603.24569
Authors: Marta Moscati,Muhammad Saad Saeed,Marina Zanoni,Mubashir Noman,Rohan Kumar Das,Monorama Swain,Yufang Hou,Elisabeth Andre,Khalid Mahmood Malik,Markus Schedl,Shah Nawaz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: homogeneous audio-visual modalities, Multimodal speaker identification, POLY-SIM Grand Challenge, speaker identification systems, systems typically assume
Comments: Grand challenge at ACM MM 2026
Click to view abstract
Abstract:Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.
10. 【2603.24558】LensWalk: Agentic Video Understanding by Planning How You See in Videos
Link: https://arxiv.org/abs/2603.24558
Authors: Keliang Li,Yansong Li,Hongze Shen,Mengdi Liu,Hong Chang,Shiguang Shan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Language Model, presents a profound, profound challenge, challenge for automated, Language Model reasoner
Comments: To be published in CVPR 2026
Click to view abstract
Abstract:The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
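To make "temporal scope and sampling density" concrete, here is a minimal sketch of how one observation request could be turned into frame timestamps. It is only an illustration: the function `sample_timestamps` and its parameters are hypothetical, and uniform spacing is an assumption rather than something the abstract specifies.

```python
from typing import List


def sample_timestamps(start_s: float, end_s: float, num_frames: int) -> List[float]:
    """Return uniformly spaced timestamps (in seconds) inside [start_s, end_s].

    start_s, end_s: the temporal scope chosen for this observation step
    num_frames:     the sampling density, expressed as a frame count
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if end_s < start_s:
        raise ValueError("end_s must not precede start_s")
    if num_frames == 1:
        # A single frame is taken from the middle of the window.
        return [(start_s + end_s) / 2]
    step = (end_s - start_s) / (num_frames - 1)
    return [start_s + i * step for i in range(num_frames)]


if __name__ == "__main__":
    # Broad scan: a few frames spread over a ten-minute video.
    print(sample_timestamps(0, 600, 5))
    # Focused look: dense frames over a ten-second segment.
    print(sample_timestamps(120, 130, 6))
```

The same helper covers both uses mentioned in the abstract: a wide window with few frames for scanning, and a narrow window with many frames for fact extraction.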
11. 【2603.24552】The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
Link: https://arxiv.org/abs/2603.24552
Authors: Jan Hemmerling,Marcel Schwieder,Philippe Rufin,Leon-Friedrich Thomas,Mirela Tulbure,Patrick Hostert,Stefan Erasmi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: farming systems, farming, sustainable agriculture, Vision Transformer, key element
Comments:
Click to view abstract
Abstract:Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.
12. 【2603.24549】A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English
Link: https://arxiv.org/abs/2603.24549
Authors: Dana Serditova,Kevin Tang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Keywords: Automatic Speech Recognition, performance remains uneven, mainstream accents represented, Automatic Speech, current speech recognition
Comments: 54 pages, 11 figures
Click to view abstract
Abstract:Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.
13. 【2603.24541】SEGAR: Selective Enhancement for Generative Augmented Reality
Link: https://arxiv.org/abs/2603.24541
Authors: Fanjun Bu,Chenyang Yuan,Hiroshi Yasuda
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: enable temporally coherent, avoiding per-frame rendering, incorporate deliberate visual, predicting future image, future image sequences
Comments:
Click to view abstract
Abstract:Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.
14. 【2603.24539】CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
Link: https://arxiv.org/abs/2603.24539
Authors: Florian Stilz,Vinkle Srivastav,Nassir Navab,Nicolas Padoy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: intraoperative surgical procedure, Intraoperative Surgical Procedures, Video-language foundation models, highly effective, wide range
Comments:
Click to view abstract
Abstract:Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at this https URL.
15. 【2603.24533】UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
Link: https://arxiv.org/abs/2603.24533
Authors: Zichuan Lin,Feiyu Liu,Yijun Yang,Jiafei Lyu,Yiming Gao,Yicheng Liu,Zhicong Lu,Yangbin Yu,Mingyu Yang,Junyou Li,Deheng Ye,Jie Jiang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, attracted increasing attention
Comments: Code and models are available at [this https URL](https://github.com/ui-voyager/UI-Voyager)
Click to view abstract
Abstract:Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.
16. 【2603.24528】Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
Link: https://arxiv.org/abs/2603.24528
Authors: Dipam Goswami,Simone Magistri,Gido M. van de Ven,Bartłomiej Twardowski,Andrew D. Bagdanov,Tinne Tuytelaars,Joost van de Weijer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-language models, image, objective of aligning, Vision-language, text
Comments: Preprint
Click to view abstract
Abstract:Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
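The basic prototype mixing idea from this abstract can be sketched in a few lines. The sketch assumes a convex combination of the normalized text embedding and the normalized mean image embedding, followed by cosine-similarity classification. The helper names, the weight `alpha`, and the toy 2-D embeddings are hypothetical. The text-aligned projection and the LDA classifier mentioned in the abstract are not covered.

```python
import math
from typing import Dict, List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mix_prototypes(text_emb: Vector, image_embs: List[Vector], alpha: float) -> Vector:
    """Convex mix of a text prototype and a few-shot image prototype.

    alpha = 1.0 keeps only the text embedding, alpha = 0.0 only the image mean.
    Pulling the noisy few-shot image mean toward the text embedding is what
    gives the mix its shrinkage-like behavior.
    """
    text_proto = normalize(text_emb)
    image_proto = normalize(mean_vector([normalize(v) for v in image_embs]))
    return normalize([alpha * t + (1 - alpha) * i
                      for t, i in zip(text_proto, image_proto)])


def classify(query: Vector, prototypes: Dict[str, Vector]) -> str:
    """Return the label whose prototype has the highest cosine similarity."""
    q = normalize(query)
    scores = {label: sum(a * b for a, b in zip(q, p))
              for label, p in prototypes.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real CLIP features have hundreds of dimensions.
    prototypes = {
        "cat": mix_prototypes([1.0, 0.0], [[0.9, 0.1], [0.8, 0.3]], alpha=0.5),
        "dog": mix_prototypes([0.0, 1.0], [[0.2, 0.9]], alpha=0.5),
    }
    print(classify([0.7, 0.4], prototypes))
```

No training step is involved: the prototypes are computed once from the few labeled images and the class text embeddings.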
17. 【2603.24506】Toward Physically Consistent Driving Video World Models under Challenging Trajectories
Link: https://arxiv.org/abs/2603.24506
Authors: Jiawei Zhou,Zhenxin Zhu,Lingyi Du,Linye Lyu,Lijun Zhou,Zhanqian Wu,Hongcheng Luo,Zhuotao Tian,Bing Wang,Guang Chen,Hangjun Ye,Haiyang Sun,Yu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving simulation, shown strong potential, driving, driving videos, driving simulation
Comments:
Click to view abstract
Abstract:Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories (such as imperfect trajectories generated by simulators or planning systems), producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: this https URL.
18. 【2603.24484】Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
Link: https://arxiv.org/abs/2603.24484
Authors: Siqi Liu,Xinyang Li,Bochao Zou,Junbao Zhuo,Huimin Ma,Jiansheng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Theory of Mind, human-like Theory, infer human mental, continue to advance, increasing interest
Comments: 20 pages, 7 figures, accepted at CVPR 2026, project page: see [this https URL](https://founce.github.io/VisionToM)
Click to view abstract
Abstract:As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark (an egocentric, real-world video dataset for ToM with three multiple-choice QA settings) demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine-human collaboration toward greater alignment.
19. 【2603.24480】Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories
Link: https://arxiv.org/abs/2603.24480
Authors: Kawtar Zaher,Olivier Buisson,Alexis Joly
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: large unlabeled collections, Real-world fine-grained visual, minimal supervision, requires discovering, unlabeled collections
Comments:
Click to view abstract
Abstract:Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.
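One way to read "prioritizes near-boundary samples while favoring likely positives" is sketched below. This is an interpretation, not the authors' stated rule. It assumes the classifier outputs a probability per candidate, that "likely positive" means a score at or above a threshold of 0.5, and that ambiguity is the distance to that threshold. The name `pf_ma_select` and the toy scores are hypothetical.

```python
from typing import List


def pf_ma_select(scores: List[float], batch_size: int,
                 threshold: float = 0.5) -> List[int]:
    """Pick candidate indices for annotation, positives first.

    Candidates predicted positive (score >= threshold) are ranked by how
    close they are to the decision boundary. Predicted negatives are used
    only to fill the batch, again starting with the most ambiguous ones.
    """
    def ambiguity(index: int) -> float:
        return abs(scores[index] - threshold)

    positives = [i for i, s in enumerate(scores) if s >= threshold]
    negatives = [i for i, s in enumerate(scores) if s < threshold]
    ranked = sorted(positives, key=ambiguity) + sorted(negatives, key=ambiguity)
    return ranked[:batch_size]


if __name__ == "__main__":
    # Toy classifier probabilities for six unlabeled candidates.
    scores = [0.95, 0.55, 0.45, 0.10, 0.60, 0.52]
    print(pf_ma_select(scores, batch_size=3))
```

In this reading, plain uncertainty sampling would mix in the candidate at 0.45, whereas the positive-first ordering keeps the batch on the positive side of the boundary whenever enough such candidates exist.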
20. 【2603.24470】Counting Without Numbers \ Finding Without Words
Link: https://arxiv.org/abs/2603.24470
Authors: Badri Narayana Patro
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Keywords: million pets enter, pets enter shelters, million pets, enter shelters, pets enter
Comments:
Abstract:Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.
21. 【2603.24458】OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
Link: https://arxiv.org/abs/2603.24458
Authors: Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable success, alternatives significantly lag, open-source alternatives significantly, omni-capable video generation, video generation
Comments: 32 pages, 22 figures. Project Page: [this https URL](https://omniweaving.github.io)
Abstract:While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: this https URL.
22. 【2603.24454】Unleashing Vision-Language Semantics for Deepfake Video Detection
Link: https://arxiv.org/abs/2603.24454
Authors: Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: CLIP exhibit strong, exhibit strong generalization, strong generalization capabilities, CLIP exhibit, pre-trained Vision-Language Models
Comments: 14 pages, 7 figures, accepted by CVPR 2026
Abstract:Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- an Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by identity-prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at this https URL.
23. 【2603.24440】CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
Link: https://arxiv.org/abs/2603.24440
Authors: Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: hold great promise, complex desktop workflows, automating complex desktop, hold great, great promise
Comments: Project Page: [this https URL](https://cua-suite.github.io/)
Abstract:Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
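The duration figures quoted in this abstract can be sanity-checked with simple arithmetic. The sketch below assumes every frame or screenshot is played back at 30 fps; the abstract states that rate for VideoCUA, while applying it to the ScaleCUA screenshots is only an assumption about how the comparison was made.

```python
def frames_to_hours(num_frames, fps=30):
    """Convert a frame count to hours of video at a constant frame rate."""
    return num_frames / fps / 3600


# 2 million screenshots treated as 30 fps video (assumed conversion).
scalecua_hours = frames_to_hours(2_000_000)
# 6 million frames of 30 fps screen recordings.
videocua_hours = frames_to_hours(6_000_000)
```

Under that assumption the two counts come out at roughly 18.5 and 55.6 hours, consistent with "less than 20 hours" and "approximately 55 hours".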
24. 【2603.24434】The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment
Link: https://arxiv.org/abs/2603.24434
Authors: Laura McDaniel, Basudha Pal, Crystal Szczesny, Yuxiang Guo, Ryan Roemmich, Peter Abadir, Rama Chellappa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diminished physiological reserve, aging medicine characterized, vulnerability to stressors, medicine characterized, characterized by diminished
Comments:
Abstract:Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.
25. 【2603.24407】Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation
Link: https://arxiv.org/abs/2603.24407
Authors: Ching-Lam Cheng, Bin Zhu, Shengfeng He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating realistic, human-computer interaction, natural language, language is vital, Generating
Comments: 5 pages, accepted by ICASSP2026
Abstract:Generating realistic 3D hand motion from natural language is vital for VR, robotics, and human-computer interaction. Existing methods either focus on full-body motion, overlooking detailed hand gestures, or require explicit 3D object meshes, limiting generality. We propose TSHaMo, a model-agnostic teacher-student diffusion framework for text-driven hand motion generation. The student model learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training. A co-training strategy enables the student to benefit from the teacher's intermediate predictions while remaining text-only at inference. Evaluated using two diffusion backbones on GRAB and H2O, TSHaMo consistently improves motion quality and diversity. Ablations confirm its robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.
26. 【2603.24388】Causal Transfer in Medical Image Analysis
Link: https://arxiv.org/abs/2603.24388
Authors: Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang, Ahmed Karam Eldaly, Xujiong Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: models frequently fail, imaging protocols due, deployed across hospitals, frequently fail, fail when deployed
Comments:
Abstract:Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We review studies spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organise them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.
27. 【2603.24383】ViHOI: Human-Object Interaction Synthesis with Visual Priors
Link: https://arxiv.org/abs/2603.24383
Authors: Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating realistic, physically plausible, remains a key, realistic and physically, key challenge
Comments: Accepted to CVPR 2026
Abstract:Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.
28. 【2603.24376】GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization
Link: https://arxiv.org/abs/2603.24376
Authors: Pengyue Jia, Derong Xu, Yingyi Zhang, Xiaopeng Li, Wenlin Zhang, Yi Wen, Yuanshao Zhu, Xiangyu Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Worldwide image geolocalization, predict precise GPS, precise GPS coordinates, image geolocalization aims, Worldwide image
Comments:
Abstract:Worldwide image geolocalization aims to predict precise GPS coordinates for images captured anywhere on Earth, which is challenging due to the large visual and geographic diversity. Recent methods mainly follow two paradigms: retrieval-based approaches that match queries against a reference database, and generation-based approaches that directly predict coordinates using Large Vision-Language Models (LVLMs). However, we observe distinct error profiles between them: retrieval excels at fine-grained instance matching, while generation offers robust semantic reasoning. This complementary heterogeneity suggests that no single paradigm is universally superior. To harness this potential, we propose GeoRouter, a dynamic routing framework that adaptively assigns each query to the optimal paradigm. GeoRouter leverages an LVLM backbone to analyze visual content and provide routing decisions. To optimize GeoRouter, we introduce a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal, explicitly reflecting relative performance differences. Furthermore, we construct GeoRouting, the first large-scale dataset tailored for training routing policies with independent paradigm predictions. Extensive experiments on IM2GPS3k and YFCC4k demonstrate that GeoRouter significantly outperforms state-of-the-art baselines.
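To make the idea of turning a distance gap into a continuous supervision signal more concrete, here is a minimal Python sketch. The sigmoid mapping, the `temperature_km` parameter, and the function names are assumptions for illustration only; the abstract does not specify the actual form of the distance-aware preference objective.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_preference(err_retrieval_km, err_generation_km, temperature_km=100.0):
    """Soft target in (0, 1) that retrieval should be preferred (hypothetical form).

    The larger the generation error is relative to the retrieval error,
    the closer the target gets to 1; equal errors give exactly 0.5.
    """
    return _sigmoid((err_generation_km - err_retrieval_km) / temperature_km)


def routing_loss(router_prob_retrieval, target):
    """Binary cross-entropy between the router's output and the soft target."""
    eps = 1e-12
    p = min(max(router_prob_retrieval, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Made-up errors: retrieval is off by 5 km, generation by 500 km.
target = soft_preference(5.0, 500.0)
```

Compared with a hard 0/1 label, a soft target of this kind penalises the router less when the two paradigms perform almost equally well on a query.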
29. 【2603.24373】PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
Link: https://arxiv.org/abs/2603.24373
Authors: Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large-scale vision-language models, large-scale vision-language, text recognition, OCR, precise text localization
Comments:
Abstract:The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at this https URL.
30. 【2603.24355】Language-Guided Structure-Aware Network for Camouflaged Object Detection
Link: https://arxiv.org/abs/2603.24355
Authors: Min Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Camouflaged Object Detection, highly challenging task, Object Detection, aims to segment, terms of color
Comments:
Abstract:Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in terms of color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate the above issues, they generally lack the guidance of textual semantic priors, which limits the model's ability to focus on camouflaged regions in complex scenes. To address this issue, this paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, based on the visual backbone PVT-v2, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 to focus on potential target regions. On this foundation, we further design a Fourier Edge Enhancement Module (FEEM), which integrates multi-scale features with high-frequency information in the frequency domain to extract edge enhancement features. Furthermore, we propose a Structure-Aware Attention Module (SAAM) to effectively enhance the model's perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to enhance fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.
31. 【2603.24329】GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Link: https://arxiv.org/abs/2603.24329
Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal LLMs, LLMs are increasingly, increasingly deployed, deployed as perceptual, perceptual backbones
Comments:
Abstract:Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
32. 【2603.24327】Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens
Link: https://arxiv.org/abs/2603.24327
Authors: Ciem Cornelissen, Sam Leroux, Pieter Simoens
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: learning visual representations, manual annotations, heterogeneous sensors, Teledyne FLIR ADAS, powerful paradigm
Comments:
Abstract:Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.
33. 【2603.24326】Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Link: https://arxiv.org/abs/2603.24326
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: resolution significantly impacts, significantly impacts performance, image resolution significantly, fine-grained task, Region Focus Module
Comments: Accepted by CVPR2026
Abstract:Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy among visual regions in document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at this https URL.
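The quadratic growth in vision tokens mentioned in this abstract follows from how patch-based encoders tile an image. The short Python sketch below shows it for a generic ViT-style tokenizer; the patch size of 14 and the example resolutions are assumptions chosen for illustration, not PaddleOCR-VL's actual configuration.

```python
def vision_token_count(height, width, patch_size=14):
    """Number of non-overlapping square patches a ViT-style encoder would produce."""
    return (height // patch_size) * (width // patch_size)


base = vision_token_count(448, 448)     # 32 x 32 patches
doubled = vision_token_count(896, 896)  # 64 x 64 patches
```

Doubling the side length of the input quadruples the token count, which is why skipping redundant regions such as blank background can save a large share of the compute.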
34. 【2603.24322】Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
Link: https://arxiv.org/abs/2603.24322
Authors: Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun, Xiaoqing Chen, Xingyu Liu, Zheng Wang, Kaiyan Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: adverse weather conditions, significantly impacts unsupervised, impacts unsupervised domain, unsupervised domain adaptation, classes significantly impacts
Comments: Accepted by CVPR 2026
Abstract:The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network's focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Notably, our method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.
35. 【2603.24312】Refining time-space traffic diagrams: A neighborhood-adaptive linear regression method
Link: https://arxiv.org/abs/2603.24312
Authors: Zhihong Yao, Yi Yu, Yunxia Wu, Hao Li, Yangsheng Jiang, Zhengbing He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: traffic theory research, resolution directly influencing, traffic diagram serves, engineering applications, crucial tool
Comments:
Abstract:The time-space (TS) traffic diagram serves as a crucial tool for characterizing the dynamic evolution of traffic flow, with its resolution directly influencing the effectiveness of traffic theory research and engineering applications. However, constrained by monitoring precision and sampling frequency, existing TS traffic diagrams commonly suffer from low resolution. To address this issue, this paper proposes a refinement method for TS traffic diagrams based on neighborhood-adaptive linear regression. Introducing the concept of neighborhood embedding into TS diagram refinement, the method leverages local pattern similarity in TS diagrams, adaptively identifies neighborhoods similar to target cells, and fits the low-to-high resolution mapping within these neighborhoods for refinement. It avoids the over-smoothing tendency of the traditional global linear model, allows the capture of unique traffic wave propagation and congestion evolution characteristics, and outperforms the traditional neighborhood embedding method in terms of local information utilization to achieve target cell refinement. Validation on two real datasets across multiple scales and upscaling factors shows that, compared to benchmark methods, the proposed method achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in metrics including MAE, MAPE, CMJS, SSIM, and GMSD, respectively. Furthermore, the proposed method exhibits strong generalization and robustness in cross-day and cross-scenario validations. In summary, requiring only a minimal amount of paired high- and low-resolution training data, the proposed method features a concise formulation, providing a foundation for the low-cost, fine-grained refinement of low-sampling-rate traffic data.
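The general idea of fitting a local low-to-high resolution mapping inside a neighborhood of similar cells can be sketched in a few lines of Python. This is a one-dimensional toy version under stated assumptions: the function name `refine_cell`, the absolute-difference similarity, the fixed `k`, and the training pairs are all hypothetical, and the paper's adaptive neighborhood selection is not reproduced here.

```python
def refine_cell(lr_value, train_pairs, k=3):
    """Estimate a high-resolution value from the k most similar training cells.

    train_pairs: list of (low_res_value, high_res_value) tuples.
    A separate least-squares line is fitted for every query, using only
    its nearest neighbors, instead of one global linear model.
    """
    neighbors = sorted(train_pairs, key=lambda pair: abs(pair[0] - lr_value))[:k]
    xs = [x for x, _ in neighbors]
    ys = [y for _, y in neighbors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # all neighbors share one input value: fall back to their mean
        return mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return mean_y + slope * (lr_value - mean_x)


# Made-up pairs with two regimes that a single global line would blur together.
pairs = [(10, 20), (12, 24), (14, 28), (60, 40), (62, 38), (64, 36)]
low_regime = refine_cell(11, pairs)
high_regime = refine_cell(63, pairs)
```

Because each query only sees neighbors from its own regime, the two fits recover opposite slopes, which illustrates why a local model can avoid the over-smoothing the abstract attributes to a global linear model.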
36. 【2603.24296】AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication
Link: https://arxiv.org/abs/2603.24296
Authors: Jie Song, Jun Jia, Wei Sun, Wangqiu Zhou, Tao Tan, Guangtao Zhai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal image fusion, enables precise lesion, precise lesion localization, strengthening clinical decision-making, medical imaging research
Comments:
Abstract:Multimodal image fusion enables precise lesion localization and characterization for accurate diagnosis, thereby strengthening clinical decision-making and driving its growing prominence in medical imaging research. A powerful multimodal image fusion model relies on high-quality, clinically representative multimodal training data and a rigorously engineered model architecture. Therefore, the development of such professional radiomics models represents a collaborative achievement grounded in standardized acquisition, clinical-specific expertise, and algorithmic design proficiency, which necessitates protection of associated intellectual property rights. However, current multimodal image fusion models generate fused outputs without built-in mechanisms to safeguard intellectual property rights, inadvertently exposing proprietary model knowledge and sensitive training data through inference leakage. For example, malicious users can exploit fusion outputs and model distillation or other inference-based reverse engineering techniques to approximate the fusion performance of proprietary models. To address this issue, we propose AMIF, the first Authorizable Medical Image Fusion model with built-in authentication, which integrates authorization access control into the image fusion objective. For unauthorized usage, AMIF embeds explicit and visible copyright identifiers into fusion results. In contrast, high-quality fusion results are accessible upon successful key-based authentication.
37. 【2603.24295】RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
Link: https://arxiv.org/abs/2603.24295
Authors: Kai Zhu, Zhenyu Cui, Zehua Zang, Jiahuan Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: state space, state space compression, state space models, State Space Model, demonstrated efficient video
Comments: Accepted by CVPR 2026
Abstract: Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models' capability for pixel-level segmentation. To tackle this issue, we propose a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. In addition, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at this https URL.
38. 【2603.24294】VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection
Link: https://arxiv.org/abs/2603.24294
Authors: Jumin Lee, Siyeong Lee, Namil Kim, Sung-Eui Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: driving datasets pose, rare classes exhibit, classes exhibit substantial, Long-tail distributions, exhibit substantial intra-class
Comments:
Abstract: Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB-LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at this https URL.
39. 【2603.24278】TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
Link: https://arxiv.org/abs/2603.24278
Authors: Guan Luo, Xiu Li, Rui Chen, Xuanyu Yi, Jing Lin, Chia-Hao Chen, Jiahang Liu, Song-Hai Zhang, Jianfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: firm upper bound, reconstruction capability sets, generation quality, generation relies, paradigm for high-fidelity
Comments:
Abstract: The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an $L_\infty$ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details.
40. 【2603.24270】ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
Link: https://arxiv.org/abs/2603.24270
Authors: Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Scanning Positional Encoding
Comments:
Abstract: While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial [...]. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional [...]. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core [...]. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural [...]. [...], Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
41. 【2603.24260】Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
Link: https://arxiv.org/abs/2603.24260
Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: flexible content generation, important paradigm, paradigm for high-quality, high-quality and flexible, flexible content
Comments: 10 pages, 6 figures, accepted by CVPR 2026
Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the DiT: many attention operations over spatio-temporal tokens are executed redundantly, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67× latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.
42. 【2603.24257】Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning
Link: https://arxiv.org/abs/2603.24257
Authors: Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: yield inconsistent descriptions, construct consistent semantic, hindering the ability, yield inconsistent, inconsistent descriptions
Comments: 24 pages, 7 figures, 7 tables (including Supplementary Materials)
Abstract: Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at this https URL.
43. 【2603.24245】B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition
Link: https://arxiv.org/abs/2603.24245
Authors: Nishit Poddar, Aglind Reka, Diana-Laura Borza, Snehashis Majhi, Michal Balazia, Abhijit Das, Francois Bremond
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: minor posture shifts, high inter-class ambiguity, carry rich social, rich social meaning, current action recognition
Comments:
Abstract: Micro-actions, fleeting and low-amplitude motions such as glances, nods, or minor posture shifts, carry rich social meaning but remain difficult for current action recognition models to recognize due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs), and is based on the lightweight Macro-Micro Motion Encoder (M3E) that captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low-amplitude classes.
44. 【2603.24240】InstanceRSR: Real-World Super-Resolution via Instance-Aware Representation Alignment
Link: https://arxiv.org/abs/2603.24240
Authors: Zixin Guo, Kai Zhao, Luyan Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: globally consistent reconstructions, achieved remarkable progress, Existing real-world super-resolution, consistent reconstructions, based on generative
Comments: 4 pages, 4 figures, 2 tables. Accepted by ICASSP 2026
Abstract:Existing real-world super-resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high-quality and globally consistent reconstructions. However, they often struggle to recover fine-grained details of diverse object instances in complex real-world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance-level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance-level feature alignment. Specifically, we employ low-resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance-aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine-grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real-world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state-of-the-art (SOTA) performance.
45. 【2603.24232】Attack Assessment and Augmented Identity Recognition for Human Skeleton Data
Link: https://arxiv.org/abs/2603.24232
Authors: Joseph G. Zalameda, Megan A. Witherow, Alexander M. Glandon, Jose Aguilera, Khan M. Iftekharuddin
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Machine learning models, Machine learning, based skeleton data, Hierarchical Co-occurrence Networks, Augmented Identity Recognition
Comments: 8 pages, 9 figures, 3 tables
Abstract: Machine learning models trained on small data sets for security applications are especially vulnerable to adversarial attacks. Person identification from LiDAR-based skeleton data requires time-consuming and expensive data acquisition for each subject identity. Recently, Assessment and Augmented Identity Recognition for Skeletons (AAIRS) has been used to train Hierarchical Co-occurrence Networks for Person Identification (HCN-ID) with small LiDAR-based skeleton data sets. However, AAIRS does not evaluate robustness of HCN-ID to adversarial attacks or inoculate the model to defend against such attacks. Popular perturbation-based approaches to generating adversarial attacks are constrained to targeted perturbations added to real training samples, which is not ideal for inoculating models with small training sets. Thus, we propose Attack-AAIRS, a novel addition to the AAIRS framework. Attack-AAIRS leverages a small real data set and a GAN-generated synthetic data set to assess and improve model robustness against unseen adversarial attacks. Rather than being constrained to perturbations of limited real training samples, the GAN learns the distribution of adversarial attack samples that exploit weaknesses in HCN-ID. Attack samples drawn from this distribution augment training for inoculation of the HCN-ID to improve robustness. Ten-fold cross validation of Attack-AAIRS yields increased robustness to unseen attacks, including FGSM, PGD, Additive Gaussian Noise, MI-FGSM, and BIM. The HCN-ID Synthetic Data Quality Score for Attack-AAIRS indicates that generated attack samples are of similar quality to the original benign synthetic samples generated by AAIRS. Furthermore, inoculated models show consistent final test accuracy with the original model trained on real data, demonstrating that our method improves robustness to adversarial attacks without reducing test performance on real data.
46. 【2603.24224】RVLM: Recursive Vision-Language Models with Adaptive Depth
Link: https://arxiv.org/abs/2603.24224
Authors: Nicanor Mayumu, Zeenath Khan, Melodena Stephens, Patrick Mukala, Farhad Oroumchian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: face two fundamental, Medical AI systems, Medical, systems face, fundamental limitations
Comments:
Abstract: Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets, wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: this https URL.
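The early-termination half of the adaptive-depth idea described above can be sketched as a loop that stops once progress stalls. This is only an illustration under stated assumptions: the budget is passed in rather than predicted by a controller, "progress" is a scalar score returned by each step, and `run_adaptive`, `min_gain`, `patience`, and `SCORES` are hypothetical names, not the paper's API.

```python
def run_adaptive(step_fn, budget, min_gain=0.01, patience=2):
    """Call step_fn(i) up to `budget` times; stop early once the returned score
    has failed to improve by at least min_gain for `patience` consecutive steps."""
    best = float("-inf")
    stalled = 0
    steps_used = 0
    for i in range(budget):
        score = step_fn(i)
        steps_used += 1
        if score - best >= min_gain:
            best = score       # meaningful progress: reset the stall counter
            stalled = 0
        else:
            stalled += 1       # no meaningful progress on this step
            if stalled >= patience:
                break
    return steps_used, best


# Hypothetical per-step evidence scores: progress plateaus after the third step
SCORES = [0.2, 0.5, 0.7, 0.7, 0.7, 0.9, 0.95]
steps_used, best = run_adaptive(lambda i: SCORES[i], budget=len(SCORES))
```

With these made-up scores the loop stops after five of the seven budgeted steps. The trade-off is visible in the data: a stall-based rule saves compute but can miss later gains, such as the 0.9 that appears at step six here.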
47. 【2603.24209】HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer
Link: https://arxiv.org/abs/2603.24209
Authors: Minjun Kim, Minje Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Personalized Federated Learning, existing methods suffer, Federated Learning, shallow prototype alignment, brittle server-side distillation
Comments: Accepted at WACV 2026. 8 pages, 7 figures, 3 tables
Abstract: Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (i) performs depth-aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART-PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL (code available at this https URL).
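The three loss ingredients named in the abstract above (cosine alignment, MSE matching, and symmetric KL distillation) have standard textbook forms, sketched below in plain Python. How the paper weights, averages, or combines them is not stated in the abstract; the 0.5 factor in `symmetric_kl`, the `eps` smoothing, and all function names are assumptions made for illustration.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); the 0.5 factor is an assumed convention."""
    return 0.5 * (kl(p, q) + kl(q, p))


def cosine_loss(u, v):
    """1 - cosine similarity: penalizes differences in direction only, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mse_loss(u, v):
    """Mean squared error: penalizes differences in both direction and magnitude."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
skl_pq = symmetric_kl(p, q)
skl_qp = symmetric_kl(q, p)
same_direction = cosine_loss([1.0, 0.0], [2.0, 0.0])
mse_value = mse_loss([1.0, 2.0], [1.0, 4.0])
```

The cosine term ignores feature scale (the two vectors above differ only in length, so the loss is zero), while MSE does not, which is one plausible reading of why the abstract pairs a directional loss with a magnitude-sensitive one at different depths.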
48. 【2603.24208】Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement
Link: https://arxiv.org/abs/2603.24208
Authors: Xin Zhang, Jianyang Xu, Hao Peng, Dongjing Wang, Jingyuan Zheng, Yu Li, Yuyu Yin, Hongbo Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Knowledge distillation transfers, distillation transfers knowledge, efficient inference, Knowledge distillation, Knowledge
Comments: 9 pages, 6 figures
Abstract: Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at this https URL.
49. 【2603.24198】RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution
Link: https://arxiv.org/abs/2603.24198
Authors: Yushuai Song, Weize Quan, Weining Wang, Jiahui Sun, Jing Liu, Meng Li, Pengbin Yu, Zhentao Chen, Wei Shen, Lunxi Yuan, Dong-ming Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, improved visual realism, greatly improved visual, frameworks remain misaligned, generative super-resolution
Comments:
Abstract: Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference and No-Reference metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Model (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.
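For readers unfamiliar with GRPO, its central step is to score each sampled output relative to the other outputs in the same group, rather than against a learned value baseline. A minimal sketch of that normalization follows. The reward values are made up, the function name is hypothetical, and the use of the population standard deviation with a small `eps` is an assumed convention; nothing here reflects the paper's specific ranking rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and (roughly) unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate outputs generated from the same input
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every candidate scores the same yields no learning signal
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Outputs that beat their group average get a positive advantage and the rest a negative one, so only the relative ordering within a group matters; this fits naturally with reward signals that are themselves rankings.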
50. 【2603.24181】Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection
Link: https://arxiv.org/abs/2603.24181
Authors: Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current Large Vision, Large Vision Language, Current Large, Vision Language Models, answering and OCR
Comments:
Abstract:Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
51. 【2603.24166】Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
Link: https://arxiv.org/abs/2603.24166
Authors: Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: referring object detection, severe label scarcity, face severe label, augmented reality, object detection
Comments: CVPR 2026
Abstract: Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.
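To make the idea of biasing the matching stage with a prior concrete, the sketch below subtracts a weighted prior score from a base matching cost before solving the one-to-one assignment. It is an assumption-laden toy: the cost values, the linear form `cost - weight * prior`, and the names `best_assignment` and `biased_cost` are all hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm (it returns the same optimum on tiny matrices but does not scale).

```python
from itertools import permutations


def best_assignment(cost):
    """Exhaustively find the minimum-cost one-to-one assignment of rows to columns
    for a small square cost matrix. Returns the chosen column for each row."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm)


def biased_cost(base_cost, prior, weight):
    """Lower the cost of pairs the prior considers plausible (assumed linear form)."""
    return [[c - weight * p for c, p in zip(cost_row, prior_row)]
            for cost_row, prior_row in zip(base_cost, prior)]


# Rows: ground-truth objects, columns: predictions (all values hypothetical)
BASE = [[0.50, 0.45],
        [0.40, 0.50]]
PRIOR = [[1.0, 0.0],   # the prior says prediction 0 is plausible for object 0
         [0.0, 1.0]]   # and prediction 1 is plausible for object 1

plain = best_assignment(BASE)
guided = best_assignment(biased_cost(BASE, PRIOR, weight=0.3))
```

Here the base costs alone narrowly prefer the crossed assignment, while the prior-adjusted costs flip it to the diagonal one, which is the kind of nudge toward plausible candidates the abstract describes.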
52. 【2603.24157】CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
Link: https://arxiv.org/abs/2603.24157
Authors: Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuying Chen, Salman Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: transforming human-computer interaction, real-world tasks, pipelines are transforming, transforming human-computer, enabling efficient
Comments: CVPR 2026 Findings
Abstract:Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to perform more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.
53. 【2603.24156】A convergent Plug-and-Play Majorization-Minimization algorithm for Poisson inverse problems
Link: https://arxiv.org/abs/2603.24156
Authors: Thibaut Modrzyk (CREATIS), Ane Etxebeste (CREATIS), Élie Bretin (ICJ, MMCS), Voichita Maxim (CREATIS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Poisson inverse problems, Poisson inverse, inverse problems, algorithm for Poisson, Poisson
Comments:
Abstract:In this paper, we present a novel variational plug-and-play algorithm for Poisson inverse problems. Our approach minimizes an explicit functional which is the sum of a Kullback-Leibler data fidelity term and a regularization term based on a pre-trained neural network. By combining classical likelihood maximization methods with recent advances in gradient-based denoisers, we allow the use of pre-trained Gaussian denoisers without sacrificing convergence guarantees. The algorithm is formulated in the majorization-minimization framework, which guarantees convergence to a stationary point. Numerical experiments confirm state-of-the-art performance in deconvolution and tomography under moderate noise, and demonstrate clear superiority in high-noise conditions, making this method particularly valuable for nuclear medicine applications.
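The "explicit functional" described above can be written out in a generic form. The notation is an assumption, since the abstract fixes no symbols: $y$ denotes the observed counts, $A$ the forward operator, $g_\theta$ the network-based regularizer, and $\lambda > 0$ a weight. The data term is the generalized Kullback-Leibler divergence, which equals the Poisson negative log-likelihood up to terms that do not depend on $x$.

```latex
F(x) = \mathrm{KL}(y \,\|\, Ax) + \lambda\, g_\theta(x),
\qquad
\mathrm{KL}(y \,\|\, Ax) = \sum_i \left[ y_i \log\frac{y_i}{(Ax)_i} - y_i + (Ax)_i \right]
```

A majorization-minimization scheme replaces $F$ at each iterate with a surrogate that upper-bounds it and touches it at the current point, so minimizing the surrogate cannot increase $F$; the specific surrogate used in the paper is not given in the abstract.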
54. 【2603.24146】LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
Link: https://arxiv.org/abs/2603.24146
Authors: Jaehun Bang,Jinhyeok Kim,Minji Kim,Seungheon Jeong,Kyungdon Joo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: understanding enables users, environments through natural, natural language, enables users, users to segment
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page this https URL.
55. 【2603.24139】Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Link: https://arxiv.org/abs/2603.24139
Authors: Zhanhe Lei,Zhongyuan Wang,Jikang Cheng,Baojin Huang,Yuhong Yang,Zhen Han,Chao Liang,Dengpan Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Standard supervised training, Standard supervised, deepfake detection treats, Tutor-Student Reinforcement Learning, uniform importance
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at this https URL.
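The bookkeeping described in this abstract (EMA loss, forgetting counts, per-sample weights in 0-1, and a reward for incorrect-to-correct transitions) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the PPO Tutor itself is omitted, and the function names, the EMA decay of 0.9, the averaging over the batch, and the reward of 1 per fixed sample are all assumptions.

```python
def update_ema_loss(previous_ema, loss, decay=0.9):
    """Exponential moving average of a sample's loss (decay is an assumed value)."""
    return decay * previous_ema + (1.0 - decay) * loss


def update_forgetting_count(count, was_correct, is_correct):
    """Count a forgetting event when a sample flips from correct to incorrect."""
    return count + 1 if was_correct and not is_correct else count


def reweighted_batch_loss(losses, weights):
    """Average per-sample losses after scaling each by a weight clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)


def tutor_reward(was_correct, is_correct):
    """Reward = number of samples that moved from incorrect to correct."""
    fixed = sum(1 for before, after in zip(was_correct, is_correct)
                if after and not before)
    return float(fixed)


# Hypothetical two-sample batch: the second weight is out of range and gets clipped.
batch_loss = reweighted_batch_loss([2.0, 4.0], [0.5, 1.5])
reward = tutor_reward([False, True, False], [True, True, False])
print(batch_loss, reward)
```

In a full system the weights would come from the Tutor's policy, and the reward would be fed back to its PPO update after each Student step.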
56. 【2603.24134】Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
Link: https://arxiv.org/abs/2603.24134
Authors: Haoyu Ji,Bowen Chen,Zhihao Yang,Wenze Huang,Yu Gao,Xueting Liu,Weihong Ren,Zhiyong Wang,Honghai Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: untrimmed skeletal motion, skeletal motion sequences, Skeleton-based Temporal Action, classify diverse actions, Skeleton-based Temporal
Comments: CVPR Conference
Click to view abstract
Abstract:Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and inter-class confusion. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance. Code is available at this https URL.
57. 【2603.24131】Reservoir-Based Graph Convolutional Networks
Link: https://arxiv.org/abs/2603.24131
Authors: Mayssa Soussia,Gita Ayu Salsabila,Mohamed Ali Mahjoub,Islem Rekik
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Graph Neural Networks, Neural Networks, Message passing, Graph Convolutional Networks, Graph Convolutional Network
Comments:
Click to view abstract
Abstract:Message passing is a core mechanism in Graph Neural Networks (GNNs), enabling the iterative update of node embeddings by aggregating information from neighboring nodes. Graph Convolutional Networks (GCNs) exemplify this approach by adapting convolutional operations for graph structures, allowing features from adjacent nodes to be combined effectively. However, GCNs encounter challenges with complex or dynamic data. Capturing long-range dependencies often requires deeper layers, which not only increase computational costs but also lead to over-smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, leveraging iterative message-passing dynamics for stable information propagation without extensive parameter tuning. Despite its promise, existing reservoir-based models lack structured convolutional mechanisms, limiting their ability to accurately aggregate multi-hop neighborhood information. To address these limitations, we propose RGC-Net (Reservoir-based Graph Convolutional Network), which integrates reservoir dynamics with structured graph convolution. Key contributions include: (i) a reimagined convolutional framework with fixed random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC-Net-powered transformer for graph generation with application to dynamic brain connectivity. Extensive experiments show that RGC-Net achieves state-of-the-art performance in classification and generative tasks, including brain graph evolution, with faster convergence and reduced over-smoothing. Source code is available at this https URL.
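As a rough illustration of how fixed random reservoir weights and a leaky integrator can be combined with graph convolution, the sketch below applies the update H' = (1 - a) * H + a * tanh(A_norm * H * W), where W is drawn once and never trained. The exact update rule used by RGC-Net is not given in the abstract, so this form, the symmetric normalization with self-loops, the leak rate, and every name in the code are assumptions.

```python
import math
import random


def matmul(A, B):
    """Dense matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as in a standard GCN."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def reservoir_step(a_norm, h, w, leak):
    """Leaky-integrator update with a fixed (untrained) reservoir matrix w."""
    pre = matmul(matmul(a_norm, h), w)
    return [[(1.0 - leak) * h_ij + leak * math.tanh(p_ij)
             for h_ij, p_ij in zip(h_row, p_row)]
            for h_row, p_row in zip(h, pre)]


# Toy 3-node path graph with 2 features per node (hypothetical values).
rng = random.Random(0)
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
reservoir = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]

a_norm = normalize_adjacency(adjacency)
state = features
for _ in range(3):  # three propagation steps, no parameter updates
    state = reservoir_step(a_norm, state, reservoir, leak=0.5)
print(state)
```

Because a fraction (1 - leak) of the previous state is carried over at each step, node features are not fully overwritten by their neighbourhood average, which is one intuitive reading of the "feature retention" mentioned in the abstract.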
58. 【2603.24117】Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization
Link: https://arxiv.org/abs/2603.24117
Authors: David Faget(CB),José Luis Lisani,Miguel Colom(CB, CMLA)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Planet-scale photo geolocalization, geographic location depicted, photo geolocalization involves, image purely based, Planet-scale photo
Comments:
Click to view abstract
Abstract:Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model's decisions, offering deeper insights than the traditional approaches.
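The idea of combining gradient-weighted class activation maps from several layers can be sketched as follows. The per-layer map follows the standard Grad-CAM definition (channel weights are the spatial mean of the gradients, followed by a ReLU over the weighted sum). The combination rule is not specified in the abstract, so the nearest-neighbour upsampling, per-map normalization, and plain averaging used here are assumptions, as are the toy activations and gradients.

```python
def grad_cam(activations, gradients):
    """Grad-CAM for one layer: ReLU of the gradient-weighted sum of channel maps."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of the class-score gradient for that channel.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    return [[max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
             for j in range(width)] for i in range(height)]


def upsample_nearest(cam, size):
    """Resize a square-ish map to size x size with nearest-neighbour lookup."""
    h, w = len(cam), len(cam[0])
    return [[cam[i * h // size][j * w // size] for j in range(size)] for i in range(size)]


def normalize(cam):
    """Scale a map to [0, 1] by its peak value (all zeros stay zero)."""
    peak = max(max(row) for row in cam)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in cam]


def combine_cams(cams, size):
    """Assumed fusion rule: average the normalized, upsampled per-layer maps."""
    resized = [normalize(upsample_nearest(c, size)) for c in cams]
    return [[sum(c[i][j] for c in resized) / len(resized) for j in range(size)]
            for i in range(size)]


# Toy example: a 2x2 map from a shallow layer and a 1x1 map from a deep layer.
shallow = grad_cam([[[0.0, 1.0], [2.0, 4.0]]], [[[1.0, 1.0], [1.0, 1.0]]])
deep = grad_cam([[[3.0]]], [[[2.0]]])
combined = combine_cams([shallow, deep], size=2)
print(combined)
```

The shallow map contributes spatial detail while the deep map contributes a coarse, uniform response, so the fused map keeps the fine structure that a deepest-layer-only map would lose.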
59. 【2603.24115】Retinal Layer Segmentation in OCT Images With 2.5D Cross-slice Feature Fusion Module for Glaucoma Assessment
Link: https://arxiv.org/abs/2603.24115
Authors: Hyunwoo Kim,Heesuk Kim,Wungrak Choi,Jae-Sang Hyun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: OCT images, accurate glaucoma diagnosis, diagnosis and monitoring, images is essential, segmentation
Comments:
Click to view abstract
Abstract:For accurate glaucoma diagnosis and monitoring, reliable retinal layer segmentation in OCT images is essential. However, existing 2D segmentation methods often suffer from slice-to-slice inconsistencies due to the lack of contextual information across adjacent B-scans. 3D segmentation methods are better for capturing slice-to-slice context, but they require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. The CFF module fuses inter-slice features to effectively capture contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the publicly available DUKE DME dataset. Compared to other segmentation methods without the CFF module, the proposed method achieved an 8.56% reduction in mean absolute distance and a 13.92% reduction in root mean square error, demonstrating improved segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enabling anatomically reliable retinal layer delineation for automated glaucoma evaluation and potential clinical applications.
60. 【2603.24106】Granular Ball Guided Stable Latent Domain Discovery for Domain-General Crowd Counting
Link: https://arxiv.org/abs/2603.24106
Authors: Fan Chen,Shuyin Xia,Yi Wang,Xinbo Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains highly challenging, single labeled source, exhibit severe distribution, counting remains highly, Single-source domain generalization
Comments:
Click to view abstract
Abstract:Single-source domain generalization for crowd counting remains highly challenging because a single labeled source domain often contains heterogeneous latent domains, while test data may exhibit severe distribution shifts. A fundamental difficulty lies in stable latent domain discovery: directly performing flat clustering on evolving sample-level latent features is easily affected by feature noise, outliers, and representation drift, leading to unreliable pseudo-domain assignments and weakened domain-structured learning. To address this issue, we propose a granular ball guided stable latent domain discovery framework for domain-general crowd counting. Specifically, the proposed method first organizes samples into compact local granular balls and then clusters granular ball centers as representatives to obtain pseudo-domains, transforming direct sample-level clustering into a hierarchical representative-based clustering process. This design yields more stable and semantically consistent pseudo-domain assignments. Built upon the discovered latent domains, we further develop a two-branch learning framework that enhances transferable semantic representations via semantic codebook re-encoding while modeling domain-specific appearance variations through a style branch, thereby reducing semantic-style entanglement and improving generalization under domain shifts. Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd under a strict no-adaptation protocol demonstrate that the proposed method consistently outperforms strong baselines, especially under large domain gaps.
61. 【2603.24097】LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
Link: https://arxiv.org/abs/2603.24097
Authors: Haoyu Ji,Xueting Liu,Yu Gao,Wenze Huang,Zhihao Yang,Weihong Ren,Zhiyong Wang,Honghai Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: densely parse untrimmed, parse untrimmed skeletal, untrimmed skeletal sequences, Skeleton-based Temporal Action, frame-level action categories
Comments: CVPR Conference
Click to view abstract
Abstract:Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: Spatially, generalized forces are fused with spatial representations to provide more discriminative semantics. Temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at this https URL.
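The work-energy theorem that the Energy Consistency Loss is said to enforce states that the change in kinetic energy equals the work done by the net force. A minimal one-dimensional sketch of such a penalty is shown below. It is not the paper's loss: the mean-squared residual form, the point-mass setting, and all names and values are assumptions made for illustration.

```python
def kinetic_energy(mass, velocity):
    """Kinetic energy of a point mass: 0.5 * m * v^2."""
    return 0.5 * mass * velocity ** 2


def energy_consistency_loss(kinetic, forces, displacements):
    """Mean squared mismatch between kinetic-energy change and work F * dq per step."""
    residuals = [
        (kinetic[t + 1] - kinetic[t]) - forces[t] * displacements[t]
        for t in range(len(forces))
    ]
    return sum(r * r for r in residuals) / len(residuals)


# Simulate a point mass under a constant force (hypothetical values).
mass, force, dt = 2.0, 3.0, 0.1
accel = force / mass
velocity = 0.5
kinetic = [kinetic_energy(mass, velocity)]
forces, displacements = [], []
for _ in range(5):
    # Exact displacement under constant acceleration over one time step.
    displacements.append(velocity * dt + 0.5 * accel * dt ** 2)
    forces.append(force)
    velocity += accel * dt
    kinetic.append(kinetic_energy(mass, velocity))

consistent = energy_consistency_loss(kinetic, forces, displacements)
inconsistent = energy_consistency_loss(kinetic, [0.0] * 5, displacements)
print(consistent, inconsistent)
```

With the true force the residual vanishes up to floating-point error, while a wrong force estimate (here zero) is penalized, which is the behaviour one would want from a physics-consistency term.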
62. 【2603.24086】LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation
Link: https://arxiv.org/abs/2603.24086
Authors: Ryugo Morita,Stanislav Frolov,Brian Bernhard Moser,Ko Watanabe,Riku Takahashi,Andreas Dengel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: demonstrated high-quality performance, performance in conditional, demonstrated high-quality, high-quality performance, structural cues
Comments: Accepted to IJCNN2026
Click to view abstract
Abstract:Diffusion models have demonstrated high-quality performance in conditional text-to-image generation, particularly with structural cues such as edges, layouts, and depth. However, lighting conditions have received limited attention and remain difficult to control within the generative process. Existing methods handle lighting through a two-stage pipeline that relights images after generation, which is inefficient. Moreover, they rely on fine-tuning with large datasets and heavy computation, limiting their adaptability to new models and tasks. To address this, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process to guide image generation with text prompts and user-specified light directions. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency, while preserving image quality and text alignment. This approach introduces new possibilities for dynamic, user-guided light control. Furthermore, it integrates seamlessly with models like ControlNet, demonstrating adaptability across diverse scenarios.
63. 【2603.24079】When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
Link: https://arxiv.org/abs/2603.24079
Authors: Ye Leng,Junjie Chu,Mingjie Li,Chenhao Lin,Chao Shen,Michael Backes,Yun Shen,Yang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: multimodal large language, large language models, multimodal large, large language, Recently
Comments: Accepted by CVPR 2026. 15 pages, 11 figures
Click to view abstract
Abstract:Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLMs-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.
64. 【2603.24078】PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation
Link: https://arxiv.org/abs/2603.24078
Authors: Yuheng Feng,Wen Zhang,Haodong Duan,Xingxing Zou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: annotated across composition, composition structure, poster understanding, semantic intent, generation prompts spanning
Comments: CVPR 2026, Project Page: [this https URL](https://github.com/ArtmeScienceLab/PosterIQ-Benchmark)
Click to view abstract
Abstract:We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models' creativity and integrate human-centred design principles into generative vision-language systems.
65. 【2603.24059】AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis
Link: https://arxiv.org/abs/2603.24059
Authors: Qiuhui Chen,Yushan Deng,Xuancheng Yao,Yi Hong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diagnosis requires integrating, requires integrating neuroimaging, models remain opaque, Alzheimer disease, multimodal models remain
Comments: ICME 2026
Click to view abstract
Abstract:Alzheimer's disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning-decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines.
66. 【2603.24058】Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
Link: https://arxiv.org/abs/2603.24058
Authors: Han Sun,Qin Li,Peixin Wang,Min Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, medical image analysis, Object hallucination, severely compromises, real-world applications
Comments: CVPR 2026 (Findings)
Click to view abstract
Abstract:Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs' general capability across diverse vision-language tasks.
67. 【2603.24057】Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics
Link: https://arxiv.org/abs/2603.24057
Authors: Jipeng Liu,Haichao Shi,Siyu Xing,Rong Yin,Xiao-Yu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: representational disconnect remains, capturing non-semantic artifacts, non-semantic artifacts inherent, Vision-Language Models, CLIP have emerged
Comments:
Click to view abstract
Abstract:While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.
68. 【2603.24045】LGEST: Dynamic Spatial-Spectral Expert Routing for Hyperspectral Image Classification
Link: https://arxiv.org/abs/2603.24045
Authors: Jiawen Wen,Suixuan Qiu,Zihang Luo,Xiaofei Yang,Haotian Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Convolutional Neural Networks, Neural Networks, including Convolutional Neural, achieved remarkable success, Deep learning methods
Comments:
Click to view abstract
Abstract:Deep learning methods, including Convolutional Neural Networks, Transformers and Mamba, have achieved remarkable success in hyperspectral image (HSI) classification. Nevertheless, existing methods exhibit inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose Local-Global Expert Spatial-Spectral Transformer (LGEST), a novel framework that synergistically combines three key innovations. The LGEST first employs a Deep Spatial-Spectral Autoencoder (DSAE) to generate compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Secondly, a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) leverages cross-attention mechanisms and residual mixture-of-experts layers to dynamically fuse multi-scale features, adaptively weighting spectral discriminability and spatial saliency through learnable gating functions. Finally, a Local-Global Expert System (LGES) processes decomposed features via sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, while transformer sub-experts model long-range contextual dependencies, with a routing controller dynamically selecting experts based on real-time feature saliency. Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods.
69. 【2603.24043】HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models
Link: https://arxiv.org/abs/2603.24043
Authors: Yeqi He,Liang Li,Zhiwen Yang,Xichun Sheng,Zhidong Zhao,Chenggang Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable performance, models have demonstrated, demonstrated remarkable, style transfer, style
Comments: Accepted in CVPR 2026 Findings
Click to view abstract
Abstract:Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or to retain the identity of user-provided content images, and thus struggle to balance style and content. We therefore propose a training-free style transfer approach via heterogeneous attention modulation (HAM) that protects identity information during image- or text-guided style reference transfer, thereby addressing the style-content trade-off. Specifically, we first introduce style noise initialization to initialize the latent noise for diffusion. Then, during the diffusion process, the approach applies HAM to different attention mechanisms, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserve the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.
70. 【2603.24039】SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
Link: https://arxiv.org/abs/2603.24039
Authors: Haiyang Xu,Ronghuan Wu,Li-Yi Wei,Nanxuan Zhao,Chenxi Liu,Cuong Nguyen,Zhuowen Tu,Zhaowen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Keywords: original semantic layering, modern design workflows, layering is lost, cornerstone of modern, modern design
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: this https URL
71. 【2603.24037】A^3: Towards Advertising Aesthetic Assessment
Link: https://arxiv.org/abs/2603.24037
Authors: Kaiyuan Ji,Yixuan Gao,Lu Sun,Yushuo Zheng,Zijian Chen,Jianbo Zhang,Xiangyang Zhu,Yuan Tian,Zicheng Zhang,Guangtao Zhai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: commercial conversion rates, current evaluation methods, evaluation methods rely, Advertising Aesthetic Assessment, significantly impact commercial
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: this https URL.
72. 【2603.24036】SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision
链接:https://arxiv.org/abs/2603.24036
作者:Avigail Cohen Rimon,Amir Mann,Mirela Ben Chen,Or Litany
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly attractive representation, Gaussian Splatting, model-based video tracking, enables real-time, photorealistic novel view
备注: Project page: [this https URL](https://avigailco.github.io/SpectralSplats/)
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer "in the wild" remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target's local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this "vanishing gradient" problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.
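The vanishing-gradient argument in this abstract can be illustrated with a toy 1D case. The sketch below is not the authors' implementation: the box "image", the helper names (`box_signal`, `pixel_loss`, `spectral_moments`, `spectral_loss`), and the use of a single low frequency are all assumptions made for illustration. It shows that a pixel-wise loss is flat once two shapes stop overlapping, while a loss on low-frequency complex sinusoidal features still grows with the misalignment. Frequency annealing is not modeled here.

```python
import cmath
import math

def box_signal(n, start, width):
    """Toy 1D 'image': ones on [start, start + width), zeros elsewhere."""
    return [1.0 if start <= x < start + width else 0.0 for x in range(n)]

def pixel_loss(a, b):
    """Sum of squared per-pixel differences (a spatial, photometric-style loss)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def spectral_moments(signal, freqs):
    """Project the signal onto global complex sinusoids exp(-2*pi*i*k*x/n)."""
    n = len(signal)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * x / n) for x, v in enumerate(signal))
        for k in freqs
    ]

def spectral_loss(a, b, freqs):
    """Squared distance between the spectral moments of two signals."""
    ma, mb = spectral_moments(a, freqs), spectral_moments(b, freqs)
    return sum(abs(u - v) ** 2 for u, v in zip(ma, mb))

N, WIDTH = 128, 8
target = box_signal(N, 20, WIDTH)
near = box_signal(N, 40, WIDTH)  # shifted by 20 samples, no overlap with target
far = box_signal(N, 60, WIDTH)   # shifted by 40 samples, no overlap with target

# The pixel loss cannot tell 'near' from 'far': both have zero overlap.
flat = pixel_loss(near, target) == pixel_loss(far, target)

# The lowest non-zero frequency still orders them by distance.
low = [1]
ordered = spectral_loss(far, target, low) > spectral_loss(near, target, low)
```

In this toy setting the spatial loss gives an optimizer no direction to move in, whereas the low-frequency loss decreases as the shape moves toward the target. Higher frequencies would add periodic minima, which is the motivation the abstract gives for annealing from low to high frequencies.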
73. 【2603.24030】Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
链接:https://arxiv.org/abs/2603.24030
作者:Sa Zhu,Wanqian Zhang,Lin Wang,Xiaohua Chen,Chenxu Cui,Jinchao Zhang,Bo Li
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Temporal Action Detection, Open-Vocabulary Temporal Action, Action Detection, Open-Vocabulary Temporal, aims to classify
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, a Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase by leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual-textual matching and adaptively aggregate alignment results across phases for the final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.
74. 【2603.24016】COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm
链接:https://arxiv.org/abs/2603.24016
作者:Zekun Qian,Wei Feng,Ruize Han,Junhui Hou
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:real-world scenarios involving, scenarios involving diverse, involving diverse objects, Open-Vocabulary Multi-Object Tracking, Multi-Object Tracking
备注:
点击查看摘要
Abstract:Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
75. 【2603.24006】UW-VOS: A Large-Scale Dataset for Underwater Video Object Segmentation
链接:https://arxiv.org/abs/2603.24006
作者:Hongshen Zhao,Jingkang Tai,Yuhang Wu,Wenkang Zhang,Xi Lan,Shangyan Wang,Tianyu Zhang,Wankou Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video Object Segmentation, Object Segmentation, suffer significant degradation, significant degradation due, Underwater Video Object
备注:
点击查看摘要
Abstract:Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce $\textbf{UW-VOS}$, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose $\textbf{SAM-U}$, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only $\sim$2$\%$ trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point $\mathcal{J}\&\mathcal{F}$ drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.
76. 【2603.24005】DB SwinT: A Dual-Branch Swin Transformer Network for Road Extraction in Optical Remote Sensing Imagery
链接:https://arxiv.org/abs/2603.24005
作者:Zongyang He,Xiangli Yang,Xian Gao,Zhiguo Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:accurate road extraction, road extraction, traffic monitoring, disaster management, Swin Transformer network
备注:
点击查看摘要
Abstract:With the continuous improvement in the spatial resolution of optical remote sensing imagery, accurate road extraction has become increasingly important for applications such as urban planning, traffic monitoring, and disaster management. However, road extraction in complex urban and rural environments remains challenging, as roads are often occluded by trees, buildings, and other objects, leading to fragmented structures and reduced extraction accuracy. To address this problem, this paper proposes a Dual-Branch Swin Transformer network (DB SwinT) for road extraction. The proposed framework combines the long-range dependency modeling capability of the Swin Transformer with the multi-scale feature fusion strategy of U-Net, and employs a dual-branch encoder to learn complementary local and global representations. Specifically, the local branch focuses on recovering fine structural details in occluded areas, while the global branch captures broader semantic context to preserve the overall continuity of road networks. In addition, an Attentional Feature Fusion (AFF) module is introduced to adaptively fuse features from the two branches, further enhancing the representation of occluded road segments. Experimental results on the Massachusetts and DeepGlobe datasets show that DB SwinT achieves Intersection over Union (IoU) scores of 79.35% and 74.84%, respectively, demonstrating its effectiveness for road extraction from optical remote sensing imagery.
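For readers unfamiliar with the Intersection over Union (IoU) metric reported above, here is a minimal sketch of how it is computed for binary road masks. The function name `iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ in details such as thresholding or per-image averaging.

```python
def iou(pred, truth):
    """Intersection over Union for two binary masks given as 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1  # pixel is road in both masks
            if p or t:
                union += 1  # pixel is road in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

# Toy 3x4 masks (1 = road pixel), made up for this example.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
truth = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
score = iou(pred, truth)  # 4 shared pixels out of 6 in the union
```

A score of 79.35% therefore means that, over the evaluated pixels, the overlap between predicted and annotated road area is about four fifths of their combined area.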
77. 【2603.23997】HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images
链接:https://arxiv.org/abs/2603.23997
作者:Yumeng Liu,Xiao-Xiao Long,Marc Habermann,Xuanze Yang,Cheng Lin,Yuan Liu,Yuexin Ma,Wenping Wang,Ligang Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recovering high-fidelity, computer vision, holding significant, consumer-grade RGB cameras, Recovering
备注: project page: [this https URL](https://lym29.github.io/HGGT/)
点击查看摘要
Abstract:Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility, requiring the ability to leverage massive amounts of unstructured image data from the internet or enable deployment on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Here is the link of our project page: this https URL.
78. 【2603.23988】CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning
链接:https://arxiv.org/abs/2603.23988
作者:Hieu Hoang,Dung Trung Tran,Hong Nguyen,Nam-Phong Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Online Action Detection, Online Action, Action Detection, high computational cost, primary challenges
备注:
点击查看摘要
Abstract:Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Adding optical flow can provide strong motion cues, but it incurs significant computational overhead. We propose CAKE, a flow-based distillation framework for OAD that transfers motion knowledge into RGB models. We propose a Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, effectively approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from temporal background. Experiments conducted on the TVSeries, THUMOS'14, and Kinetics-400 datasets show the effectiveness of our model. CAKE achieves a standout mAP compared with SOTA while using the same backbone. Our model operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems.
79. 【2603.23976】SilLang: Improving Gait Recognition with Silhouette Language Encoding
链接:https://arxiv.org/abs/2603.23976
作者:Ruiyi Zhan,Guozhen Peng,Canyu Chen,Jian Lei,Annan Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gait silhouettes, Gait, binary gait codes, binary gait silhouettes, widely adopted
备注:
点击查看摘要
Abstract:Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving successful performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes that inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed Silhouette Language Model, which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.
80. 【2603.23975】HyDRA: Hybrid Domain-Aware Robust Architecture for Heterogeneous Collaborative Perception
链接:https://arxiv.org/abs/2603.23975
作者:Minwoo Song,Minhee Kang,Heejin Ahn
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:training data distributions, Hybrid Domain-Aware Robust, collaborative perception, data distributions, Domain-Aware Robust Architecture
备注: 8 pages, 6 figures, Submitted to IROS 2026
点击查看摘要
Abstract:In collaborative perception, an agent's performance can be degraded by heterogeneity arising from differences in model architecture or training data distributions. To address this challenge, we propose HyDRA (Hybrid Domain-Aware Robust Architecture), a unified pipeline that integrates intermediate and late fusion within a domain-aware framework. We introduce a lightweight domain classifier that dynamically identifies heterogeneous agents and assigns them to the late-fusion branch. Furthermore, we propose anchor-guided pose graph optimization to mitigate localization errors inherent in late fusion, leveraging reliable detections from intermediate fusion as fixed spatial anchors. Extensive experiments demonstrate that, despite requiring no additional training, HyDRA achieves performance comparable to state-of-the-art heterogeneity-aware CP methods. Importantly, this performance is maintained as the number of collaborating agents increases, enabling zero-cost scaling without retraining.
81. 【2603.23973】SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents
链接:https://arxiv.org/abs/2603.23973
作者:Rocktim Jyoti Das,Dinesh Manocha
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
关键词:digital twin generation, physics-based simulation, critical for physics-based, digital twin, material property field
备注: 8 pages, 4 figures
点击查看摘要
Abstract:Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encodes rich geometry and semantic priors, and trains a lightweight neural decoder to estimate Young's modulus, density, and Poisson's ratio. The coarse volumetric layout and semantic cues of the latent representation about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTX A5000 GPU and avoids reconstruction and voxelization preprocessing. This results in a 120x speedup compared to prior methods and enables faster material property estimation from a single image.
82. 【2603.23961】GRMLR: Knowledge-Enhanced Small-Data Learning for Deep-Sea Cold Seep Stage Inference
链接:https://arxiv.org/abs/2603.23961
作者:Chenxu Zhou,Zelin Liu,Rui Cai,Houlin Gong,Yikang Yu,Jia Zeng,Yanru Pei,Liang Zhang,Weishu Zhao,Xiaofeng Gao
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:high-risk manned submersible, cold seep stage, manned submersible operations, Deep-sea cold seep, seep stage assessment
备注:
点击查看摘要
Abstract:Deep-sea cold seep stage assessment has traditionally relied on costly, high-risk manned submersible operations and visual surveys of macrofauna. Although microbial communities provide a promising and more cost-effective alternative, reliable inference remains challenging because the available deep-sea dataset is extremely small ($n = 13$) relative to the microbial feature dimension ($p = 26$), making purely data-driven models highly prone to overfitting. To address this, we propose a knowledge-enhanced classification framework that incorporates an ecological knowledge graph as a structural prior. By fusing macro-microbe coupling and microbial co-occurrence patterns, the framework internalizes established ecological logic into a Graph-Regularized Multinomial Logistic Regression (GRMLR) model, effectively constraining the feature space through a manifold penalty to ensure biologically consistent classification. Importantly, the framework removes the need for macrofauna observations at inference time: macro-microbe associations are used only to guide training, whereas prediction relies solely on microbial abundance profiles. Experimental results demonstrate that our approach significantly outperforms standard baselines, highlighting its potential as a robust and scalable framework for deep-sea ecological assessment.
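One common way to write a graph-regularized multinomial logistic regression objective is cross-entropy plus a penalty that pulls the weight vectors of graph-connected features together. The sketch below uses that form as an assumption; the abstract only states that a manifold penalty is used, so the exact penalty, the edge weights, and every name here (`graph_penalty`, `objective`, the toy data) are hypothetical and not taken from the paper.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def graph_penalty(W, edges):
    """Sum over edges (i, j, a_ij) of a_ij * ||W[i] - W[j]||^2.

    W[i] is the weight row of feature i across classes; linked features
    are encouraged to have similar weights (assumed Laplacian-style penalty).
    """
    return sum(
        a * sum((wi - wj) ** 2 for wi, wj in zip(W[i], W[j]))
        for i, j, a in edges
    )

def objective(W, X, y, edges, lam):
    """Mean cross-entropy of multinomial logistic regression plus lam * graph penalty."""
    n_classes = len(W[0])
    nll = 0.0
    for x, label in zip(X, y):
        logits = [sum(x[f] * W[f][c] for f in range(len(x))) for c in range(n_classes)]
        nll -= math.log(softmax(logits)[label])
    return nll / len(X) + lam * graph_penalty(W, edges)

# Hypothetical toy data: 3 microbial features, 2 stages, 2 samples.
X = [[1.0, 0.5, 0.0], [0.0, 0.2, 1.0]]
y = [0, 1]
edges = [(0, 1, 1.0)]  # features 0 and 1 are linked in the toy knowledge graph

W_zero = [[0.0, 0.0] for _ in range(3)]
W_smooth = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # linked features agree
W_rough = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]   # linked features disagree
```

With only 13 samples and 26 features, a penalty of this kind reduces the effective degrees of freedom by tying together features the knowledge graph says should behave alike, which is the overfitting argument the abstract makes.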
83. 【2603.23960】Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection
链接:https://arxiv.org/abs/2603.23960
作者:Jielun Peng,Yabin Wang,Yaqi Li,Long Kong,Xiaopeng Hong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabled hyper-realistic audio-visual, intensifying threats, social trust, rapid progress, progress of generative
备注:
点击查看摘要
Abstract:The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at this https URL.
84. 【2603.23957】PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning
链接:https://arxiv.org/abs/2603.23957
作者:Yankai Wang,Yiding Sun,Qirui Wang,Pengbo Li,Chaoyi Lu,Dongxu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Understanding spatial dynamics, Relative Policy Optimization, Understanding spatial, Group Relative Policy, Policy Optimization
备注:
点击查看摘要
Abstract:Understanding spatial dynamics and semantics in point clouds is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.
85. 【2603.23956】SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization
链接:https://arxiv.org/abs/2603.23956
作者:Qi Zhang,Daijie Chen,Yunfei Gong,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing multi-view crowd, multi-view crowd, existing methods impractical, crowd, limited crowd numbers
备注: IJCV 2026
点击查看摘要
Abstract:Existing multi-view crowd counting and localization methods are evaluated under relatively small scenes with limited crowd numbers, camera views, and frames. This makes the evaluation and comparison of existing methods impractical, as small datasets are easily overfit by these methods. To avoid these issues, 3DROM proposes a data augmentation method. Instead, in this paper, we propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization tasks. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which is more suitable for large-scene multi-view crowd vision tasks. Besides, we propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we show that, with the aid of the benchmark, better domain-transfer performance for multi-view and single-image counting can be achieved on novel real scenes. As a result, the proposed benchmark could advance the research for multi-view and single-image crowd counting and localization to more practical applications. The codes and datasets are here: this https URL.
86. 【2603.23953】VOLMO: Versatile and Open Large Models for Ophthalmology
链接:https://arxiv.org/abs/2603.23953
作者:Zhenyue Qin,Younjoon Chung,Elijah Lee,Wanyue Feng,Xuguang Ai,Serina Applebaum,Minjie Zou,Yang Liu,Pan Xiao,Mac Singer,Amisha Dave,Aidan Gilson,Tiarnan D. L. Keenan,Emily Y. Chew,Zhiyong Lu,Yih-Chung Tham,Ron Adelman,Luciano V. Del Priore,Qingyu Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
关键词:irreversible vision loss, Vision impairment affects, preventing irreversible vision, affects millions globally, impairment affects millions
备注:
点击查看摘要
Abstract:Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.
87. 【2603.23940】High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking
链接:https://arxiv.org/abs/2603.23940
作者:Peipeng Yu,Jinfeng Xie,Chengfu Ou,Xiaoyu Zhou,Jianwei Fei,Yunshu Dai,Zhihua Xia,Chip Hong Chang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:poses severe threats, proliferation of AIGC-driven, threats to media, AIGC-driven face manipulation, copyright protection
备注:
点击查看摘要
Abstract:The proliferation of AIGC-driven face manipulation and deepfakes poses severe threats to media provenance, integrity, and copyright protection. Prior versatile watermarking systems typically rely on embedding explicit localization payloads, which introduces a fidelity-functionality trade-off: larger localization signals degrade visual quality and often reduce decoding robustness under strong generative edits. Moreover, existing methods rarely support content recovery, limiting their forensic value when original evidence must be reconstructed. To address these challenges, we present VeriFi, a versatile watermarking framework that unifies copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery. VeriFi makes three key contributions: (1) it embeds a compact semantic latent watermark that serves as a content-preserving prior, enabling faithful restoration even after severe manipulations; (2) it achieves fine-grained localization without embedding localization-specific artifacts by correlating image features with decoded provenance signals; and (3) it introduces an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness to realistic deepfake pipelines. Extensive experiments on CelebA-HQ and FFHQ show that VeriFi consistently outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality, providing a practical and verifiable defense for deepfake forensics.
88. 【2603.23934】Revealing Multi-View Hallucination in Large Vision-Language Models
链接:https://arxiv.org/abs/2603.23934
作者:Wooje Park,Insu Lee,Soohyun Kim,Jaeyun Jang,Minyoung Noh,Kyuhong Shim,Byonghyo Shim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large vision-language models, image inputs captured, Large vision-language, multi-view image inputs, vision-language models
备注:
点击查看摘要
Abstract:Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
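Contrastive decoding methods in this family typically combine two sets of next-token logits: one from the normal forward pass and one from a deliberately degraded ("negative") pass. The sketch below uses the widely used combination `(1 + alpha) * positive - alpha * negative` as an assumption; the abstract does not give RSCD's exact formula, and the function names, the `alpha` value, and the toy logits are all hypothetical. How the negative logits are produced (attention masking in RSCD) is not modeled here.

```python
def contrastive_logits(positive, negative, alpha=1.0):
    """Amplify what the positive pass supports and the negative pass does not.

    positive: logits from the normal forward pass.
    negative: logits from a pass designed to expose interference.
    alpha: contrast strength (assumed hyperparameter).
    """
    return [(1 + alpha) * p - alpha * n for p, n in zip(positive, negative)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Toy vocabulary of three candidate answers:
# index 0 = answer grounded in the queried view,
# index 1 = answer leaked from another view, index 2 = unrelated.
positive = [2.0, 2.2, 0.1]  # the plain model slightly prefers the leaked answer
negative = [0.5, 2.4, 0.1]  # the negative pass strongly prefers the leaked answer

adjusted = contrastive_logits(positive, negative, alpha=1.0)
before = argmax(positive)  # 1: the leaked answer wins without contrast
after = argmax(adjusted)   # 0: the grounded answer wins after contrast
```

Because the subtraction happens at decoding time, an approach of this kind needs no retraining, which matches the abstract's description of RSCD as training-free.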
89. 【2603.23933】ORACLE: Orchestrate NPC Daily Activities using Contrastive Learning with Transformer-CVAE
链接:https://arxiv.org/abs/2603.23933
作者:Seong-Eun Hong,JuYeong Hwang,RyunHa Lee,HyeongYeop Kang
类目:Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:augment user immersion, Non-player characters, integration of Non-player, cognitive engagement, digital environments
备注: 17 pages, 7 figures. Accepted to CVM 2026
点击查看摘要
Abstract:The integration of Non-player characters (NPCs) within digital environments has been increasingly recognized for its potential to augment user immersion and cognitive engagement. The sophisticated orchestration of their daily activities, reflecting the nuances of human daily routines, contributes significantly to the realism of digital environments. Nevertheless, conventional approaches often produce monotonous repetition, falling short of capturing the intricacies of real human activity plans. In response to this, we introduce ORACLE, a novel generative model for the synthesis of realistic indoor daily activity plans, ensuring NPCs' authentic presence in digital habitats. Exploiting the CASAS smart home dataset's 24-hour indoor activity sequences, ORACLE addresses challenges in the dataset, including its imbalanced sequential data, the scarcity of training samples, and the absence of pre-trained models encapsulating human daily activity patterns. ORACLE's training leverages the sequential data processing prowess of Transformers, the generative controllability of Conditional Variational Autoencoders (CVAE), and the discriminative refinement of contrastive learning. Our experimental results validate the superiority of generating NPC activity plans and the efficacy of our design strategies over existing methods.
90. 【2603.23925】DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models
Link: https://arxiv.org/abs/2603.23925
Authors: Hongyi Miao, Jun Jia, Xincheng Wang, Qianli Ma, Wei Sun, Wangqiu Zhou, Dandan Zhu, Yewen Cao, Zhi Liu, Guangtao Zhai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, image understanding capabilities, endowed vision-language models, fine-grained image understanding, understanding capabilities
Comments:
Abstract:Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target facial identity and their private property and social relationships into the model's internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user's private information upon input of their photos. To benchmark VLMs' susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2 can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic datasets, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Through optimizing imperceptible perturbations by pushing the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs' encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.
91. 【2603.23924】DepthArb: Training-Free Depth-Arbitrated Generation for Occlusion-Robust Image Synthesis
Link: https://arxiv.org/abs/2603.23924
Authors: Hongjin Niu, Jiahao Wang, Xirui Hu, Weizhan Zhang, Lan Ma, Yuan Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: frequently exhibit deficiencies, synthesizing accurate occlusion, accurate occlusion relationships, models frequently exhibit, dense overlapping regions
Comments:
Abstract:Text-to-image diffusion models frequently exhibit deficiencies in synthesizing accurate occlusion relationships of multiple objects, particularly within dense overlapping regions. Existing training-free layout-guided methods predominantly rely on rigid spatial priors that remain agnostic to depth order, often resulting in concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguities by arbitrating attention competition between interacting objects. Specifically, DepthArb employs two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. These mechanisms enable robust occlusion generation without model retraining. To systematically evaluate this capability, we propose OcclBench, a comprehensive benchmark designed to evaluate diverse occlusion scenarios. Extensive evaluations demonstrate that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones, offering a novel perspective on spatial layering within generative models.
92. 【2603.23919】Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction
Link: https://arxiv.org/abs/2603.23919
Authors: Kai-Yu Fu, Yi-Ting Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: study object importance-based, object importance-based vision, intelligent driving systems, risk object identification, vision risk object
Comments: IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract:We study object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing approaches make deterministic decisions and ignore uncertainty, which could lead to safety-critical failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework to model risk uncertainty jointly across space and time. We propose Conformal Risk Tube Prediction, a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. To conduct a systematic evaluation, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects, which are not supported by existing datasets. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. More qualitative results are available on the project webpage linked from the arXiv listing.
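The coverage guarantee mentioned in this abstract comes from conformal prediction. As a rough illustration, the sketch below shows standard split conformal calibration, not the paper's risk-tube formulation: a threshold is taken from held-out nonconformity scores so that a new score falls at or below it with probability at least 1 - alpha. The function name and the toy scores are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split conformal quantile.

    Returns the k-th smallest calibration score with
    k = ceil((n + 1) * (1 - alpha)). If k exceeds n, no finite threshold
    gives the requested coverage, so infinity is returned.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


# Toy nonconformity scores from a hypothetical calibration split.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
threshold = conformal_threshold(scores, alpha=0.5)
```

At test time, every candidate whose nonconformity score is at most `threshold` would be kept in the prediction set; with too few calibration samples for the requested alpha, the function returns infinity and the set contains everything.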
93. 【2603.23916】DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
Link: https://arxiv.org/abs/2603.23916
Authors: Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: identify deceptive behavior, analyzing audiovisual cues, forensics and security, aims to identify, identify deceptive
Comments: 13 pages, 8 figures, 7 tables
Abstract:Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and codes will be released.
94. 【2603.23914】Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
Link: https://arxiv.org/abs/2603.23914
Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large Vision-Language Models, achieved remarkable success, significant challenge due, time efficiency remains, Large Vision-Language
Comments:
Abstract:Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models that improves memory efficiency during decoding, focusing on the challenges posed by the increased number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality, or supporting longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.
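Why a low-rank structure saves memory can be seen from simple element counts. The sketch below assumes a plain rank-r factorization of an n x d key or value matrix into an n x r and an r x d factor; the function name and the sizes are illustrative, and the abstract does not describe AttentionPack's actual compaction scheme.

```python
def low_rank_compression_ratio(n_tokens, head_dim, rank):
    """Ratio of dense storage to factored storage for one cache matrix.

    Dense storage holds n_tokens * head_dim elements; the two factors of a
    rank-`rank` factorization hold rank * (n_tokens + head_dim) elements.
    """
    if min(n_tokens, head_dim, rank) <= 0:
        raise ValueError("all sizes must be positive")
    dense_elements = n_tokens * head_dim
    factored_elements = rank * (n_tokens + head_dim)
    return dense_elements / factored_elements


# Hypothetical sizes: 4096 cached tokens, head dimension 128, rank 16.
ratio = low_rank_compression_ratio(n_tokens=4096, head_dim=128, rank=16)
```

With these assumed sizes the factored form needs roughly 7.8 times fewer elements than the dense matrix; the saving shrinks as the rank grows and disappears once `rank * (n_tokens + head_dim)` reaches `n_tokens * head_dim`.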
95. 【2603.23906】GenMask: Adapting DiT for Segmentation via Direct Mask
Link: https://arxiv.org/abs/2603.23906
Authors: Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent approaches, pretrained generative models, leveraged pretrained generative, leveraged pretrained, indirect feature retrieval
Comments: Accepted by CVPR 2026
Abstract:Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
96. 【2603.23903】Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation
Link: https://arxiv.org/abs/2603.23903
Authors: Weiming Chen, Qifan Liu, Siyi Liu, Yushun Tang, Yijia Wang, Zhihan Zhu, Zhihai He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Recent research, generating high-quality images, high-quality images guided, diffusion inversion, diffusion
Comments:
Abstract:Recent research has shown that text-to-image diffusion models are capable of generating high-quality images guided by text prompts. But can they be used to generate or approximate real-world images from the seed noise? This is known as the diffusion inversion problem, which serves as a fundamental building block for bridging diffusion models and real-world scenarios. However, existing diffusion inversion methods often suffer from low reconstruction quality or weak robustness. Two major challenges need to be carefully addressed: (1) the misalignment between the inversion and generation trajectories during the diffusion process, and (2) the mismatch between the diffusion inversion process and the VQ autoencoder (VQAE) reconstruction. To address these challenges, we introduce a latent bias vector at each inversion step, which is learned to reduce the misalignment between inversion and generation trajectories. We refer to this strategy as Latent Bias Optimization (LBO). Furthermore, we perform an approximate joint optimization of the diffusion inversion and VQAE reconstruction processes by learning to adjust the image latent representation, which serves as the connecting interface between them. We refer to this technique as Image Latent Boosting (ILB). Extensive experimental results demonstrate that the proposed method significantly improves the image reconstruction quality of the diffusion model, as well as the performance of downstream tasks, including image editing and rare concept generation.
97. 【2603.23902】Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval
Link: https://arxiv.org/abs/2603.23902
Authors: Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Retrieving partially relevant, partially relevant segments, remains difficult due, overlook semantic focus, Retrieving partially
Comments: Accepted in ICME 2026
Abstract:Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
98. 【2603.23896】MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation
Link: https://arxiv.org/abs/2603.23896
Authors: Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: directly translates textual, translates textual content, text-image machine translation, multilingual scene understanding, text-image machine
Comments: Accepted to CVPR 2026
Abstract:End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.
99. 【2603.23891】FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting
Link: https://arxiv.org/abs/2603.23891
Authors: Yixian Wang, Haolin Yu, Jiadong Tang, Yu Gao, Xihan Wang, Yufeng Yue, Yi Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Splatting has revolutionized, revolutionized neural rendering, Gaussian Splatting, real-time performance, revolutionized neural
Comments:
Abstract:3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we introduce FilterGS, featuring a parallel filtering mechanism with two complementary filters that select Gaussian elements efficiently without tree traversal. Additionally, we propose a novel GTC metric that quantifies the redundancy of Gaussian-tile key-value pairs. Based on this metric, we introduce a scene-adaptive Gaussian shrinking strategy that effectively reduces redundant pairs. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets. The project page is linked from the arXiv listing.
100. 【2603.23885】Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
Link: https://arxiv.org/abs/2603.23885
Authors: Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, large language models, directly map document, map document images, structured outputs
Comments: Accepted to CVPR 2026
Abstract:Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
101. 【2603.23883】BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
Link: https://arxiv.org/abs/2603.23883
Authors: Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Kuniaki Saito, Hiroaki Santo, Fumio Okura
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal data poses, vision and ecology, data poses, poses an emerging, emerging challenge
Comments: CVPR 2026 Main
Abstract:Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is linked from the arXiv listing.
102. 【2603.23874】EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction
Link: https://arxiv.org/abs/2603.23874
Authors: Bingxue Zhao, Qi Zhang, Hui Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: pedestrian trajectories requires, trajectories requires accounting, existing approaches largely, approaches largely emphasize, emphasize social dynamics
Comments: ICLR 2026
Abstract:Modeling realistic pedestrian trajectories requires accounting for both social interactions and environmental context, yet most existing approaches largely emphasize social dynamics. We propose EnvSocial-Diff, a diffusion-based crowd simulation model informed by social physics and augmented with environmental conditioning and individual-group interaction. Our structured environmental conditioning module explicitly encodes obstacles, objects of interest, and lighting levels, providing interpretable signals that capture scene constraints and attractors. In parallel, the individual-group interaction module goes beyond individual-level modeling by capturing both fine-grained interpersonal relations and group-level conformity through a graph-based design. Experiments on multiple benchmark datasets demonstrate that EnvSocial-Diff outperforms the latest state-of-the-art methods, underscoring the importance of explicit environmental conditioning and multi-level social interaction for realistic crowd simulation. The code is linked from the arXiv listing.
103. 【2603.23868】MLE-UVAD: Minimal Latent Entropy Autoencoder for Fully Unsupervised Video Anomaly Detection
Link: https://arxiv.org/abs/2603.23868
Authors: Yuang Geng, Junkai Zhou, Kang Yang, Pan He, Zhuoyang Zhou, Jose C. Principe, Joel Harley, Ivan Ruchkin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fully unsupervised video, problem of single-scene, directly for training, training and testing, unsupervised video anomaly
Comments: Submitted to ECCV 2026. 18 pages, 8 figures. Includes supplementary material
Abstract:In this paper, we address the challenging problem of single-scene, fully unsupervised video anomaly detection (VAD), where raw videos containing both normal and abnormal events are used directly for training and testing without any labels. This differs sharply from prior work that either requires extensive labeling (fully or weakly supervised) or depends on normal-only videos (one-class classification), which are vulnerable to distribution shifts and contamination. We propose an entropy-guided autoencoder that detects anomalies through reconstruction error by reconstructing normal frames well while making anomalies reconstruct poorly. The key idea is to combine the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss in the autoencoder. Reconstruction loss alone maps normal and abnormal inputs to distinct latent clusters due to their inherent differences, but also risks reconstructing anomalies too well to detect. Therefore, MLE loss addresses this by minimizing the entropy of latent embeddings, encouraging them to concentrate around high-density regions. Since normal frames dominate the raw video, sparse anomalous embeddings are pulled into the normal cluster, so the decoder emphasizes normal patterns and produces poor reconstructions for anomalies. This dual-loss design produces a clear reconstruction gap that enables effective anomaly detection. Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset demonstrate that our method achieves robust and superior performance over baselines.
104. 【2603.23867】Can VLMs Reason Robustly? A Neuro-Symbolic Investigation
Link: https://arxiv.org/abs/2603.23867
Authors: Weixin Chen, Antonio Vergari, Han Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: wide range, remains unclear, reason robustly, Vision-Language Models, reasoning
Comments:
Abstract:Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.
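The decoupling described in this abstract, where perception yields concepts and a symbolic program applies the rules exactly, can be sketched in a few lines. Everything below is a hypothetical illustration: the nested-tuple rule format, the concept names, and the `evaluate` function are assumptions, and a plain recursive evaluator stands in for the compiled circuit that VLC uses.

```python
def evaluate(rule, concepts):
    """Evaluate a Boolean rule exactly over recognized concepts.

    `rule` is a nested tuple: ("var", name), ("not", sub),
    ("and", sub1, sub2, ...) or ("or", sub1, sub2, ...).
    `concepts` maps concept names to True/False, as a perception
    model might report them (assumed input format).
    """
    op = rule[0]
    if op == "var":
        return bool(concepts[rule[1]])
    if op == "not":
        return not evaluate(rule[1], concepts)
    if op == "and":
        return all(evaluate(sub, concepts) for sub in rule[1:])
    if op == "or":
        return any(evaluate(sub, concepts) for sub in rule[1:])
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical rule: the answer is "yes" if the image contains a red object
# and does not contain a square.
rule = ("and", ("var", "has_red"), ("not", ("var", "has_square")))

answer_a = evaluate(rule, {"has_red": True, "has_square": False})
answer_b = evaluate(rule, {"has_red": True, "has_square": True})
```

Because the rule is applied deterministically, a change in how the image looks can only affect the answer through the recognized concepts, which is the property the neuro-symbolic split is meant to provide.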
105. 【2603.23864】See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning
Link: https://arxiv.org/abs/2603.23864
Authors: Yuxi Wei, Wei Huang, Qirui Chen, Lu Hou, Xiaojuan Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: crucial deployment-critical requirements, remain offline-evaluating post-hoc, benchmarks remain offline-evaluating, long-horizon streaming inference, embodied agents
Comments:
Abstract:Spatial understanding is fundamental for embodied agents, yet most spatial VLMs and benchmarks remain offline, evaluating post-hoc QA over pre-recorded inputs and overlooking two deployment-critical requirements: long-horizon streaming inference and active perception when the current view is insufficient. To address this gap, we introduce S3-Bench, a benchmark suite for streaming spatial question answering with active exploration, where queries are temporally grounded to specific timestamps and must be answered using only observations available up to that moment. S3-Bench adopts a dual-domain design, combining a scalable simulator with controllable trajectories and exploration actions, and real-world streaming videos that capture practical sensing artifacts for rigorous generalization evaluation. Overall, it spans 10K+ scenes and 26K+ trajectories, with dedicated training (S3-Train) and evaluation (S3-Eval) splits. We further propose AMF-VLM, which supports streaming spatial reasoning under bounded computing via (i) memory folding, which compresses long-horizon observations into compact structured memory, and (ii) active exploration, which outputs explicit actions (e.g., move/rotate/scan) to acquire missing evidence before answering. Extensive experiments demonstrate that, compared to models using identical training data, our approach yields improvements of 8.8% and 13.3% on the simulated and real splits of S3-Eval, respectively, while maintaining competitive transferability to standard spatial benchmarks.
106. 【2603.23845】3D-LLDM: Label-Guided 3D Latent Diffusion Model for Improving High-Resolution Synthetic MR Imaging in Hepatic Structure Segmentation
Link: https://arxiv.org/abs/2603.23845
Authors: Kyeonghun Kim, Jaehyeok Bae, Youngung Han, Joo Young Bae, Seoyoung Ju, Junsu Lim, Gyeongmin Kim, Nam-Joon Kim, Woo Kyoung Jeong, Ken Ying-Kai Liao, Won Jae Lee, Pa Hong, Hyuk-Jae Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: downstream analysis tasks, Deep learning, advancing rapidly, analysis tasks, learning and generative
Comments: Accepted to ISBI 2026 (Oral). Camera-ready version
Abstract:Deep learning and generative models are advancing rapidly, with synthetic data increasingly being integrated into training pipelines for downstream analysis tasks. However, in medical imaging, their adoption remains constrained by the scarcity of reliable annotated datasets. To address this limitation, we propose 3D-LLDM, a label-guided 3D latent diffusion model that generates high-quality synthetic magnetic resonance (MR) volumes with corresponding anatomical segmentation masks. Our approach uses hepatobiliary phase MR images enhanced with the Gd-EOB-DTPA contrast agent to derive structural masks for the liver, portal vein, hepatic vein, and hepatocellular carcinoma, which then guide volumetric synthesis through a ControlNet-based architecture. Trained on 720 real clinical hepatobiliary phase MR scans from Samsung Medical Center, 3D-LLDM achieves a Fréchet Inception Distance (FID) of 28.31, improving over GANs by 70.9% and over state-of-the-art diffusion baselines by 26.7%. When used for data augmentation, the synthetic volumes improve hepatocellular carcinoma segmentation by up to 11.153% Dice score across five CNN architectures.
107. 【2603.23794】Sparse Autoencoders for Interpretable Medical Image Representation Learning
Link: https://arxiv.org/abs/2603.23794
Authors: Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Vision foundation models, abstract latent representations, sparse features, foundation models, abstract latent
Comments: 11 pages, 4 figures
Click to view abstract
Abstract:Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity (R² up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation, and (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: this https URL.
108. 【2603.23788】Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track
Link: https://arxiv.org/abs/2603.23788
Authors: Mingqi Gao, Sijie Li, Jungong Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: technical report explores, complex semi-supervised video, video object segmentation, targets complex semi-supervised, semi-supervised video object
Comments:
Click to view abstract
Abstract:This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM 3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM 3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching with a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple design directly addresses several core challenges of MOSEv2. Our solution achieves a JF of 51.17% on the test set, ranking 3rd in the MOSEv2 track.
109. 【2603.23785】Retinal Disease Classification from Fundus Images using CNN Transfer Learning
Link: https://arxiv.org/abs/2603.23785
Authors: Ali Akram
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: visual impairment worldwide, Retinal diseases remain, impairment worldwide, leading preventable, visual impairment
Comments: 4 figures
Click to view abstract
Abstract:Retinal diseases remain among the leading preventable causes of visual impairment worldwide. Automated screening based on fundus image analysis has the potential to expand access to early detection, particularly in underserved populations. This paper presents a reproducible deep learning pipeline for binary retinal disease risk classification from publicly available fundus photographs. We implement and compare a baseline convolutional neural network with a transfer learning approach using a pretrained VGG16 backbone and evaluate generalization on held-out data. To address class imbalance, we apply class weighting and report standard classification metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC-AUC. The VGG16 transfer learning model achieves 90.8% test accuracy with a weighted F1-score of 0.90, substantially outperforming the baseline CNN (83.1% accuracy). Results indicate that transfer learning improves discrimination compared to a baseline CNN, while also revealing remaining challenges in sensitivity to minority disease cases. We discuss practical limitations related to dataset characteristics, class imbalance, and threshold selection, and provide guidance for reproducibility and future improvements for clinically reliable screening.
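The abstract mentions class weighting but does not say which scheme the author used. As a minimal sketch, assuming the common inverse-frequency ("balanced") heuristic `n_samples / (n_classes * class_count)`, the weights could be computed like this. The function name `balanced_class_weights` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this is the widely used "balanced" heuristic, not
    necessarily the weighting applied in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# Toy example: 8 "no disease risk" images (0) and 2 "disease risk" images (1).
weights = balanced_class_weights([0] * 8 + [1] * 2)
print(weights)  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so its misclassifications contribute more to a weighted loss.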
110. 【2603.23766】Semantic Iterative Reconstruction: One-Shot Universal Anomaly Detection
Link: https://arxiv.org/abs/2603.23766
Authors: Ning Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unsupervised medical anomaly, Unsupervised medical, Semantic Iterative Reconstruction, severely limited, normal training samples
Comments: 8 pages, 2 figures, 5 tables
Click to view abstract
Abstract:Unsupervised medical anomaly detection is severely limited by the scarcity of normal training samples. Existing methods typically train dedicated models for each dataset or disease, requiring hundreds of normal images per task and lacking cross-modality generalization. We propose Semantic Iterative Reconstruction (SIR), a framework that enables a single universal model to detect anomalies across diverse medical domains using extremely few normal samples. SIR leverages a pretrained teacher encoder to extract multi-scale deep features and employs a compact up-then-down decoder with multi-loop iterative refinement to enforce robust normality priors in deep feature space. The framework adopts a one-shot universal design: a single model is trained by mixing exactly one normal sample from each of nine heterogeneous datasets, enabling effective anomaly detection on all corresponding test sets without task-specific retraining. Extensive experiments on nine medical benchmarks demonstrate that SIR achieves state-of-the-art results under all four settings -- one-shot universal, full-shot universal, one-shot specialized, and full-shot specialized -- consistently outperforming previous methods. SIR offers an efficient and scalable solution for multi-domain clinical anomaly detection. Code is available at this https URL.
111. 【2603.23757】Learning Cross-Joint Attention for Generalizable Video-Based Seizure Detection
Link: https://arxiv.org/abs/2603.23757
Authors: Omar Zamzam, Takfarinas Medani, Chinmay Chinara, Richard Leahy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enable real-time monitoring, Automated seizure detection, substantially reduce manual, reduce manual review, manual review time
Comments:
Click to view abstract
Abstract:Automated seizure detection from long-term clinical videos can substantially reduce manual review time and enable real-time monitoring. However, existing video-based methods often struggle to generalize to unseen subjects due to background bias and reliance on subject-specific appearance cues. We propose a joint-centric attention model that focuses exclusively on body dynamics to improve cross-subject generalization. For each video segment, body joints are detected and joint-centered clips are extracted, suppressing background context. These joint-centered clips are tokenized using a Video Vision Transformer (ViViT), and cross-joint attention is learned to model spatial and temporal interactions between body parts, capturing coordinated movement patterns characteristic of seizure semiology. Extensive cross-subject experiments show that the proposed method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.
112. 【2603.23754】IJmond Industrial Smoke Segmentation Dataset
Link: https://arxiv.org/abs/2603.23754
Authors: Yen-Chia Hsu, Despoina Touska
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: industrial smoke segmentation, https URL, smoke segmentation, figshare repository, report describes
Comments:
Click to view abstract
Abstract:This report describes a dataset for industrial smoke segmentation, published on a figshare repository (this https URL). The dataset is licensed under CC BY 4.0.
113. 【2603.23742】Detection and Classification of (Pre)Cancerous Cells in Pap Smears: An Ensemble Strategy for the RIVA Cervical Cytology Challenge
Link: https://arxiv.org/abs/2603.23742
Authors: Lautaro Kogan, María Victoria Ríos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: conventional Pap smear, Pap smear images, reducing manual workload, strengthen cervical cancer, cervical cancer screening
Comments: Accepted for Poster Presentation at the RIVA Cervical Cytology Challenge, IEEE ISBI 2026. 4 pages, 2 figures
Click to view abstract
Abstract:Automated detection and classification of cervical cells in conventional Pap smear images can strengthen cervical cancer screening at scale by reducing manual workload, improving triage, and increasing consistency across readers. However, it is challenged by severe class imbalance and frequent nuclear overlap. We present our approach to the RIVA Cervical Cytology Challenge (ISBI 2026), which requires multi-class detection of eight Bethesda cell categories under these conditions. Using YOLOv11m as the base architecture, we systematically evaluate three strategies to improve detection performance: loss reweighting, data resampling and transfer learning. We build an ensemble from models trained under each strategy, promoting complementary detection behavior, and merge their predictions through Weighted Boxes Fusion (WBF). The ensemble achieves a mAP50-95 of 0.201 on the preliminary test set and 0.147 on the final test set, representing a 29% improvement over the best individual model on the final test set and demonstrating the effectiveness of combining complementary imbalance mitigation strategies.
114. 【2603.23730】An Adapter-free Fine-tuning Approach for Tuning 3D Foundation Models
Link: https://arxiv.org/abs/2603.23730
Authors: Sneha Paul, Zachary Patterson, Nizar Bouguila
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Point cloud foundation, Point cloud, demonstrate strong generalization, cloud foundation models, downstream tasks remains
Comments: Accepted at The Fifth International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI 2026)
Click to view abstract
Abstract:Point cloud foundation models demonstrate strong generalization, yet adapting them to downstream tasks remains challenging in low-data regimes. Full fine-tuning often leads to overfitting and significant drift from pre-trained representations, while existing parameter-efficient fine-tuning (PEFT) methods mitigate this issue by introducing additional trainable components at the cost of increased inference-time latency. We propose Momentum-Consistency Fine-Tuning (MCFT), an adapter-free approach that bridges the gap between full and parameter-efficient fine-tuning. MCFT selectively fine-tunes a portion of the pre-trained encoder while enforcing a momentum-based consistency constraint to preserve task-agnostic representations. Unlike PEFT methods, MCFT introduces no additional representation learning parameters beyond a standard task head, maintaining the original model's parameter count and inference efficiency. We further extend MCFT with two variants: a semi-supervised framework that leverages abundant unlabeled data to enhance few-shot performance, and a pruning-based variant that improves computational efficiency through structured layer removal. Extensive experiments on object recognition and part segmentation benchmarks demonstrate that MCFT consistently outperforms prior methods, achieving a 3.30% gain in 5-shot settings and up to a 6.13% improvement with semi-supervised learning, while remaining well-suited for resource-constrained deployment.
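The abstract does not spell out how the momentum-based consistency constraint is computed. One common reading, offered here only as an assumption, is an exponential-moving-average (EMA) copy of the encoder whose features the fine-tuned encoder is penalized for drifting away from. The sketch below uses plain Python lists in place of tensors; `ema_update`, `consistency_loss`, and the momentum value are hypothetical and are not MCFT's actual implementation.

```python
def ema_update(momentum_params, online_params, m=0.99):
    """Move the momentum copy slowly toward the online (fine-tuned) weights."""
    return [m * p_m + (1.0 - m) * p_o
            for p_m, p_o in zip(momentum_params, online_params)]


def consistency_loss(online_feat, momentum_feat):
    """Mean squared distance between online and momentum-encoder features."""
    diffs = [(a - b) ** 2 for a, b in zip(online_feat, momentum_feat)]
    return sum(diffs) / len(diffs)


# Toy usage: the momentum copy lags behind the online weights, and the
# penalty grows as the online features drift from the momentum features.
momentum = ema_update([1.0, 0.0], [0.0, 1.0], m=0.9)
penalty = consistency_loss([1.0, 2.0], [1.0, 4.0])
print(momentum, penalty)
```

In a training loop under this assumption, the penalty would be added to the task loss and the EMA update applied after each optimizer step.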
115. 【2603.23729】Bi-CRCL: Bidirectional Conservative-Radical Complementary Learning with Pre-trained Foundation Models for Class-incremental Medical Image Analysis
Link: https://arxiv.org/abs/2603.23729
Authors: Xinyao Wu, Zhe Xu, Cheng Chen, Jiawei Ma, Yefeng Zheng, Raymond Kai-yu Tong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scalable clinical deployment, image-guided diagnosis requires, diagnosis requires retaining, newly emerging disease, medical image-guided diagnosis
Comments: preprint; under review
Click to view abstract
Abstract:Class-incremental learning (CIL) in medical image-guided diagnosis requires retaining prior diagnostic knowledge while adapting to newly emerging disease categories, which is critical for scalable clinical deployment. This problem is particularly challenging due to heterogeneous data and privacy constraints that prevent memory replay. Although pretrained foundation models (PFMs) have advanced general-domain CIL, their potential in medical imaging remains underexplored, where domain-specific adaptation is essential yet difficult due to anatomical complexity and inter-institutional heterogeneity. To address this gap, we conduct a systematic benchmark of recent PFM-based CIL methods and propose Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL), a dual-learner framework inspired by complementary learning systems. Bi-CRCL integrates a conservative learner that preserves prior knowledge through stability-oriented updates and a radical learner that rapidly adapts to new categories via plasticity-oriented learning. A bidirectional interaction mechanism enables forward transfer and backward consolidation, allowing continual integration of new knowledge while mitigating catastrophic forgetting. During inference, outputs from both learners are adaptively fused for robust predictions. Experiments on five medical imaging datasets demonstrate consistent improvements over state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations.
116. 【2603.23711】Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
Link: https://arxiv.org/abs/2603.23711
Authors: Morui Zhu, Yongqi Zhu, Song Fu, Qing Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unique challenges due, poses unique challenges, sensor poses caused, articulated tractor-trailer geometry, time-varying sensor poses
Comments: accepted to CVPR2026
Click to view abstract
Abstract:Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.
117. 【2603.23694】CoRe: Joint Optimization with Contrastive Learning for Medical Image Registration
Link: https://arxiv.org/abs/2603.23694
Authors: Eytan Kats, Christoph Grossbroehmer, Ziad Al-Haj Hemidi, Fenja Falta, Wiebke Heyer, Mattias P. Heinrich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: medical image analysis, Medical image registration, Medical image, enabling the alignment, time points
Comments: Preprint
Click to view abstract
Abstract:Medical image registration is a fundamental task in medical image analysis, enabling the alignment of images from different modalities or time points. However, intensity inconsistencies and nonlinear tissue deformations pose significant challenges to the robustness of registration methods. Recent approaches leveraging self-supervised representation learning show promise by pre-training feature extractors to generate robust anatomical embeddings, which are then used for registration. In this work, we propose a novel framework that integrates equivariant contrastive learning directly into the registration model. Our approach leverages the power of contrastive learning to learn robust feature representations that are invariant to tissue deformations. By jointly optimizing the contrastive and registration objectives, we ensure that the learned representations are not only informative but also suitable for the registration task. We evaluate our method on abdominal and thoracic image registration tasks, including both intra-patient and inter-patient scenarios. Experimental results demonstrate that the integration of contrastive learning directly into the registration framework significantly improves performance, surpassing strong baseline methods.
118. 【2603.23686】AdvSplat: Adversarial Attacks on Feed-Forward Gaussian Splatting Models
Link: https://arxiv.org/abs/2603.23686
Authors: Yiran Qiao, Yiren Lu, Yunlai Zhou, Rui Yang, Linlin Hou, Yu Yin, Jing Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, paradigm for real-time, increasingly recognized, powerful paradigm, Gaussian
Comments:
Click to view abstract
Abstract:3D Gaussian Splatting (3DGS) is increasingly recognized as a powerful paradigm for real-time, high-fidelity 3D reconstruction. However, its per-scene optimization pipeline limits scalability and generalization, and prevents efficient inference. Recently emerged feed-forward 3DGS models address these limitations by enabling fast reconstruction from a few input views after large-scale pretraining, without scene-specific optimization. Despite their advantages and strong potential for commercial deployment, the use of neural networks as the backbone also amplifies the risk of adversarial manipulation. In this paper, we introduce AdvSplat, the first systematic study of adversarial attacks on feed-forward 3DGS. We first employ white-box attacks to reveal fundamental vulnerabilities of this model family. We then develop two improved, practically relevant, query-efficient black-box algorithms that optimize pixel-space perturbations via a frequency-domain parameterization: one based on gradient estimation and the other gradient-free, without requiring any access to model internals. Extensive experiments across multiple datasets demonstrate that AdvSplat can significantly disrupt reconstruction results by injecting imperceptible perturbations into the input images. Our findings surface an overlooked yet urgent problem in this domain, and we hope to draw the community's attention to this emerging security and robustness challenge.
119. 【2603.23684】MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
Link: https://arxiv.org/abs/2603.23684
Authors: Nikolai Warner, Cameron Ethan Taylor, Irfan Essa, Apaar Sadhwani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: systems learn shared, Text-motion retrieval systems, learn shared embedding, retrieval systems learn, Text-motion retrieval
Comments:
Click to view abstract
Abstract:Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
120. 【2603.23677】Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection
Link: https://arxiv.org/abs/2603.23677
Authors: Shreen Gul, Mohamed Elmahallawy, Ardhendu Tripathy, Sanjay Madria
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Deep learning models, Deep learning, safety-critical applications, ensure robustness, learning models
Comments:
Click to view abstract
Abstract:Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies $L_2$ normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score: ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: this https URL.
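The scoring step described in this abstract (class-wise mean embeddings, L2 normalization, cosine similarity to the nearest prototype) can be sketched in a few lines. This is an illustration rather than the authors' code: it assumes the multi-layer features have already been aggregated into one vector per sample, and the names `build_prototypes` and `id_score` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_prototypes(features, labels):
    """Class-wise mean of the training features, then L2-normalized."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: l2_normalize([s / counts[y] for s in sums[y]]) for y in sums}


def id_score(feature, prototypes):
    """Highest cosine similarity to any prototype; low values suggest OOD."""
    f = l2_normalize(feature)
    return max(sum(a * b for a, b in zip(f, p)) for p in prototypes.values())


# Toy 2-D features for two in-distribution classes.
train_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]
protos = build_prototypes(train_feats, train_labels)
near = id_score([1.0, 0.05], protos)   # close to the class-0 prototype
far = id_score([-1.0, -1.0], protos)   # far from every prototype
print(near > far)  # True
```

A threshold on `id_score` would then separate ID from OOD inputs. How that threshold is chosen is not stated in the abstract.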
121. 【2603.23672】Bio-Inspired Event-Based Visual Servoing for Ground Robots
Link: https://arxiv.org/abs/2603.23672
Authors: Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Biological sensory systems, Biological sensory, Dynamic Vision Sensor, inherently adaptive, filtering out constant
Comments:
Click to view abstract
Abstract:Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper presents a novel event-based visual servoing framework for ground robots. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state-feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.
122. 【2603.23669】Estimating Individual Tree Height and Species from UAV Imagery
Link: https://arxiv.org/abs/2603.23669
Authors: Jannik Endres, Etienne Laliberté, David Rolnick, Arthur Ouaknine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: major carbon sink, Unoccupied Aerial Vehicles, carbon sink, relies heavily, major carbon
Comments: Project page: [this https URL](https://RolnickLab.github.io/DINOvTree)
Click to view abstract
Abstract:Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree-level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high-resolution imagery from a single RGB camera offer a cost-effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH-Trees, the first benchmark for individual tree height and species estimation from tree-centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task-specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH-Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second-best approach.
123. 【2603.23650】Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge
Link: https://arxiv.org/abs/2603.23650
Authors: Masoumeh Chapariniya, Aref Farhadipour, Sarah Ebling, Volker Dellwo, Teodora Vukovic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: relative salience prediction, blended emotion recognition, BLEMORE Challenge, emotion recognition, blended emotion
Comments:
Click to view abstract
Abstract:We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and -- for the first time in emotion recognition -- Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6-12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold $\beta$ varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.
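Late probability fusion, as named in this abstract, usually means a weighted average of the class-probability vectors produced independently by each encoder. The sketch below illustrates that general idea only. The function name `late_fusion`, the toy probabilities, and the weights are hypothetical, and the abstract does not say how the ensemble weights were fitted.

```python
def late_fusion(prob_vectors, weights):
    """Weighted average of per-encoder class-probability vectors."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[i] for w, p in zip(weights, prob_vectors)) / total
        for i in range(n_classes)
    ]


# Toy example: two encoders, three emotion classes.
face_probs = [0.6, 0.3, 0.1]
audio_probs = [0.2, 0.5, 0.3]
fused = late_fusion([face_probs, audio_probs], weights=[0.75, 0.25])
print(fused)  # approximately [0.5, 0.35, 0.15]
```

Because the weights are normalized by their sum, the fused vector remains a valid probability distribution whenever each input is one.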
124. 【2603.23647】λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy
Link: https://arxiv.org/abs/2603.23647
Authors: Federico Carrara, Talley Lambert, Mehdi Seifi, Florian Jug
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: recover individual fluorophore, capture mixed fluorophore, mixed fluorophore emissions, individual fluorophore concentrations, individual fluorophore
Comments: 14 pages, 25 pages supplement, 16 figures total, 14 tables total
Click to view abstract
Abstract:In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescent microscopy data. Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.
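For context, the classical pixel-wise least-squares baseline that the abstract contrasts against can be sketched for the simplest case of two fluorophores. This illustrates the classical approach, not λSplit itself. It assumes a linear mixing model in which each pixel's spectrum is a concentration-weighted sum of known reference spectra, and the function name `unmix_two` and the toy spectra are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unmix_two(pixel, s1, s2):
    """Least-squares concentrations (c1, c2) with pixel ~ c1*s1 + c2*s2.

    Solves the 2x2 normal equations in closed form. Assumes a linear
    mixing model with known reference spectra s1 and s2.
    """
    a11, a12, a22 = dot(s1, s1), dot(s1, s2), dot(s2, s2)
    b1, b2 = dot(s1, pixel), dot(s2, pixel)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Nearly identical spectra make the system ill-conditioned.
        raise ValueError("spectra are (nearly) collinear; unmixing is ill-posed")
    c1 = (b1 * a22 - b2 * a12) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return c1, c2


# Toy example: 3 spectral channels, true concentrations (2, 3), no noise.
s1 = [0.7, 0.2, 0.1]
s2 = [0.1, 0.3, 0.6]
pixel = [2 * a + 3 * b for a, b in zip(s1, s2)]
print(unmix_two(pixel, s1, s2))  # approximately (2.0, 3.0)
```

The determinant shrinks as the two spectra become more alike, which is one way to see why this per-pixel fit degrades with overlapping emission spectra and noise.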
125. 【2603.23637】Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
Link: https://arxiv.org/abs/2603.23637
Authors: Peiyu Xu, Xin Sun, Krishna Mullia, Raymond Fei, Iliyan Georgiev, Shuang Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rigid pinhole camera, pinhole camera assumptions, remain slower due, limitations of rasterization, rigid pinhole
Comments:
Click to view abstract
Abstract:Ray-tracing-based 3D Gaussian splatting (3DGS) methods overcome the limitations of rasterization -- rigid pinhole camera assumptions, inaccurate shadows, and lack of native reflection or refraction -- but remain slower due to the cost of sorting all intersecting Gaussians along every ray. Moreover, existing ray-tracing methods still rely on rasterization-style approximations such as shadow mapping for relightable scenes, undermining the generality that ray tracing promises. We present a differentiable, sorting-free stochastic formulation for ray-traced 3DGS -- the first framework that uses stochastic ray tracing to both reconstruct and render standard and relightable 3DGS scenes. At its core is an unbiased Monte Carlo estimator for pixel-color gradients that evaluates only a small sampled subset of Gaussians per ray, bypassing the need for sorting. For standard 3DGS, our method matches the reconstruction quality and speed of rasterization-based 3DGS while substantially outperforming sorting-based ray tracing. For relightable 3DGS, the same stochastic estimator drives per-Gaussian shading with fully ray-traced shadow rays, delivering notably higher reconstruction fidelity than prior work.
126. 【2603.23627】Ukrainian Visual Word Sense Disambiguation Benchmark
Link: https://arxiv.org/abs/2603.23627
Authors: Yurii Laba, Yaryna Mohytych, Ivanna Rohulia, Halyna Kyryleyza, Hanna Dydyk-Meush, Oles Dobosevych, Rostyslav Hryniv
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Word Sense Disambiguation, Visual Word Sense, Sense Disambiguation, Visual Word, Word Sense
Comments:
Abstract:This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.
127. 【2603.23617】M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production
Link: https://arxiv.org/abs/2603.23617
Authors: Alexandre Symeonidis-Herzig, Jianhe Low, Ozge Mercanoglu Sincan, Richard Bowden
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: language production requires, hand motion generation, Sign language production, Finite Scalar Quantization, motion generation
Comments:
Abstract:Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.
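The Finite Scalar Quantization step named in the abstract above can be illustrated with a short sketch. It follows the general FSQ idea (bound each latent dimension, round it to a small fixed grid, and read the token id off the grid position) rather than the paper's actual tokenizer. The function names, the level counts, and the restriction to odd level counts are assumptions made for this example, and the straight-through gradient used during training is omitted.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent dimension to one of `levels[d]` integer grid values."""
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2        # assumes odd level counts in this sketch
        bounded = math.tanh(value) * half  # squash into the open interval (-half, half)
        codes.append(int(round(bounded)))  # snap to the nearest integer grid point
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to a single token id via mixed-radix encoding."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)  # shift code into 0..num_levels-1
    return index


# Hypothetical 3-dimensional latent with 5 * 5 * 3 = 75 possible tokens.
levels = [5, 5, 3]
codes = fsq_quantize([0.0, 10.0, -10.0], levels)  # -> [0, 2, -1]
token = fsq_index(codes, levels)                  # -> 42
```

Since the grid is fixed rather than learned, every one of the 75 codes is reachable by construction, which is why this family of quantizers is often used where learned codebooks collapse.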
128. 【2603.23607】LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
Link: https://arxiv.org/abs/2603.23607
Authors: Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser, Yinzhe Shen, Marlon Steiner, Dominik Strutz, Carlos Fernandez, Christian Kinzig, Guillermo S. Guitierrez-Cabello, Hendrik Königshof, Fabian Immel, Richard Schwarzkopf, Nils Alexander Rack, Kevin Rösch, Kaiwen Wang, Jan-Hendrik Pauls, Martin Lauer, Igor Gilitschenski, Holger Caesar, Christoph Stiller
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: rare scenarios remains, fundamental challenge, rare scenarios, scenarios remains, remains a fundamental
Comments: 21 pages
Abstract:In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: this https URL
129. 【2603.23559】CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
Link: https://arxiv.org/abs/2603.23559
Authors: Yuxi Chen, Haoyu Zhai, Chenkai Wang, Rui Yang, Lingming Zhang, Gang Wang, Huan Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: perceive raw screenshots, general GUI tasks, native vision-language models, general GUI, GUI tasks
Comments:
Abstract:GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. On the other hand, although specialized CAPTCHA solving pipelines exist, they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges, while preserving their performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. As CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. Across held-out test sets, ReCAP improves CAPTCHA-solving success from roughly 30% to 80%, while maintaining strong performance on general GUI-agent benchmarks.
130. 【2603.23521】Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages
Link: https://arxiv.org/abs/2603.23521
Authors: Shaharukh Khan, Ali Faraz, Abhinav Ravi, Mohd Nauman, Mohd Sarfraz, Akshat Patidar, Raja Kolla, Chandra Khatri, Shubham Agarwal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal research, single-image reasoning, research has predominantly, predominantly focused, focused on single-image
Comments: Accepted at "CVPR 2025: Workshop Vision Language Models For All"
Abstract:Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
131. 【2603.23511】DISCO: Document Intelligence Suite for COmparative Evaluation
Link: https://arxiv.org/abs/2603.23511
Authors: Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: intelligence requires accurate, Document intelligence requires, requires accurate text, accurate text extraction, Document Intelligence Suite
Comments: Accepted at the ICLR 2026 Workshop on Multimodal Intelligence (MMIntelligence). 10 pages, 7 figures
Abstract:Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.
132. 【2603.13528】Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis
Link: https://arxiv.org/abs/2603.13528
Authors: Dayou Li, Jiuzhou Lei, Hao Wang, Lulin Liu, Yunhao Yang, Zihan Wang, Bangya Liu, Minghui Zheng, Zhiwen Fan
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: advanced robotic manipulation, recent foundation models, significantly advanced robotic, execution errors, recent foundation
Comments:
Abstract:While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
133. 【2603.24176】Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic
Link: https://arxiv.org/abs/2603.24176
Authors: Wanying Qu, Jianxiong Gao, Wei Wang, Yanwei Fu
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Keywords: Capturing dynamic spatiotemporal, large-scale brain mechanisms, understanding large-scale brain, Capturing dynamic, spatiotemporal neural activity
Comments: CVPR 2026
Abstract:Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.
134. 【2603.24109】Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series
Link: https://arxiv.org/abs/2603.24109
Authors: Iris Dumeur (CB), Jérémy Anger (CB), Gabriele Facciolo (CB)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Image Time Series, Satellite Image Time, Multi-modal Satellite Image, Time Series, Satellite Image
Comments:
Abstract:Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis, that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and unalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multimodal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.
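The "dual-form" property mentioned in the abstract above (parallel training, recurrent inference) can be made concrete with a small sketch of a retention-style mechanism whose decay depends on real acquisition dates rather than sequence indices. This is one common way to write such a mechanism, and the paper's exact formulation may differ. The function names, the scalar decay `gamma`, and the sample dates are assumptions for illustration.

```python
def retention_parallel(q, k, v, times, gamma):
    """Parallel form: each output sums over all earlier tokens in one pass."""
    outputs = []
    for i in range(len(q)):
        out = [0.0] * len(v[0])
        for j in range(i + 1):
            decay = gamma ** (times[i] - times[j])  # decay set by the real time gap
            score = decay * sum(a * b for a, b in zip(q[i], k[j]))
            for d in range(len(out)):
                out[d] += score * v[j][d]
        outputs.append(out)
    return outputs


def retention_recurrent(q, k, v, times, gamma):
    """Recurrent form: a fixed-size state is updated once per new acquisition."""
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]
    outputs, prev_time = [], times[0]
    for qi, ki, vi, t in zip(q, k, v, times):
        decay = gamma ** (t - prev_time)  # elapsed time since the last acquisition
        for a in range(dk):
            for b in range(dv):
                state[a][b] = decay * state[a][b] + ki[a] * vi[b]
        outputs.append([sum(qi[a] * state[a][b] for a in range(dk))
                        for b in range(dv)])
        prev_time = t
    return outputs


# Hypothetical irregular acquisition dates (in days) and tiny feature vectors.
times = [0, 5, 6, 20]
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
k = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.9], [0.7, 0.2]]
v = [[1.0], [2.0], [3.0], [4.0]]
parallel_out = retention_parallel(q, k, v, times, gamma=0.9)
recurrent_out = retention_recurrent(q, k, v, times, gamma=0.9)
```

Both functions compute the same outputs up to floating-point error. The recurrent form only needs the stored state and the newest acquisition, which is what avoids reprocessing the whole sequence at each update.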
135. 【2603.23974】Machine vision with small numbers of detected photons per inference
Link: https://arxiv.org/abs/2603.23974
Authors: Shi-Yuan Ma, Jérémie Laydevant, Mandar M. Sohoni, Logan G. Wright, Tianyu Wang, Peter L. McMahon
Subjects: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Keywords: including object recognition, including object, scientific instruments, object recognition, central technology
Comments: 98 pages, 34 figures
Abstract:Machine vision, including object recognition and image reconstruction, is a central technology in many consumer devices and scientific instruments. The design of machine-vision systems has been revolutionized by the adoption of end-to-end optimization, in which the optical front end and the post-processing back end are jointly optimized. However, while machine vision currently works extremely well in moderate-light or bright-light situations -- where a camera may detect thousands of photons per pixel and billions of photons per frame -- it is far more challenging in very low-light situations. We introduce photon-aware neuromorphic sensing (PANS), an approach for end-to-end optimization in highly photon-starved scenarios. The training incorporates knowledge of the low photon budget and the stochastic nature of light detection when the average number of photons per pixel is near or less than 1. We report a proof-of-principle experimental demonstration in which we performed low-light image classification using PANS, achieving 73% (82%) accuracy on FashionMNIST with an average of only 4.9 (17) detected photons in total per inference, and 86% (97%) on MNIST with 8.6 (29) detected photons -- orders of magnitude more photon-efficient than conventional approaches. We also report simulation studies showing how PANS could be applied to other classification, event-detection, and image-reconstruction tasks. By taking into account the statistics of measurement results for non-classical states or alternative sensing hardware, PANS could in principle be adapted to enable high-accuracy results in quantum and other photon-starved setups.

