本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新589篇论文,其中:
- 自然语言处理86篇
- 信息检索16篇
- 计算机视觉109篇
自然语言处理
1. 【2605.23904】SkillOpt: Executive Strategy for Self-Evolving Agent Skills
链接:https://arxiv.org/abs/2605.23904
作者:Yifan Yang,Ziyang Gong,Weiquan Huang,Qihao Yang,Ziwei Zhou,Zisu Huang,Yan Li,Xuemei Gao,Qi Dai,Bei Liu,Kai Qiu,Yuqing Yang,Dongdong Chen,Xue Yang,Chong Luo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:loosely controlled self-revision, Agent skills today, Claude Code, today are hand-crafted, controlled self-revision
备注: 27 pages, 4 figures, 6 tables
点击查看摘要
Abstract:Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
2. 【2605.23897】ETCHR: Editing To Clarify and Harness Reasoning
链接:https://arxiv.org/abs/2605.23897
作者:Beichen Zhang,Yuhong Liu,Jinsong Li,Yuhang Zang,Jiaqi Wang,Dahua Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal Large Language, Large Language Models, Large Language, purely textual chain, Multimodal Large
备注: Code, model and data are open-sourced at [this https URL](https://github.com/InternLM/ETCHR)
点击查看摘要
Abstract:Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The ''think with images'' paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decouple it with an understanding model. However, off-the-shelf image editors fail as reasoning assistants with two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
3. 【2605.23885】Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
链接:https://arxiv.org/abs/2605.23885
作者:Anastasiia Sedova,Natalie Schluter,Skyler Seto,Maartje ter Hoeve
类目:Computation and Language (cs.CL)
关键词:building high-performing multilingual, Cross-lingual knowledge transfer, high-performing multilingual language, knowledge transfer, critical for building
备注:
点击查看摘要
Abstract:Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK - a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in high-resource part of pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
4. 【2605.23857】Strong Teacher Not Needed? On Distillation in LLM Pretraining
链接:https://arxiv.org/abs/2605.23857
作者:Taiming Lu,Zhuang Liu
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:distillation generally assumes, Knowledge distillation generally, generally assumes, distillation, stronger teachers yield
备注:
点击查看摘要
Abstract:Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
5. 【2605.23826】Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
链接:https://arxiv.org/abs/2605.23826
作者:Michal Shlapentokh-Rothman,Prachi Garg,Yu-Xiong Wang,Derek Hoiem
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:provide verifiable visual, verifiable visual evidence, long-video question answering, provide verifiable, evidence for long-video
备注:
点击查看摘要
Abstract:Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: an Large Language Model (LLM) based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at this https URL .
6. 【2605.23821】Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
链接:https://arxiv.org/abs/2605.23821
作者:Andres Nava,Matthieu Wyart
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:is-a relation, language representations, propose a distributional, distributional theory, relation between general
备注: 34 pages, 12 figures, including appendices
点击查看摘要
Abstract:We propose a distributional theory of how hypernymy -- the ``is-a'' relation between general and specific concepts -- is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a \emph{hierarchical splitting geometry} with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.
7. 【2605.23721】Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering
链接:https://arxiv.org/abs/2605.23721
作者:Mateusz Klimaszewski,Piotr Andruszkiewicz
类目:Computation and Language (cs.CL)
关键词:Classifier-based Quality Filtering, constructing pre-training corpora, Classifier-based Quality, Large Language Models, numerous Large Language
备注: Accepted to ACL 2026
点击查看摘要
Abstract:Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
8. 【2605.23715】NLG Evaluation: Past, Present, Future
链接:https://arxiv.org/abs/2605.23715
作者:Ehud Reiter
类目:Computation and Language (cs.CL)
关键词:Natural Language Generation, Natural Language, Language Generation, changed dramatically, NLG
备注: Will appear in Proceeedings of RetroEval 2026
点击查看摘要
Abstract:Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.
9. 【2605.23710】A graph-based analysis of semantic types and coercion in contextualized word embeddings
链接:https://arxiv.org/abs/2605.23710
作者:Long Chen,Deniz Ekin Yavas
类目:Computation and Language (cs.CL)
关键词:Neighbor Type Probability, Neighbor Type Entropy, context is central, type, Neighbor Type
备注:
点击查看摘要
Abstract:Semantic type mismatch between a noun and its context is central to coercion phenomena. This paper introduces a graph-based method to examine how lexical and contextual type information is reflected in word embeddings. We select nouns from ten semantic types, annotate corpus instances for type matching (matching vs. coercion vs. other mismatch vs. unrestricted), and construct graphs using BERT and sense-enhanced embeddings. Two metrics -- Neighbor Type Probability (NTP) and Neighbor Type Entropy (NTE) -- are proposed to analyze neighborhood type distributions. Results show that graphs constructed with sense-enhanced embeddings reflect semantic type information better, and matching and mismatch sentences can be distinguished through the proposed metrics.
10. 【2605.23701】Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks
链接:https://arxiv.org/abs/2605.23701
作者:Kan Shao
类目:Computation and Language (cs.CL)
关键词:Prior Dominance Score, Metadata Prior Dominance, benchmark outputs change, study a protocol-level, outputs change
备注: 5 pages, 1 figure, 1 table. Accepted at ICML 2026 Workshop on Hypothesis Testing
点击查看摘要
Abstract:We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, {\Delta}Evi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet {\Delta}Evi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only screening, evidence intervention, and reader-strength calibration together.
11. 【2605.23694】ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models
链接:https://arxiv.org/abs/2605.23694
作者:Fen Wang,Zekai Shao,Qiman Kang,Chunran Hu,Zhixuan Zhang,Lexu Xie,Chao Liu,Siming Chen
类目:Computation and Language (cs.CL)
关键词:cross-modal retrieval, essential for accessibility, assisting readers, readers in extracting, extracting insights
备注:
点击查看摘要
Abstract:Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.
12. 【2605.23668】OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations
链接:https://arxiv.org/abs/2605.23668
作者:Jiangwang Chen,Bowen Zhang,Zixin Song,Jiazheng Kang,Xiao Yang,Da Zhu,Guanjun Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:conversational systems process, remain fundamentally reactive, systems process millions, multi-turn dialogues daily, large language model
备注:
点击查看摘要
Abstract:Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency--quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22$\times$ compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at this https URL.
13. 【2605.23657】OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
链接:https://arxiv.org/abs/2605.23657
作者:Jiahao Ying,Boxian Ai,Wei Tang,Siyuan Liu,Yixin Cao
类目:Computation and Language (cs.CL)
关键词:structured workflow instructions, workflow instructions distilled, increasingly important mechanism, large language models, improving agent performance
备注:
点击查看摘要
Abstract:Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present \textsc{OpenSkillEval}, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, \textsc{OpenSkillEval} automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: this https URL.
14. 【2605.23651】How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework
链接:https://arxiv.org/abs/2605.23651
作者:Björn Nieth(1 and 4),Marianna Gracheva(2),Michaela Mahlberg(2 and 3),Bjoern Eskofier(1 and 3 and 5 and 6),Emmanuelle Salin(1) ((1) Department Artificial Intelligence in Biomedical Engineering (AIBE) FAU Erlangen-Nürnberg Germany, (2) Department of Digital Humanities and Social Studies (DHSS) FAU Erlangen-Nürnberg Germany, (3) University of Birmingham United Kingdom, (4) Chair of AI-supported Therapy Decisions LMU München Munich Germany, (5) Munich Center for Machine Learning (MCML) Munich Germany, (6) Institute of AI for Health Helmholtz Zentrum München Neuherberg Germany)
类目:Computation and Language (cs.CL)
关键词:Large Language Model, human-like generated texts, focus of Large, Large Language, long time
备注: 8.5 pages (main) + 31 pages appendix, 29 figures, 10 tables. Code and data: [this https URL](https://github.com/BjoernNieth/Register_Aware_LLMs)
点击查看摘要
Abstract:While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
15. 【2605.23618】Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
链接:https://arxiv.org/abs/2605.23618
作者:Stefano Cirillo,Domenico Desiato,Giuseppe Polese,Giandomenico Solimando
类目:Computation and Language (cs.CL)
关键词:benchmark Google Embeddings, Google Embeddings, explicit task-type conditioning, benchmark Google, context and explicit
备注: 9 pages, 2 figures, 5 tables. Text and evaluation code available at [this https URL](https://github.com/cciro94/GoogleEmbeddings2-benchmark)
点击查看摘要
Abstract:We benchmark Google Embeddings (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation considering 5 sizes of tokens with three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR this http URL@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency, it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
16. 【2605.23605】DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
链接:https://arxiv.org/abs/2605.23605
作者:Jean-Marie Lemercier,Tomas Geffner,Karsten Kreis,Morteza Mardani,Arash Vahdat,Ante Jukić
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:masked diffusion language, Diffusion language models, language models intrinsically, models intrinsically fail, Diffusion language
备注:
点击查看摘要
Abstract:Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.
17. 【2605.23597】Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts
链接:https://arxiv.org/abs/2605.23597
作者:Shivam Chourasia,Hitesh Kapoor,Nilesh Patil
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:culturally complex environments, heterogeneous records, core challenge, culturally complex, entity resolution
备注: Accepted to ACL 2026. 8 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world's largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.
18. 【2605.23497】Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering
链接:https://arxiv.org/abs/2605.23497
作者:Max Prior,Andreas Schultz,Matthias Grabmair
类目:Computation and Language (cs.CL)
关键词:Large language models, fixed training cutoffs, static parametric knowledge, Large language, fixed training
备注:
点击查看摘要
Abstract:Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via a fact date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
19. 【2605.23491】CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
链接:https://arxiv.org/abs/2605.23491
作者:Zhangyi Hu,Chenhui Liu,Tian Huang,Jindong Li,Yang Yang,Jiemin Wu,Zining Zhong,Menglin Yang,Yutao Yue
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Reinforcement Learning, Verifiable Rewards, Learning with Verifiable, advanced LLM code, Test-Time Scaling
备注: Code is available at: [this https URL](https://github.com/sanae-ai/CosPlay) | Data log is available at: [this https URL](https://huggingface.co/datasets/yomi017/CosPlay)
点击查看摘要
Abstract:Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
20. 【2605.23454】ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
链接:https://arxiv.org/abs/2605.23454
作者:Xiaoyuan Li,Keqin Bao,Moxin Li,Yubo Ma,Yichang Zhang,Wenjie Wang,Fuli Feng,Dayiheng Liu
类目:Computation and Language (cs.CL)
关键词:extend reinforcement learning, large language models, Rubric-based rewards offer, reinforcement learning, offer a promising
备注: Under Review
点击查看摘要
Abstract:Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manually constructed question sets, while fixed task-level rubrics may fail to capture the evaluation requirements of individual questions. We propose ARES (Automated Rubric synthEsis for Scalable RL), a framework for automatically constructing rubric-based RL data at scale. Starting from raw pretraining documents, ARES converts source knowledge into self-contained question-answer pairs and co-generates question-specific weighted rubrics, enabling instance-level reward supervision for open-ended responses. To improve diversity and quality, ARES conditions generation on domain labels and persona information, and applies validation filters for question self-containment, answer faithfulness, and rubric validity. Using ARES, we construct 100K rubric-annotated instances across ten domains. Experiments on seven benchmarks show that rubric-based RL trained with ARES, outperforms continual pretraining, supervised fine-tuning, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks such as healthcare and instruction following.
21. 【2605.23440】SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction
链接:https://arxiv.org/abs/2605.23440
作者:Jiawei He,Mengyu Shi,Chunrong Fang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Semantic Data Augmentation, Data augmentation, Toggle, Structured Semantic Data, data
备注: 12 pages, 3 figure
点击查看摘要
Abstract:Joint Entity and Relation Extraction (JERE) is highly susceptible to weak generalization due to low-quality training data. Data augmentation is a common strategy to enhance model generalization across different domains. However, existing data augmentation methods often overlook text relevance and may disrupt semantic structures and dependencies, making it difficult to generate effective augmented data for improving model generalization. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a novel method designed to preserve the semantic structure of text during augmentation. SSDAU segments text based on entity labels and employs an encoder to capture semantic features of entities through context awareness. It then performs entity semantic restructuring to generate augmented data. To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores. To mitigate potential topic ambiguity and information loss, we apply the BERTTopic model to filter out irrelevant topics, ensuring topic consistency. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines. Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness against ambiguity (8.26\% F1 decrease vs.\ 31.91\% for baselines), significantly outperforming all existing methods across all metrics.
Comments:
12 pages, 3 figure
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2605.23440 [cs.CL]
(or
arXiv:2605.23440v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.23440
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Jiawei He [view email] [v1]
Fri, 22 May 2026 09:52:43 UTC (195 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction, by Jiawei He and 2 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context:
cs.CL
prev
|
next
new
|
recent
| 2026-05
Change to browse by:
cs
cs.AI
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked="checked"class=“labs-tab-input”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
22. 【2605.23420】Naturalistic measure of social norms alignment
链接:https://arxiv.org/abs/2605.23420
作者:Yevhen Kostiuk,Kenneth Enevoldsen,Peter Bjerregaard Vahlstrup,Márton Kardos,Kristoffer Nielbo
类目:Computation and Language (cs.CL)
关键词:Social norms reflect, Social norms, social norms alignment, acceptable behavior, reflect shared expectations
备注:
点击查看摘要
Abstract:Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice questionnaires or measuring agreement with predefined statements. In the context of this work, social norms alignment refers to measuring an agreement between solutions with respect to the social problem or dilemma. We propose a framework for measuring social norm alignment in naturalistic, free-form settings through solution matching. The framework enables us to measure alignment between any two dilemma responses e.g., LLMs to a human, LLMs to LLMs, or human to human. We introduce two metrics: stated and explicit agreement accuracy, and construct a dataset of 3k non-trivial social dilemmas in Danish. All dilemmas are assigned reference solutions derived from three panelists, who serve as culturally grounded judges. We evaluate the agreement of several LLMs and human responses in an interaction setup that resembles natural user-model conversations. Our results show that the proposed metrics produce consistent model rankings and reveal variation in agreement across different types of dilemmas, with higher agreement observed for topics such as neighbor conflicts and shared living situations. Overall, our work introduces a dataset and evaluation framework for studying culturally grounded social reasoning in naturalistic open-ended conversations.
23. 【2605.23416】Articulatory strategy as a source of variation in acoustic vowel dynamics
链接:https://arxiv.org/abs/2605.23416
作者:Patrycja Strycharczuk,Justin J. H. Lo,Sam Kirkham
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:Acoustic vowel dynamics, Acoustic vowel, move their articulators, specific and practised, articulatory strategies
备注:
点击查看摘要
Abstract:Acoustic vowel dynamics have some speaker-identifying characteristics, which have been ascribed to individual properties of articulatory strategies: formant transitions have a particular shape because speakers move their articulators, using specific and practised movements. However, there is little existing evidence that different articulatory strategies systematically affect formant dynamics. The present study corroborates the link between the two. Ultrasound tongue imaging data from 36 speakers of Northern-Anglo English are used to identify distinct articulatory strategies for the production of palatal vowel /i/. Tongue shape in /i/ is found to be a significant predictor of formant dynamics in diphthongs with a palatal offglide. The observed relationships can be explained by the characteristics of articulatory movement conditioned by vocal tract shape. Greater articulatory displacement of tongue root and/or dorsum produces greater distortion from the mean tongue shape in palatal vowels, and it also requires higher articulatory velocities, resulting in relatively earlier and steeper formant transitions. The results contribute to the conceptual understanding of individuality in speech, by illuminating the regularising and individual aspects of articulatory compensation.
24. 【2605.23412】EquiSumm : A Gender Bias-Aware Framework for Inclusive Tweet Summarization
链接:https://arxiv.org/abs/2605.23412
作者:Chaitanya Wanjari,Jessica Kamal,Riddhi Jain,Samruddhi Kurhe,Roshni Chakraborty
类目:Computation and Language (cs.CL)
关键词:identify key viewpoints, large-scale opinion sharing, social media platforms, medium for large-scale, manually impossible
备注: Accepted at AI for Social Good Workshop, Pattern Recognition and Machine Intelligence (PReMI 2025), IIT Delhi. 6 pages, 2 figures
点击查看摘要
Abstract:While social media platforms, such as Twitter, provide a medium for large-scale opinion sharing during news events, it is manually impossible for individuals or media agencies to process the vast volume of content to identify key viewpoints. In order to resolve this, several automatic summarization techniques have been proposed to condense large collections of tweets into concise and informative summaries. However, these algorithms do not explicitly consider demographic fairness. Several existing research works have developed automated summarization approaches that can provide a holistic overview of the key aspects and major opinions shared on social media platforms related to a news event. However, these approaches do not explicitly consider different forms of demographic representation, such as gender, which can lead to biased summary representation. In this paper, we propose EquiSumm, which considers the gender aspect of the shared opinion to generate a summary, and our experimental analysis on two major datasets indicates the performance effectiveness with respect to existing research works.
25. 【2605.23384】Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
链接:https://arxiv.org/abs/2605.23384
作者:Sirui Chen,Lei Xu,Yuying Zhao,Yutian Chen,Yu Wang,Beier Zhu,Hanwang Zhang,Shengjie Zhao,Chaochao Lu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Recent RL methods, methods have substantially, substantially improved, Recent, reasoning
备注:
点击查看摘要
Abstract:Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.
26. 【2605.23382】From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning
链接:https://arxiv.org/abs/2605.23382
作者:Ranxu zhang,zeyang li,Jiacheng Huang,Rui Zhang,Xiaozhou Xu,sun zhe,Yanyong Zhang,Chao Wang
类目:Computation and Language (cs.CL)
关键词:clear success signals, Agentic reinforcement learning, achieved strong progress, success signals, progress in tasks
备注: 34 pages, 7 figures, Under Review
点击查看摘要
Abstract:Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is \emph{Personalized Anchor Reward-Decoupled Policy Optimization} (\textbf{PARPO}), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and \emph{Preference-Aligned Skill Evolution Graph Memory} (\textbf{PSGM}) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.
27. 【2605.23332】Cultural Adaptation in Large Language Models for Political Discourse
链接:https://arxiv.org/abs/2605.23332
作者:Wajdi Zaghouani
类目:Computation and Language (cs.CL)
关键词:large language models, introducing material risks, large language, language models, comparative research
备注:
点击查看摘要
Abstract:The integration of large language models into political discourse analysis creates new opportunities for comparative research, policy analysis, and civic technology, while introducing material risks for democratic accountability. This paper argues that cultural adaptation is a prerequisite for trustworthy deployment of large language models in political communication across diverse linguistic and institutional contexts. Current systems remain shaped by English dominant data, uneven multilingual coverage, and assumptions grounded in a narrow range of political institutions and discourse conventions, producing systematic errors when applied across cultures. We formalize cultural adaptation across translation, discourse, and ontology levels, identify recurring cultural failure modes in political NLP, and propose an operational evaluation matrix grounded in cultural fidelity, calibration, and democratic safety. Building on political text analysis, sociotechnical auditing, and cross cultural pragmatics, we outline methodological pathways including participatory dataset development, culturally aware transfer learning, and benchmark design that makes cultural adaptation empirically measurable. We conclude by clarifying governance constraints and scope conditions under which culturally adaptive political NLP can support democratic legitimacy.
28. 【2605.23328】Emotion Recognition in Sign Language Conversation
链接:https://arxiv.org/abs/2605.23328
作者:Yusong Wang,Keyu Mao,Takao Obi,Minghao Shao,Kotaro Funakoshi
类目:Computation and Language (cs.CL)
关键词:datasets primarily focus, affective computing, sign language, core component, component of affective
备注:
点击查看摘要
Abstract:Emotion Recognition in Conversation is a core component of affective computing, while current resources of sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances demonstrate degraded performance in real world scenarios because they cannot utilize historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre-training is a necessary next step for future research.
29. 【2605.23326】ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication
链接:https://arxiv.org/abs/2605.23326
作者:Wajdi Zaghouani,Md. Rafiul Biswas,Mabrouka Bessghaier,Shimaa Ibrahim,George Mikros
类目:Computation and Language (cs.CL)
关键词:public Facebook posts, climate change collected, Facebook posts, public Facebook, CrowdTangle platform
备注:
点击查看摘要
Abstract:We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper un-derstanding of how public engagement with climate issues develops across time, geography, and institutional contexts.
30. 【2605.23325】AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse
链接:https://arxiv.org/abs/2605.23325
作者:Esra'a Sharqawi,Wajdi Zaghouani
类目:Computation and Language (cs.CL)
关键词:shaping public narratives, hope speech, Arabic hope speech, armed conflicts, providing space
备注:
点击查看摘要
Abstract:Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied, expressions that promote resilience, solidarity, and optimism remain underexplored, particularly in Arabic contexts. This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech collected from ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. Using a detailed annotation framework, comments were classified into three categories: hope speech, no hope speech, and neutral or unclear discourse. The dataset shows that hopeful language dominates, accounting for more than sixty four percent of all comments. These expressions of hope appear mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. No hope speech, representing about thirteen percent, reflects despair and disillusionment, while the rest of the comments contain neutral or mixed content. Inter-Annotator Agreement reached substantial levels (Cohen's Kappa equals 0.71), though dialectal variation, sarcasm, and implicit meaning posed annotation challenges. A comparative analysis between human annotators and ChatGPT revealed that large language models can support annotation but remain limited in handling dialectal and culturally embedded expressions. AraHopeCorpus will be released for research purposes under an open and non commercial license. It provides a valuable resource for studying constructive digital discourse, enabling further research on hope speech detection, crisis communication, and resilience in Arabic social media.
31. 【2605.23315】Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
链接:https://arxiv.org/abs/2605.23315
作者:Muhammad Usama,Dong Eui Chang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Platonic Representation Hypothesis, develop increasingly similar, increasingly similar internal, Large language models, similar internal representations
备注:
点击查看摘要
Abstract:Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at this https URL.
32. 【2605.23278】When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
链接:https://arxiv.org/abs/2605.23278
作者:Francesco Corielli
类目:Computation and Language (cs.CL); Machine Learning (stat.ML)
关键词:Language models trained, marginal text-only law, marginal text-only, Language, conditional
备注:
点击查看摘要
Abstract:Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
Subjects:
Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as:
arXiv:2605.23278 [cs.CL]
(or
arXiv:2605.23278v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.23278
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Francesco Corielli [view email] [v1]
Fri, 22 May 2026 06:34:17 UTC (19 KB)
33. 【2605.23259】Multi-Gate Residuals
链接:https://arxiv.org/abs/2605.23259
作者:Zhizhan Zheng,Feiyun Zhang,Shuchun Liu,Tian Xia,Xi Liu,Dasheng Hu,Hongquan Zhou
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:deep residual layers, inevitably incurs significant, significant communication overhead, unbounded activation growth, incurs significant communication
备注:
点击查看摘要
Abstract:While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.
34. 【2605.23215】FastKernels: Benchmarking GPU Kernel Generation in Production
链接:https://arxiv.org/abs/2605.23215
作者:Gabriele Oliaro,Yichao Fu,May Jiang,Owen Lu,Junli Wang,Zhihao Jia,Hao Zhang,Samyam Rajbhandari
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:GPU kernel generation, advancing rapidly, generation are advancing, progress is fundamentally, fundamentally constrained
备注:
点击查看摘要
Abstract:LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94$\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$ -- confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at this https URL
35. 【2605.23193】CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening
链接:https://arxiv.org/abs/2605.23193
作者:Yiyang Wang,Moeiini Reilly,Britney Johnson,Kefei Yan,Alex Cabral,Josiah Hester
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
关键词:existing digital tools, overlooks gardeners' skills, provide generic advice, grounded gardening support, existing digital
备注: Preprint, 9 pages. Website: [this https URL](https://hello-diana.github.io/CultivAgents/)
点击查看摘要
Abstract:Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.
36. 【2605.23190】Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement
链接:https://arxiv.org/abs/2605.23190
作者:Chenwang Wu,Yiu-ming Cheung,Bo Han,Defu Lian
类目:Computation and Language (cs.CL)
关键词:Machine-generated texts, large language models, fully machine-generated texts, produced by large, raised serious concerns
备注:
点击查看摘要
Abstract:Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MGT detection. Existing paragraph-level detection methods commonly treat MGTs as entirely machine-like, overlooking the hidden human-like nature of machine-generated texts: even fully machine-generated texts may contain spans that are highly consistent with human writing. To this end, we first reveal the existence of such hidden human-like spans, and then theoretically analyze their impact on detection. Our analysis shows that these spans increase the sentence complexity for detection, thereby making MGT detection intrinsically harder. Based on this finding, we propose a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of hidden human-like spans. Specifically, we model span-level retention decisions as a latent-variable problem and instantiate the optimization with a hard-EM-inspired procedure, where the detector iteratively filters confidently human-like subsequences and refines itself on the remaining text. Extensive experiments across various LLMs and practical scenarios demonstrate that the proposed framework consistently enhances existing detectors. Notably, the framework can also work in a training-free manner, offering flexibility and scalability for practical deployment.
37. 【2605.23180】Self-Improving In-Context Learning
链接:https://arxiv.org/abs/2605.23180
作者:Baturay Saglam,Dionysis Kalogerias
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:fixed few-shot prompt, test time, optimizing the continuous, fixed few-shot, continuous embeddings
备注:
点击查看摘要
Abstract:We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs$\unicode{x2013}$available from a single forward pass without generating any tokens$\unicode{x2013}$provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
38. 【2605.23175】Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection
链接:https://arxiv.org/abs/2605.23175
作者:Kieu Dang,Phung Lai,NhatHai Phan,Yelong Shen,Ruoming Jin
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:Proprietary large language, causing financial setbacks, collecting input-output pairs, Proprietary large, large language models
备注:
点击查看摘要
Abstract:Proprietary large language models (LLMs) face risks of intellectual property (IP) violation, as adversaries can replicate an LLM by collecting input-output pairs to train a surrogate model, causing financial setbacks. Watermarks offer a promising defense to verify ownership, but existing methods often struggle with semantic distortion, factual inconsistency, and adversarial attacks. In addition, key-conditioned watermarks for provider-specific detection, especially in cross-provider and multi-user scenarios, remain largely underexplored. To address these challenges, we propose SAFESEAL, a novel key-conditioned watermarking framework that achieves strong detectability with minimal impact on model utility, effectively balancing detectability, utility, and robustness. SAFESEAL preserves named entities while substituting linguistic terms with context-aware synonyms through a key-conditioned Tournament sampling mechanism, maintaining semantic fidelity and factual consistency. For detection, we introduce a key-conditioned contrastive detector that jointly encodes the text and key, enabling provider-specific and robust watermark verification. We derive theoretical bounds on the utility-detectability trade-off and significantly reduce latency through lightweight models, batching, and parallelism. Extensive experiments show that SAFESEAL outperforms baselines in utility, detectability, and robustness, achieving a BERTScore of 0.983, entity similarity of 0.963, a 98.2% detection rate, and the highest human ratings for text quality and content preservation, with latency comparable to the fastest baseline. To promote transparency and community-driven progress, we release the first public watermark leaderboard and an interactive demo.
39. 【2605.23170】Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
链接:https://arxiv.org/abs/2605.23170
作者:Chuyifei Zhang,Hongyu Cui,Xiaowen Huang,Jitao Sang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Position-controlled evaluation, mainstream reasoning benchmarks, standard for retrieval, context length, Position-controlled
备注: 20 pages, 1 figure, 23 tables
点击查看摘要
Abstract:Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
40. 【2605.23165】Autonomous Frontier-Based Exploration with VLM Guidance
链接:https://arxiv.org/abs/2605.23165
作者:Aarush Aitha,Avideh Zakhor
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Autonomous robotic exploration, Vision-Language Models, Autonomous robotic, long-standing challenge, unknown and hazardous
备注: 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments
点击查看摘要
Abstract:Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Language Models (VLMs). We introduce a novel exploration pipeline where a VLM performs high-level strategic decision-making, guiding a conventional low-level robotics control stack. At decision points, the robot generates a multimodal prompt with its current map and visual imagery of potential paths, or frontiers. The VLM analyzes this prompt to select the most promising frontier, replacing simple geometric heuristics with contextual spatial reasoning. This approach, validated in simulation across six indoor environments, improves map coverage by up to 24\% over existing methods. Our pipeline is lightweight, training-free, and easily transferable to any robot with standard sensors and an internet connection.
41. 【2605.23163】Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
链接:https://arxiv.org/abs/2605.23163
作者:Kewei Zhang,Jin Wang,Sensen Gao,Chengyue Wu,Yulong Cao,Songyang Han,Boris Ivanovic,Langechuan Liu,Marco Pavone,Song Han,Daquan Zhou,Enze Xie
类目:Computation and Language (cs.CL)
关键词:efficient inference, precarious balance, balance between high-fidelity, high-fidelity trajectory planning, Scaffold Speculative Decoding
备注:
点击查看摘要
Abstract:End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$m (a $22\%$ improvement). When integrated with SGLang, our framework delivers $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
42. 【2605.23158】What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference
链接:https://arxiv.org/abs/2605.23158
作者:Mingyuan Fan,Yu Liu,Fuyi Wang,Cen Chen
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, devices remains challenging, resource-constrained devices remains, reduce computational burden, split inference
备注: Accepted to ACM CCS'26
点击查看摘要
Abstract:The deployment of large language models (LLMs) on resource-constrained devices remains challenging, spurring interest in split inference, where models are partitioned between client and server to reduce computational burden and enhance privacy by transmitting only intermediate activations. However, the privacy-preserving capabilities of split inference, particularly in the context of LLMs, have not been exhaustively investigated. To fill this gap, we introduce ActInv, which solves an intermediate activation matching problem to reconstruct the client's input. Extensive evaluations demonstrate that ActInv achieves high-fidelity reconstructions, even in the presence of common perturbation-based defenses such as Gaussian noise injection and activation sparsification. To systematically understand this vulnerability, we develop Perturbation Amplification Factor (PAF), a metric for quantifying a layer's inherent resistance to reconstruction. Our analysis reveals that privacy vulnerability is not uniform across layers, with some layers being highly susceptible to leakage while others offer natural resistance. Furthermore, we demonstrate that defense effectiveness can be significantly improved by calibrating perturbation directions to maximize reconstruction error during backpropagation. Building on these insights, we design PriPert and conduct comprehensive evaluations, covering privacy, utility, and computational overhead, to demonstrate its effectiveness.
43. 【2605.23157】Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs
链接:https://arxiv.org/abs/2605.23157
作者:Casey Ford,Madison Van Doren,Sicheng Jin,Emily Dix
类目:Computation and Language (cs.CL)
关键词:Pixtral Large, Claude Sonnet, mechanistic structure, multimodal large language, overtakes Pixtral Large
备注:
点击查看摘要
Abstract:The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.
44. 【2605.23148】When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
链接:https://arxiv.org/abs/2605.23148
作者:Jianfeng Zhu,Megan Korhummel,Ruoming Jin,Karin G. Coifman
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:outpaces clinician-delivered assessment, care outpaces clinician-delivered, health care outpaces, mental health care, clinician-delivered assessment
备注: 25 pages 7 figures
点击查看摘要
Abstract:As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder, and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted outputs away. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
45. 【2605.23147】As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs
链接:https://arxiv.org/abs/2605.23147
作者:Eric Xu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:clean linear decomposition, mid layer band, generated tokens, Delta, linear decomposition
备注: 12 pages, 1 figure. Code: [this https URL](https://github.com/xuy/localized-additive-composition)
点击查看摘要
Abstract:Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-\{1.5B, 3B\}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. \emph{We show it cannot.} Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
Comments:
12 pages, 1 figure. Code: this https URL
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2605.23147 [cs.CL]
(or
arXiv:2605.23147v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.23147
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
46. 【2605.23103】A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works
链接:https://arxiv.org/abs/2605.23103
作者:Queenie Luo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
关键词:Classical Chinese wenji, fine-tuned BERT classifier, closely confusable preface, Chinese wenji table, Classical Chinese
备注:
点击查看摘要
Abstract:I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I've deployed the model on Hugging Face and has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.
47. 【2605.23093】A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses
链接:https://arxiv.org/abs/2605.23093
作者:Yan Jiang,Sihong Liu,Philip A. Fisher
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:psychology increasingly spans, applied psychology increasingly, open-ended survey responses, Structural Topic Models, methodological traditions
备注:
点击查看摘要
Abstract:Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.
48. 【2605.23078】GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
链接:https://arxiv.org/abs/2605.23078
作者:Jianing Deng,Song Wang,Dongwei Wang,Zijie Liu,Tianlong Chen,Huanrui Yang,Jingtong Hu
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, achieve strong performance, massive expert parameters
备注: ICML 2026
点击查看摘要
Abstract:Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at this https URL .
49. 【2605.23071】he Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management
链接:https://arxiv.org/abs/2605.23071
作者:Binqi Shen,Lier Jin,Hanyu Cai,Lan Hu,Yuting Xin
类目:Computation and Language (cs.CL)
关键词:Large language models, windows introduces substantial, introduces substantial computational, expanding context windows, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated using performance and efficiency metrics independently, limiting systematic comparison and deployment-aware decision-making. This paper introduces The Efficiency Frontier, a unified framework for cost-performance optimization in LLM context management. The framework models context strategy selection as a deployment-aware optimization problem that jointly accounts for task performance, token cost, and preprocessing reuse through amortized cost modeling. Unlike existing evaluations that compare methods in isolation, the proposed framework enables decision-oriented analysis of when different context management strategies become preferable under varying operational conditions. Evaluated on 5,000 HotpotQA instances, the framework reveals distinct operational regimes and transition boundaries between retrieval-based and preprocessing-based strategies. Results show that deployment-aware optimization reduces effective token usage by approximately 25% at comparable performance ($F1 \approx 0.78$), while amortized memory compression achieves over 50% lower token cost relative to full-context prompting in higher-performance settings. Overall, the proposed framework provides a principled and practical foundation for evaluating and deploying scalable, efficient, and sustainable LLM systems.
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2605.23071 [cs.CL]
(or
arXiv:2605.23071v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.23071
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
50. 【2605.23069】DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
链接:https://arxiv.org/abs/2605.23069
作者:Yusser Al Ghussin,Daniil Gurgurov,Yasser Hamidullah,Josef van Genabith,Cristina España-Bonet,Simon Ostermann
类目:Computation and Language (cs.CL)
关键词:Large language models, knowledge remains uneven, cultural knowledge remains, Large language, diverse linguistic
备注: Accepted to The 20th International Workshop on Semantic Evaluation at ACL 2026
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive, vary substantially across language-region pairs, with some configurations even degrading performance, and interact with prompt formulation, comparing generic and culturally conditioned prompts. Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference.
51. 【2605.23067】What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
链接:https://arxiv.org/abs/2605.23067
作者:Xinjie He,Zhiyuan Lin,Su Liu,Jialun Wu,Qiyang Xie,Weikai Zhou,Shuai Xiao
类目:Computation and Language (cs.CL)
关键词:training LLM agents, LLM agents, multi-session dialogue, training LLM, external memory banks
备注: 14 pages, 2 figures, 11 tables. Code, checkpoints, and evaluation artifacts available at [this https URL](https://github.com/EvaxHe/rl-memory-curriculum)
点击查看摘要
Abstract:Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
52. 【2605.23057】ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
链接:https://arxiv.org/abs/2605.23057
作者:Aman Sunesh,Ali Alshehhi,Hivansh Dhakne
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Performance (cs.PF)
关键词:improving single-GPU large, single-GPU large language, large language model, improving single-GPU, single-GPU large
备注: 10 pages main text, 11 pages including references, 5 figures, 3 tables. Preprint
点击查看摘要
Abstract:ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving configuration, the system selects among FP16, quantized modes, speculative decoding, and hybrid modes such as GPTQ plus prefix caching and INT8 plus continuous batching using cheap workload-level features. We evaluate ModeSwitch-LLM on Meta-Llama-3.1-8B-Instruct served on a single NVIDIA A100 GPU. On deployment-style synthetic workloads, the online controller achieves a 2.10x mean latency speedup over FP16 and a 0.48x mean energy ratio, corresponding to 51.7% lower energy per token. On automatic benchmarks used as a quality gate, accuracy remains close to FP16 with a mean delta of +0.17 percentage points. We also evaluate lightweight learned routers, but find that they do not clearly outperform the rule-based controller because they add routing overhead and more often select modes that violate quality, energy, or memory constraints. These results show that simple request-aware routing can recover substantial efficiency from existing inference modes without retraining the model or changing its architecture.
53. 【2605.23055】Decomposing and Measuring Evaluation Awareness
链接:https://arxiv.org/abs/2605.23055
作者:Changling Li,Terry Jingchen Zhang,Jie Zhang,Zhijing Jin,Sahar Abdelnabi,Maksym Andriushchenko
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Frontier language models, evaluated and adjust, Frontier language, evaluation, recognition
备注:
点击查看摘要
Abstract:Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose \textbf{EvalAwareBench}, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.
54. 【2605.23054】Model Collapse as Cultural Evolution
链接:https://arxiv.org/abs/2605.23054
作者:Dongxin Guo,Jikun Wu,Siu Ming Yiu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:structures degrade, progressive degradation, characterized statistically, statistically but lacks, lacks a linguistic
备注: Accepted at CoNLL 2026. 18 pages, 3 figures, 2 tables
点击查看摘要
Abstract:Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguistic explanation for which structures degrade, in what order, and why. We show that iterated learning theory from cultural evolution fills this gap. We derive five falsifiable predictions, distinguish those uniquely discriminative for the theory from confirmatory ones, and test them by self-training LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish. The critical discriminative finding: compositionality follows a non-monotonic trajectory (initially rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale evidence for the compression-communication tradeoff. All predictions are confirmed with large effect sizes (Hedges' $g 1.6$; $\mathrm{BF}_{10} 100$), and LLM regularization gradients closely match human behavioral data ($R^2 = 0.94$). These results reframe model collapse as a cultural transmission phenomenon and yield concrete principles for self-training pipeline design.
55. 【2605.23052】DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods
链接:https://arxiv.org/abs/2605.23052
作者:Maryia Zhyrko,Daisy Monika Lal,Erik van Mulligen,Lifeng Han
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:social media timelines, shared task, present DreamerNLplus, task, social media
备注: Accepted by CLPsych2026. CLPsych 2026 will be held at ACL in San Diego July 4th, 2026
点击查看摘要
Abstract:We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking \textbf{2nd} officially. Our RAG-based method achieves strong performance in Task 3.2, ranking \textbf{1st} for Improvement and \textbf{3rd} for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at this https URL
Comments:
Accepted by CLPsych2026. CLPsych 2026 will be held at ACL in San Diego July 4th, 2026
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2605.23052 [cs.CL]
(or
arXiv:2605.23052v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.23052
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
56. 【2605.23043】HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation
链接:https://arxiv.org/abs/2605.23043
作者:Zewei Deng,Tinghan Ye,Liyan Xie
类目:Computation and Language (cs.CL); Machine Learning (stat.ML)
关键词:Agentic text-simulation systems, Agentic text-simulation, text-simulation systems write, text-simulation systems, Agentic
备注: 10 pages, 4 figures, Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems
点击查看摘要
Abstract:Agentic text-simulation systems write in sequence, with each item becoming possible context for later steps. That makes uncertainty path-dependent: an early ambiguity can affect later outputs. This paper studies this problem with HawkesLLM, a framework that separates temporal influence modeling from text generation. We represent the cascade as a network whose nodes are text-generating agents. A multivariate Hawkes process models how these nodes activate over time and which earlier node outputs should influence later prompts. A language model then writes each new event from the compact memory selected by this temporal model. We evaluate the framework on a held-out Global Database of Events, Language, and Tone (GDELT) news-cascade case study. The diagnostics track semantic alignment with local held-out references and separate local drift from global drift. In this setting, HawkesLLM improves late-stage semantic alignment under a compact prompt-memory budget.
57. 【2605.23039】Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs
链接:https://arxiv.org/abs/2605.23039
作者:Dongxin Guo,Jikun Wu,Siu Ming Yiu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:learners acquire knowledge, Construction Grammar proposes, Construction Grammar, Grammar proposes statistical, proposes statistical preemption
备注: Accepted at CoNLL 2026. 21 pages (9 main body + appendices and references); 4 figures, 14 tables
点击查看摘要
Abstract:How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*donated the library the books"). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb-construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ($r = 0.79$), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.
58. 【2605.23036】Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
链接:https://arxiv.org/abs/2605.23036
作者:Yusser Al Ghussin,Daniil Gurgurov,Tanja Baeumel,Josef van Genabith,Patrick Schramowski,Simon Ostermann
类目:Computation and Language (cs.CL)
关键词:enable feature-level mechanistic, Sparse autoencoders, feature-level mechanistic interpretability, trained on English-only, control remains unreliable
备注: Accepted to TrustNLP Workshop at ACL 2026
点击查看摘要
Abstract:Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an \emph{a priori} steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.
59. 【2605.23035】Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
链接:https://arxiv.org/abs/2605.23035
作者:Dongxin Guo,Jikun Wu,Siu Ming Yiu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
关键词:remains mechanistically unexplained, large language models, Intermediate layers, computational neurolinguistics, mechanistically unexplained
备注: Accepted at CoNLL 2026. 20 pages (9 main + 1 limitations/acknowledgments + 3 references + 7 appendix), 5 figures, 20 tables
点击查看摘要
Abstract:Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by bridging sparse autoencoders (SAEs) from mechanistic interpretability with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. A human-validated taxonomy ($\kappa \geq 0.74$) reveals that semantic features alone recover 94% of peak encoding performance ($r=0.285$), substantially exceeding variance-matched baselines ($p0.001$, $d=1.31$). Beyond this aggregate dominance, we test a novel cortical topography prediction: five semantic subcategories derived a priori from three independent neuroscience programs should map onto distinct brain regions. A formal convergence test confirms this alignment (Spearman $\rho=0.72$, $p0.001$; hypergeometric $p=0.007$), demonstrating that SAE-discovered features recapitulate known cortical semantic organization at a granularity inaccessible to prior methods. SAE features further predict human reading times beyond lexical controls ($\Delta\mathrm{logLik}=38.4$, $p0.001$), and an exploratory prediction-error analysis provides preliminary evidence that the brain additionally encodes unexpected semantic content. Results generalize across English, Chinese, and French.
60. 【2605.23032】Brain-LLM Alignment Tracks Training Data, Not Typology
链接:https://arxiv.org/abs/2605.23032
作者:Dongxin Guo,Jikun Wu,Siu Ming Yiu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
关键词:brain language network, language network, network is neuroanatomically, neuroanatomically universal, Petit Prince corpus
备注: Accepted to CoNLL 2026. 9 pages main content + 4 pages references + 6 pages appendix; 4 figures, 13 tables
点击查看摘要
Abstract:Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures. Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show $2.3\times$ steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for $\sim$60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent "English advantage" in brain-LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.
61. 【2605.23028】RADAR: Relative Angular Divergence Across Representations
链接:https://arxiv.org/abs/2605.23028
作者:Xavier Cadet,Mateusz Nowak,Peter Chin
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Machine learning methods, learning methods rely, Machine learning, learning methods, methods rely
备注: 27 pages; 8 figures; 10 tables
点击查看摘要
Abstract:Machine learning methods rely on data. However, gathering suitable data can be challenging due to availability constraints, cost, or the need for domain expertise. Expanding datasets with additional sources is a common response to limited data, yet this practice does not always improve downstream performance and can sometimes lead to a loss of performance, known as negative transfer. We propose RADAR, a simple, geometrically grounded metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-wise evolution of representations by measuring angular alignments and relative changes in distance along layer-to-layer displacement trajectories, and by comparing empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric across multiple modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across several settings, RADAR provides competitive predictive performance relative to existing transferability metrics on several vision and text benchmarks, with particularly strong results when domain transitions are smooth or cleanly separated. Our ablations further suggest that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space, with different modalities favoring different topological formulations.
62. 【2605.23024】he Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
链接:https://arxiv.org/abs/2605.23024
作者:Dongxin Guo
类目:Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Free Lunch theorems, draft legal documents, produce clinical notes, Turing and Arrow, Free Lunch
备注: PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms
点击查看摘要
Abstract:Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.
63. 【2605.22993】A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism
链接:https://arxiv.org/abs/2605.22993
作者:Chuanbo Hu,Minglei Yin,Bin Liu,Wenqi Li,Lynn K. Paul,Shuo Wang,Xin Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Social Language Disorder, Characteristic linguistic behaviors, autism spectrum disorder, including echoic repetition, stereotyped media quoting
备注:
点击查看摘要
Abstract:Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6% higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.
64. 【2605.22981】Memorization Dynamics of Fill-in-the-Middle Pretraining
链接:https://arxiv.org/abs/2605.22981
作者:Tobias von Arx,Tanguy Dieudonné
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:equip causal language, causal language models, infilling ability, pretraining objective widely, equip causal
备注: MemFM @ ICML 2026
点击查看摘要
Abstract:Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.
65. 【2605.22978】A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text
链接:https://arxiv.org/abs/2605.22978
作者:George Mikros,Fotios Fitsilis
类目:Computation and Language (cs.CL)
关键词:remains poorly served, Universal Dependencies-style parsing, Katharevousa parliamentary questions, Katharevousa Greek remains, importance for legal
备注: 12 pages, 1 figure, 2 tables; companion to the kathnlp open-source release at [this https URL](https://github.com/gmikros/katharevousa-nlp-tooling)
点击查看摘要
Abstract:Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1{,}697 sentences, split into 1{,}357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline -- code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports -- is released as an open-access companion to this paper.
66. 【2605.22975】When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance
链接:https://arxiv.org/abs/2605.22975
作者:Brett Israelsen,Sheryl Carty,Josh Coates,Nancy Fulda,Julie Park,Pete Whiting
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:treat queries, religious conversion symmetrically, conversion symmetrically, large language models, large language
备注: 29 pages, 16 figures
点击查看摘要
Abstract:We ask whether large language models (LLMs) treat queries about religious conversion symmetrically. The answer is no. When asked for advice on hypothetical faith transitions from one religion to another, then asked the reversed question, models exhibited consistent asymmetries, favoring some religions while subtly discouraging conversion to others. On average Catholic, Bahá'í, and Sikh religions were broadly favored (high support for joining, low support for leaving), while Atheists, Agnostics, and Jehovah's Witnesses were primarily disfavored. Patterns varied by model size and model provider, with Grok 4.20 exhibiting the strongest asymmetries. We tested 20 commercial and open-source language models across 182 religion pairings using a human-verified LLM-as-a-judge framework. Each model was probed via interactions with a simulated user asking for advice on a potential faith conversion. Models tended to use more encouraging language for some faith transitions over others; these patterns were systematically repeatable across multiple trials. All LLMs tested exhibited reproducible asymmetry, though the pattern of preferences differed for each. Overall preferences persist across multiple question phrasings and variations in the religious pairing dataset. Taken together, these results suggest that asymmetry is a robust property of model behavior rather than an artifact of how the models' answers were scored. It is important to consider that any imbalances deployed and reproduced en masse can have real-world implications.
Comments:
29 pages, 16 figures
Subjects:
Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:
arXiv:2605.22975 [cs.CL]
(or
arXiv:2605.22975v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.22975
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
67. 【2605.22971】Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs
链接:https://arxiv.org/abs/2605.22971
作者:Ko Watanabe,Shoya Ishimaru
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:organizational productivity losses, Employees often struggle, struggle to identify, leading to organizational, productivity losses
备注:
点击查看摘要
Abstract:Employees often struggle to identify ``who knows what,'' leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.
68. 【2605.22963】Graph Alignment Topology as an Inductive Bias for Grounding Detection
链接:https://arxiv.org/abs/2605.22963
作者:Paul Landes,Pranav Herur,Adam Cross,Jimeng Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, produce distributionally plausible, distributionally plausible continuations, source documents
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.
69. 【2605.22939】Learnability-Informed Fine-Tuning of Diffusion Language Models
链接:https://arxiv.org/abs/2605.22939
作者:Shubham Parashar,Atharv Chagi,Jacob Helwig,Lakshmi Jotsna,Sushil Vemuri,James Caverlee,Dileep Kalathil,Shuiwang Ji
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:diffusion language models, aim to improve, language models, SFT, tokens
备注:
点击查看摘要
Abstract:We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at this https URL.
70. 【2605.22937】RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation
链接:https://arxiv.org/abs/2605.22937
作者:Minseok Jung,Abhas Ricky,Muhammad Rameez Chatni
类目:Computation and Language (cs.CL)
关键词:generation remains underexplored, code generation remains, structured query generation, query code generation, remains underexplored
备注:
点击查看摘要
Abstract:Inference-time scaling can reduce errors in structured query generation, but methods to allocate the compute for query code generation remains underexplored. We study Text2Cypher, where language models generate Cypher queries that execute against property graph databases. Non-executable queries constitute a distinct syntactic failure separate from semantic inaccuracy: a syntax error triggers a system-generated error message from the database. These error messages are typically discarded at inference time rather than leveraged through in-context learning (ICL). We compare two inference methods: Independent Scaling (IS), which performs memoryless resampling, and Reflection-Augmented Scaling (RAS), which conditions each new attempt on prior execution feedback via ICL. Across three Neo4j datasets and five code-specialized language models, RAS reduces the Query Execution Error Rate by 41--50% at n{=}5, outperforming IS at 32--38%. Execution errors are not merely failures to discard but actionable feedback, and structuring inference-time compute around them is a more efficient path to executability than scaling independent samples.
71. 【2605.22923】AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2605.22923
作者:Tom Verhoeff
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Large language models, Large language, explicit knowledge source, lecture notes, answer questions
备注: 19 pages, 3 figures
点击查看摘要
Abstract:Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
72. 【2605.22905】EVE-Agent: Evidence-Verifiable Self-Evolving Agents
链接:https://arxiv.org/abs/2605.22905
作者:Yamato Arai,Yuma Ichikawa
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:self-evolving search agents, search agents, Self-evolving, agents, evidence
备注: 23 pages, 2 figures
点击查看摘要
Abstract:Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer--solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
73. 【2605.22903】Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
链接:https://arxiv.org/abs/2605.22903
作者:Zixuan Lan,Luzhe Sun,Matthew R. Walter,Jiawei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:reflect grounded visual, grounded visual understanding, reflect grounded, reflect reliance, implicitly assumed
备注: Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. accepted version
点击查看摘要
Abstract:Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, spanning global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence that standard accuracy should have suggested. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. We further complement a representation-level analysis, which shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.
74. 【2605.22902】ranscoders Trace Visual Grounding and Hallucinations in Vision-Language Models
链接:https://arxiv.org/abs/2605.22902
作者:Dimitrios Damianos,Leon Voukoutis,Georgios Skyrianos,Vassilis Katsouros,Georgios Paraskevopoulos
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:remains poorly understood, text remains poorly, Generative Vision-Language Models, poorly understood, Generative Vision-Language
备注:
点击查看摘要
Abstract:Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language this http URL, we perform a structural analysis of hallucinated generations, by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at AUC $0.68$. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.
75. 【2605.22885】ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
链接:https://arxiv.org/abs/2605.22885
作者:Riyaz Ahuja,Tate Rowney,Jeremy Avigad,Sean Welleck
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
关键词:Formal mathematics libraries, refactor verified proofs, training data quality, rapidly expanding, creating a growing
备注:
点击查看摘要
Abstract:Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.
76. 【2605.22880】How Far Will They Go? Red-Teaming Online Influence with Large Language Models
链接:https://arxiv.org/abs/2605.22880
作者:Daniel C. Ruiz,Anna Serbina,Ashwin Rao,Emilio Ferrara,Luca Luceri
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:based agents increasingly, agents increasingly participate, large language model, based agents, online discourse
备注: 30 pages, 8 figures, submitted to COLM 2026
点击查看摘要
Abstract:As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
77. 【2605.22878】SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
链接:https://arxiv.org/abs/2605.22878
作者:Shuofei Qiao,Yunxiang Wei,Jiazheng Fan,Bin Wu,Busheng Zhang,Mengru Wang,Yuqi Zhu,Ningyu Zhang,Keyan Ding,Qiang Zhang,Huajun Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:deep interdisciplinary integration, organization impedes deep, impedes deep interdisciplinary, unstructured knowledge organization, knowledge organization impedes
备注: Ongoing Work
点击查看摘要
Abstract:The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented ``information explosion,'' where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucinations and consuming high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, and a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective ``cognitive map'' to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.
78. 【2605.22873】When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
链接:https://arxiv.org/abs/2605.22873
作者:Wei Xia,Haoqing Wang,Zhi-Hong Deng,Yehui Tang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:enhancing LLM capabilities, fundamental question, strategy for enhancing, application raises, raises a fundamental
备注:
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a \emph{dynamic decoding state} that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose \textbf{EDRM} (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves \textbf{41--55\%} token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to \textbf{4.7\%} while maintaining \textbf{27--45\%} token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.
79. 【2605.22870】he Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
链接:https://arxiv.org/abs/2605.22870
作者:Ming Liu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:small language models, preserves most performance, small language, shuffling its steps, steps preserves
备注: 18 pages (8 main + 10 appendix), 3 figures, 5 tables
点击查看摘要
Abstract:Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.
80. 【2605.22855】PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
链接:https://arxiv.org/abs/2605.22855
作者:Yingjie Lei
类目:Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:profitable decision making, Personalized pricing negotiations, guarantee profitable decision, Personalized pricing, challenging testbed
备注: 24 pages, 3 figures, 5 tables. Code is available at [this https URL](https://github.com/ChaosTheProducer/PrefBench)
点击查看摘要
Abstract:Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.
81. 【2605.22843】Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model
链接:https://arxiv.org/abs/2605.22843
作者:Tianhao Qiu,Xiaojun Chen
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:executable SQL queries, enabling non-technical users, access relational databases, intelligent data services, converts natural language
备注: 17ages, 5 figures
点击查看摘要
Abstract:Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated \texttt{question, SQL} pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that are not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs task-specific knowledge base including schema semantics, abbreviations, business logic, and query patterns, and injects them into both training and inference. This framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models in Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.
82. 【2605.22834】Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion
链接:https://arxiv.org/abs/2605.22834
作者:Mudit Rastogi
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:systems depend critically, Retrieval-Augmented Generation, retrieving relevant context, systems depend, document chunking quality
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.
83. 【2605.22828】A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development
链接:https://arxiv.org/abs/2605.22828
作者:Mahounan Pericles Adjovi,Victor Olufemi,Roald Eiselen,Prasenjit Mitra
类目:Computation and Language (cs.CL)
关键词:West African languages, approximately 80-100 million, West African, 80-100 million speakers, people in Benin
备注: 8 pages, 7 tables; survey paper; to appear in IEEE SDS 2026
点击查看摘要
Abstract:This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo language spoken by approximately 2 million people in Benin. These languages represent contrasting cases on the resource availability spectrum. We address the question: \textit{What is the current state of publicly available NLP resources for Hausa and Fongbe, and what gaps remain?} Through systematic search of academic repositories, data platforms, and web sources, we catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource, we document size, domain coverage, format, licensing, and accessibility. Our findings reveal that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains. Fongbe, while having more limited text resources, has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. We provide task-specific recommendations and identify priority gaps including domain-diverse Fongbe text and dedicated Hausa speech corpora.
84. 【2605.22826】Evaluating Large Language Models in a Complex Hidden Role Game
链接:https://arxiv.org/abs/2605.22826
作者:Niklas Bauer
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
关键词:Large Language Models, Large Language, potential of Large, game Secret Hitler, uncontrolled environments
备注: Master's thesis, University of Göttingen
点击查看摘要
Abstract:Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of LLMs within the social deduction game Secret Hitler. I introduce an open-source framework and novel metrics to measure performance: Role Identification Accuracy, Deception Retention Rate, and Game State Impact Rate. By benchmarking models against rule-based algorithms and human games, I identify a gap between conversational ability and strategic depth. The study also analyzes the impact of reasoning-enhancement techniques on win rates and strategic reasoning. Neither Chain-of-Thought prompting nor internal memory bring improvements in performance, with up to 23.2% worse win rates for fascist roles. While rule-based agents align with expert human voting decisions 86.7% of the time, models like Llama 3.1 70B achieve only a 59.7% accuracy. Models playing as Fascists consistently yield negative impact scores and fail to sustain deception, resulting in roughly 40% shorter games compared to humans. These findings suggest that current architectures remain ineffective at complex, multi-turn manipulation. As capabilities advance, detecting when models begin to master these deceptive behaviors is crucial. The developed framework serves as a reproducible testbed for future alignment research.
85. 【2605.20192】Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token
链接:https://arxiv.org/abs/2605.20192
作者:Xintong Wu,Peiting Tsai,Jing Yuan,Michael Yu,Greg Sun,Luyao Zhang
类目:Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Computational Finance (q-fin.CP)
关键词:expanding Metaverse ecosystem, native MANA token, reality platform operating, Metaverse ecosystem, expanding Metaverse
备注:
点击查看摘要
Abstract:Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.
86. 【2605.22841】Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test
链接:https://arxiv.org/abs/2605.22841
作者:Rommin Adl,Peyton Williams
类目:Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); General Economics (econ.GN)
关键词:strongest alliance member, alliance member pressures, Arctic strategic control, pressures a weaker, strategic control
备注: 78 pages, 17 figures, 18 tables. Multi-agent LLM simulation recovering structural utility parameters across 8 frontier models in the Greenland sovereignty crisis. v3: typo pass, fixes phantom action names (REQUEST_MULTILATERAL, INDEPENDENT) and a Blunden date mismatch. v2 added Section V safety findings (legitimacy-laundered escalation, signal decoupling) and Appendix H
点击查看摘要
Abstract:What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta) for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.
信息检索
1. 【2605.23702】ubiFM: Unified Item, Carousel, and Search Ranking for Streaming Discovery
链接:https://arxiv.org/abs/2605.23702
作者:Alexandre Salle,Chenglei Niu,Suchismit Mahapatra,Xiaoxiao Chen,Suvash Sedhain,Yaqi Wang,Shervin Shahryari,Saurabh Agrawal,Qiang Chen,Michael Tamir
类目:Information Retrieval (cs.IR)
关键词:expose complementary signals, queries reveal intent, watches shape carousel, search queries reveal, tasks expose complementary
备注:
点击查看摘要
Abstract:Personalized discovery systems often train separate models for item ranking, carousel ranking, and search, even though these tasks expose complementary signals from the same viewer journey: watches shape carousel and item ranking, search queries reveal intent even when they do not lead to a catalog match, and watch history helps interpret search as rewatching, continuation, or new discovery. We introduce the user story, a serialized representation that turns a user's cross-surface history - attributes, sessions, watch events with surface and carousel context, and search events - into a single token sequence. By interleaving pretrained language tokens with domain-specific event tokens, user stories let heterogeneous recommendation and search tasks be expressed as prompted next-token prediction over a shared grammar. TubiFM is one instantiation of this approach: a Llama 3.2 1B-based model trained on user stories and prompted to rank items, carousels, or search results without task-specific architectures. In offline evaluation, this single model outperforms specialist baselines across item, carousel, and search ranking. In online A/B tests, TubiFM significantly improves search total viewing time (TVT) by $+3.9\%$ and carousel TVT by $+0.30\%$. Item ranking is statistically neutral on TVT ($+0.14\%$), but matches a mature production stack; across all three tasks, TubiFM serves on L40S GPUs and reduces p99 ranking latency from 500ms to 200ms. These results show that shared user stories can improve discovery while simplifying ranking systems.
2. 【2605.23684】Synthetic Sources?: Auditing Generative Search Engine Citations for Evidence of AI-Generated Sources
链接:https://arxiv.org/abs/2605.23684
作者:Mowafak Allaham,Nicholas Diakopoulos
类目:Information Retrieval (cs.IR); Computers and Society (cs.CY)
关键词:Large Language Models, Generative Search Engines, Language Models, Generative Search, Search Engines
备注: 11 pages + Appendix
点击查看摘要
Abstract:The growing accessibility of Large Language Models via conversational interfaces capable of responding to users' questions by drawing on, synthesizing, and citing information from the web (i.e., Generative Search Engines) has simplified the information-seeking process for users. However, with the proliferation of AI-generated content on the web, it is unclear whether these engines can reliably omit citing synthetic sources (i.e., AI-generated sources). Should these engines be unable to do so, this puts users at risk of harm by treating information from AI-generated sources synthesized in responses of generative search engines as equivalent to information from authoritative or official sources. In a step towards identifying whether AI-generated sources are being cited by these engines, this work presents an audit of four generative search engines (ChatGPT, Copilot, Gemini, Perplexity) using a total of 712 real-world human-generated queries spanning domains of public importance: politics, health, and the environment. Our findings show evidence of AI-generated sources being cited across all four generative search engines (~16% of cited sources) and identifies key source web domains these sources belong to that are frequently cited across these engines and topics. In addition, we observed that generative search engines include a somewhat narrow set of repeatedly cited domains while predominantly surfacing a large number of minimally cited domains in responses to users' queries. These findings contribute to the growing body of work on assessing the risks of generative search engines with the objective of increasing public awareness of their limitations and encouraging appropriate measures to improve information quality and governance of these systems.
3. 【2605.23586】racking a Decade of Research at the University of Nigeria, Nsukka: A Scientometric Analysis (2014-2023)
链接:https://arxiv.org/abs/2605.23586
作者:Muneer Ahmad,Joseph U Igligli
类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)
关键词:employs scientometric methods, university research, study employs scientometric, University, University of Nigeria
备注: 16 pages, 4 figures, Research Article
点击查看摘要
Abstract:This study employs scientometric methods to assess the research output and performance of the University of Nigeria from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university's research productivity, impact, and disciplinary focus. These research endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Nigeria a significant hub for cutting-edge research in Nigeria and beyond. The present study has been undertaken to determine the impact of the university's research and publication trends from 2014 to 2023. The study focuses on year-wise research output, citation impact at local and global levels, prominent authors and their total output, top journals, collaborating countries, and the most contributing departments of the University of Nigeria. The university's ten years of publication data indicate that 6,353 papers were published from 2014 to 2023, receiving 86,202 citations with an h-index of 39. In addition to this, the stenographical mapping of data is presented through graphs using the VOSviewer software mapping technique. The findings of this study will contribute to understanding the university's research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Nigeria
4. 【2605.23572】HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval
链接:https://arxiv.org/abs/2605.23572
作者:Vipul Gupta,Shikhar Mohan,Lakshya Kumar,Pranjal Chitale,Nikit Begwani,Amit Singh,Manik Varma
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:balancing retrieval quality, Small Language Models, critical challenge, competitive landscape, Small Language
备注: 9 pages, 3 figures, 10 tables
点击查看摘要
Abstract:In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
5. 【2605.23556】Is Dimensionality a Barrier for Retrieval Models?
链接:https://arxiv.org/abs/2605.23556
作者:Kiril Bangachev,Guy Bresler,Jonathan Kogan,Yury Polyanskiy
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR); Combinatorics (math.CO)
关键词:prevent modern embedding-based, modern embedding-based retrieval, embedding-based retrieval models, mathsf, embedding-based retrieval
备注:
点击查看摘要
Abstract:Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m0,$ denoted by $\mathsf{m}^{\mathsf{rd}}(d, A),$ for which there exist unit norm embeddings of the queries and documents $\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$ with the following property. $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A),$ can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$ which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k)).$ Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and demonstrate a clear advantage of the sigmoid loss.
6. 【2605.23398】PMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
链接:https://arxiv.org/abs/2605.23398
作者:Lingling Fu,Yongfu Xu
类目:Information Retrieval (cs.IR)
关键词:Direct Preference Optimization, large language model, language model alignment, model alignment due, Direct Preference
备注: 11 pages,6 figures
点击查看摘要
Abstract:Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
Comments:
11 pages,6 figures
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2605.23398 [cs.IR]
(or
arXiv:2605.23398v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2605.23398
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
7. 【2605.23312】owards Generalizable and Efficient Large-Scale Generative Recommenders
链接:https://arxiv.org/abs/2605.23312
作者:Qiuling Xu,Ko-Jen Hsiao,Moumita Bhattacharya
类目:Information Retrieval (cs.IR)
关键词:sequences of events, events and provide, provide a shared, Generative recommendation models, shared backbone
备注: first published under netflix tech blog [this https URL](https://netflixtechblog.medium.com/towards-generalizable-and-efficient-large-scale-generative-recommenders-a7db648aa257)
点击查看摘要
Abstract:Generative recommendation models can model user behavior as sequences of events and provide a shared backbone for multiple recommendation tasks. In production, however, pre-training gains do not automatically translate into downstream application improvements: task headroom, repeated-training cost, serving latency, and item freshness all affect transfer. We describe our experience scaling a generative recommender from 2M to 1B backbone parameters, excluding embedding and decoding layers, in a production-scale title recommendation setting. Across multiple downstream tasks, we observe task-dependent scaling behavior: some tasks approach an empirical ceiling within the observed scale range, while others continue to benefit from additional capacity. This motivates using offset scaling-law fits as a diagnostic for where additional model scale may be more or less useful. We then study production constraints that arise when applying the model in practice. Frequent retraining over trillions of behavior tokens makes training and decoding efficiency important; cached serving can make the immediate next-token target stale; and newly launched titles may need to be scored from semantic metadata before collaborative ID embeddings are reliable. We address these issues with multi-token prediction for serving-latency alignment, sampled softmax and a projected decoding head for efficient repeated training, and semantic item towers with collaborative-embedding masking for cold-start adaptation. In a one-week production-shadow evaluation over 1M users, the 1B-backbone model achieves higher MRR than the 2M-backbone baseline across all reported tasks. Overall, the results support treating model scale as one component of a production transfer problem, alongside task headroom, decoding cost, serving-latency alignment, and item generalization.
Comments:
first published under netflix tech blog this https URL
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2605.23312 [cs.IR]
(or
arXiv:2605.23312v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2605.23312
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
8. 【2605.23310】From Head to Tail: Asymmetric Knowledge Transfer in Long-tail Recommendation with Generative Semantic IDs
链接:https://arxiv.org/abs/2605.23310
作者:Chenyi Yan,Ruocong Tang,Xing Fang,Yang Huang,He Guo,Jing Wang
类目:Information Retrieval (cs.IR)
关键词:severe data imbalance, remains challenging due, e-commerce platforms remains, platforms remains challenging, data imbalance
备注: 5 pages, 1 figure
点击查看摘要
Abstract:Long-tail recommendation in real-world e-commerce platforms remains challenging due to severe data imbalance. Existing methods often struggle to combine content-based multimodal features with collaborative signals. Many of these methods also ignore an important asymmetry in knowledge transfer between head and tail IDs: noisy signals from tail IDs can hurt representation learning for head IDs. This paper presents AKT-Rec, a framework for Asymmetric Knowledge Transfer in long-tail Recommendation that uses LLM-generated semantic IDs. AKT-Rec uses Multimodal LLMs (MLLMs) with supervised fine-tuning to align content representations with collaborative information for both items and users, producing semantic representations. It then discretizes these representations into semantic IDs with a Residual-Quantized VAE (RQ-VAE), which yields semantic clusters of similar entities. AKT-Rec has two main components: (1) Cluster-Guided Adaptive Embedding, which decomposes each ID representation into a cluster-level embedding that captures shared semantics and an individual embedding. Through an asymmetric contrastive objective and an activity-aware gating mechanism, this module directs knowledge transfer from head to tail IDs. (2) Hierarchical Feature Aggregation, which builds parallel feature views and adaptively fuses them to optimize predictions for samples with varying activity levels. Extensive experiments on a large-scale industrial dataset and online A/B testing on the Alibaba Tmall platform demonstrate the effectiveness of AKT-Rec. AKT-Rec improves offline performance by 0.35% in AUC and 1.53% in GAUC, outperforming several competitive baselines. In online A/B testing, AKT-Rec achieves a 2.76% increase in CTR and a 3.47% increase in GMV, validating its utility in real-world production environments.
9. 【2605.23191】Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation
链接:https://arxiv.org/abs/2605.23191
作者:Guoming Li,Shangyu Zhang,Junwei Pan,Wentao Ning,Jin Chen,Gengsheng Xue,Chao Zhou,Shudong Huang,Haijie Gu,Menglin Yang
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR); Numerical Analysis (math.NA)
关键词:recommender systems, central challenge, challenge in recommender, Scaling recommendation models, token mixing
备注: Accepted at the 32st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Research Track), KDD 2026 February Cycle
点击查看摘要
Abstract:Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from \textit{embedding collapse}, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a \textbf{damped oscillatory trajectory} in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) \textbf{parameterized full mixing}, which enables expressive token mixing with improved spectral robustness; and (ii) \textbf{GLU-improved P-FFNs}, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: this https URL
10. 【2605.22924】Building a privacy-preserving Federated Recommender system for mobile devices
链接:https://arxiv.org/abs/2605.22924
作者:Aasheesh Singh
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:Serving personalized content, traditionally required pooling, modern privacy expectations, required pooling sensitive, Serving personalized
备注: [this http URL](http://M.Sc) . thesis, Université de Montréal, Department of Computer Science and Operations Research, 2024
点击查看摘要
Abstract:Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present a two-stage federated recommendation system pipeline for mobile devices, built around a principled separation between non-sensitive user preference data and sensitive mobile context data that never leaves the device. The first stage runs a collaborative filtering model on non-sensitive app-context data in the cloud to generate a shortlist of relevant items. The second stage re-ranks these candidates on-device using sensitive mobile signals, with only model updates/gradients ever leaving the device. We validate the approach on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and deliver a production-ready implementation as a Kotlin Multiplatform library deployable on Android and iOS.
11. 【2605.22923】AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2605.22923
作者:Tom Verhoeff
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Large language models, Large language, explicit knowledge source, lecture notes, answer questions
备注: 19 pages, 3 figures
点击查看摘要
Abstract:Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
12. 【2605.22878】SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
链接:https://arxiv.org/abs/2605.22878
作者:Shuofei Qiao,Yunxiang Wei,Jiazheng Fan,Bin Wu,Busheng Zhang,Mengru Wang,Yuqi Zhu,Ningyu Zhang,Keyan Ding,Qiang Zhang,Huajun Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:deep interdisciplinary integration, organization impedes deep, impedes deep interdisciplinary, unstructured knowledge organization, knowledge organization impedes
备注: Ongoing Work
点击查看摘要
Abstract:The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented ``information explosion,'' where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucinations and consuming high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, and a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective ``cognitive map'' to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.
13. 【2605.22843】Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model
链接:https://arxiv.org/abs/2605.22843
作者:Tianhao Qiu,Xiaojun Chen
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:executable SQL queries, enabling non-technical users, access relational databases, intelligent data services, converts natural language
备注: 17ages, 5 figures
点击查看摘要
Abstract:Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated \texttt{question, SQL} pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that are not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs task-specific knowledge base including schema semantics, abbreviations, business logic, and query patterns, and injects them into both training and inference. This framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models in Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.
14. 【2605.22834】Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion
链接:https://arxiv.org/abs/2605.22834
作者:Mudit Rastogi
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:systems depend critically, Retrieval-Augmented Generation, retrieving relevant context, systems depend, document chunking quality
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.
15. 【2605.22833】RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis
链接:https://arxiv.org/abs/2605.22833
作者:Daqian Shi,Pei Han,Jishizhan Chen,Yang Wang,Xiaolei Diao,Xianyou Zheng,Pengfei Cheng
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:high recurrence risk, osteomyelitis presents substantial, presents substantial prognostic, postoperative recovery trajectories, complex postoperative recovery
备注:
点击查看摘要
Abstract:Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets. In this work, we propose RAG4Outcome, a retrieval-augmented generation (RAG) framework for prognostic prediction in chronic osteomyelitis. Our method integrates multimodal clinical data, including PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, into a unified prediction pipeline. By combining a domain-specific retrieval corpus with expert-guided prompting, the framework enables more interpretable, evidence-grounded, and clinically reliable prognosis. Preliminary results on real-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI-assisted infection management and postoperative decision support.
16. 【2605.22829】LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
链接:https://arxiv.org/abs/2605.22829
作者:Yifan Zhu,Yu Mi,Yue Lu,Yanchu Guan,Zhixuan Chu
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Large Language Models, enhancing Large Language, Language Models, Large Language, enhancing Large
备注:
点击查看摘要
Abstract:Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimodal RAG systems predominantly rely on coarse-grained page-level retrieval, which fails to capture fine-grained semantic and layout structures in visually rich documents, thereby compromising retrieval accuracy and leading to redundant context in downstream tasks. To address these issues, we propose Layout-oriented Fine-grained Retrieval-Augmented Generation (LFRAG), a novel framework that advances multimodal RAG from page-level to block-level retrieval. We perform layout segmentation to construct semantically coherent fine-grained retrieval units and design a semantic-layout fusion encoder that integrates local semantics with global context via cross-attention. With block-level late interaction retrieval, LFRAG enables precise query-content alignment and reduces irrelevant content for downstream generation. To enable rigorous evaluation, we construct LFDocQA, a large-scale benchmark with block-level annotations spanning diverse document types, designed to assess both multimodal document retrieval and question answering with greater granularity than existing datasets. Extensive experiments on LFDocQA demonstrate that LFRAG achieves state-of-the-art performance on retrieval tasks, outperforms the best baseline by 7.20% in answer accuracy, and reduces token consumption by 73.07% in generation tasks, confirming LFRAG as an accurate and efficient framework for multimodal RAG over visually rich documents. Our code and datasets will be released soon.
计算机视觉
1. 【2605.23903】Geo-Align: Video Generation Alignment via Metric Geometry Reward
链接:https://arxiv.org/abs/2605.23903
作者:Zizun Li,Haoyu Guo,Runzhe Teng,Chunhua Shen,Tong He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, recent years, generation has achieved, achieved remarkable, remarkable progress
备注:
点击查看摘要
Abstract:Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.
2. 【2605.23902】PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
链接:https://arxiv.org/abs/2605.23902
作者:Yifan Lu,Qi Wu,Jay Zhangjie Wu,Zian Wang,Huan Ling,Sanja Fidler,Xuanchi Ren
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generated latents back, perform generation, maps the generated, times, decoder maps
备注: Project Page: [this https URL](https://research.nvidia.com/labs/sil/projects/pid/)
点击查看摘要
Abstract:Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of $512 \times 512$ images into $2048 \times 2048$ pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about $6\times$ faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
3. 【2605.23897】ETCHR: Editing To Clarify and Harness Reasoning
链接:https://arxiv.org/abs/2605.23897
作者:Beichen Zhang,Yuhong Liu,Jinsong Li,Yuhang Zang,Jiaqi Wang,Dahua Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal Large Language, Large Language Models, Large Language, purely textual chain, Multimodal Large
备注: Code, model and data are open-sourced at [this https URL](https://github.com/InternLM/ETCHR)
点击查看摘要
Abstract:Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The ''think with images'' paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decouple it with an understanding model. However, off-the-shelf image editors fail as reasoning assistants with two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
4. 【2605.23895】From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain
链接:https://arxiv.org/abs/2605.23895
作者:Yuval Golbari,Navve Wasserman,Matias Cosarinsky,Roman Beliy,Aude Oliva,Antonio Torralba,Michal Irani,Tamar Rott Shaham
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:challenge in neuroscience, central challenge, identifying regions, concept, Identifying
备注:
点击查看摘要
Abstract:Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse functional regions (e.g., faces, places) through activation maximization, identifying regions that activate strongly for a target concept relative to other concepts. Yet strong activation alone does not establish that a region represents the concept itself, as responses may instead be driven by correlated visual or semantic cues. We introduce BrainCause, an automated framework that combines generative and brain models to synthesize controlled stimuli and validate neural representations through targeted causal testing. Given a query specifying a concept of interest, our framework constructs targeted stimulus sets comprising concept images, counterfactual edits that remove the target concept while preserving other image content, and images with candidate correlated distractors. It then uses an image-to-fMRI encoding model to predict brain responses and searches for representations that respond specifically to the target concept over correlated alternatives. BrainCause returns validated candidate representations and proposes follow-up fMRI experiments to further test or extend its discoveries. Our approach successfully recovers known functional localizations and identifies new candidate representations across dozens of concepts, validated on both predicted and measured fMRI data. Critically, we show that without causal validation, a large fraction of localizations would be false positives, confirming that activation alone is insufficient evidence of representation.
5. 【2605.23892】Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
链接:https://arxiv.org/abs/2605.23892
作者:Shuhong Zheng,Michael Oechsle,Erik Sandström,Marie-Julie Rakotosaona,Federico Tombari,Igor Gilitschenski
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:enabling joint prediction, architectures for multi-view, enabling joint, prediction of multiple, feed-forward manner
备注: Project Page: [this https URL](https://zsh2000.github.io/good-token-hunting.github.io) , Code: [this https URL](https://github.com/zsh2000/gotohunt)
点击查看摘要
Abstract:Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints that how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at this https URL.
6. 【2605.23891】Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework
链接:https://arxiv.org/abs/2605.23891
作者:Xiao Cao,Yansong Qu,Xiangzhen,Chang,Wen Xiao,Jiakui Hu,Heyuan Li,Jialun Liu,Zhiyong Huang,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:textbf, Mask-free video object, requiring harmonious integration, Mask-free video, Mask-free
备注:
点击查看摘要
Abstract:Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing methods struggle when references exhibit severe stylistic domain gaps with the source scene. To overcome this, we propose \textit{\textbf{Smart-Insertion-V}}, an end-to-end \textbf{Dual-Stream} framework that concurrently conducts video insertion and image style transfer. Within this framework, the image stream synchronously guides the video generation process, while a \textbf{Closed-loop Feedback} mechanism is further incorporated to ensure robust insertion. Inevitably, integrating these diverse conditioning signals results in feature entanglement and style leakage. To tackle this issue, we design \textbf{Dual-World-View RoPE} to distinguish different signals via spatial-temporal offsets without incurring heavy training overhead. Furthermore, to facilitate spatial grounding and stylistic adaptation, we introduce a \textbf{Decoupled Guidance Module} that leverages a Vision-Language Model for semantic reasoning while preserving original temporal guidance with native text encoder. To bridge data gap for harmonious reference insertion task, we propose a data curation pipeline and will release an \textbf{open-source dataset}. Experiments demonstrate that our method can insert objects into plausible positions while achieving the most harmonious results.
7. 【2605.23889】HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
链接:https://arxiv.org/abs/2605.23889
作者:Chong Cheng,Peilin Tao,Nanjie Yao,Guanzhi Ding,Xianda Chen,Yuansen Du,Xiaoyang Guo,Wei Yin,Weiqiang Ren,Qian Zhang,Zhengqing Chen,Hao Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requires estimating camera, reconstruction requires estimating, estimating camera pose, bounded-memory constraints, requires estimating
备注:
点击查看摘要
Abstract:Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an \emph{evidence influence kernel} and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000\ frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: this https URL
8. 【2605.23888】GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
链接:https://arxiv.org/abs/2605.23888
作者:Katharina Schmid,Nicolas von Lützow,Jozef Hladký,Angela Dai,Matthias Nießner
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:tightly couples reconstruction, multi-view RGB images, RGB images, multi-view RGB, tightly couples
备注: Project page: [this https URL](https://kasothaphie.github.io/GenRecon/)
点击查看摘要
Abstract:We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.
9. 【2605.23883】PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
链接:https://arxiv.org/abs/2605.23883
作者:Rim Assouel,Amir Bar,Michal Drozdzal,Adriana Romero-Soriano
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
备注:
点击查看摘要
Abstract:Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generate additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively address the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.
10. 【2605.23878】LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation
链接:https://arxiv.org/abs/2605.23878
作者:Bo Jiang,Depu Meng,Yihan Hu,Yichen Xie,Tianshuo Xu,Wei Zhan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generators produce visually, produce visually compelling, visually compelling clips, reliable world simulators, video generators produce
备注: Project Page: [this https URL](https://lamo-ai.github.io/)
点击查看摘要
Abstract:Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.
11. 【2605.23868】Vision Transformers Need Better Token Interaction
链接:https://arxiv.org/abs/2605.23868
作者:Linxiang Su
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Transformers, learn strong image-level, prolonged training, strong image-level representations, learn strong
备注: 7 pages
点击查看摘要
Abstract:Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize \emph{semantic diffusion}: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and \texttt{[CLS]} features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
12. 【2605.23861】Leveraging Foundation Models for Causal Generative Modeling
链接:https://arxiv.org/abs/2605.23861
作者:Aneesh Komanduri,Xintao Wu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Causal generative modeling, pretrained foundation models, modeling is essential, essential for developing, developing reliable
备注:
点击查看摘要
Abstract:Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.
13. 【2605.23845】Learning a Particle Dynamics Model with Real-world Videos
链接:https://arxiv.org/abs/2605.23845
作者:Chanho Kim,Suhas V. Sumukh,Li Fuxin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Data-driven learning approaches, traditional physics simulators, physics simulators due, physics simulation, differentiable nature
备注: CVPR 2026 Findings
点击查看摘要
Abstract:Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.
14. 【2605.23840】MuellerPT: Decomposition Driven Pretraining for Dense Learning in Mueller Polarimetry
链接:https://arxiv.org/abs/2605.23840
作者:Adam Tlemsani,Yingdian Li,Maxime Giot,Naim Slim,Christopher J. Peters,Abhijeet Ghosh,Daniel S. Elson
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:physically meaningful contrast, scarce dense annotations, biomedical tissue analysis, Multispectral Animal Polarimetric, Animal Polarimetric Organ
备注: Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)
点击查看摘要
Abstract:Mueller matrix imaging provides rich, physically meaningful contrast for biomedical tissue analysis, but supervised learning is hindered by scarce dense annotations and strong domain shifts across specimens and acquisition settings. We introduce MuellerPT, a physics guided pre-training approach that learns transferable dense representations by predicting Lu-Chipman decomposition maps from per-pixel 4x4 Mueller matrices. To scale pre-training, we collected a new large Multispectral Animal Polarimetric Organ dataset (MAP-Org). The pre-trained encoder is adapted with a segmentation head for grey vs. white matter segmentation in lamb brain. A classification head is used for colorectal cancer vs. non-cancer classification. Both segmentation and classification are evaluated across few-shot learning scenarios. In segmentation, MuellerPT improves label efficiency and cross specimen transfer compared to models without pre-training, achieving an absolute DICE gain of over 20% compared to the baseline trained from scratch when using 5% of the training data. In classification, MuellerPT also enhances label efficiency, improving overall accuracy by 8% compared to the baseline when using 1% of the training data. We demonstrate MuellerPT's robustness to domain shift with a qualitative evaluation of its predicted Lu-Chipman maps on an ex vivo human oesophagus sample. These results suggest that predicting Lu-Chipman decomposition is an effective and practical pretext task for robust biomedical inference from Mueller polarimetry and can pave the way for future work on label efficient Mueller imaging.
15. 【2605.23826】Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
链接:https://arxiv.org/abs/2605.23826
作者:Michal Shlapentokh-Rothman,Prachi Garg,Yu-Xiong Wang,Derek Hoiem
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:provide verifiable visual, verifiable visual evidence, long-video question answering, provide verifiable, evidence for long-video
备注:
点击查看摘要
Abstract:Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: an Large Language Model (LLM) based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at this https URL .
16. 【2605.23819】Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot
链接:https://arxiv.org/abs/2605.23819
作者:Jorge Chang Ortega,Bastien Le Lan,Thomas Serre,Victor Boutin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:human-like visual representations, central question, question in computational, visual representations, generative learning
备注:
点击查看摘要
Abstract:A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
17. 【2605.23797】Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models
链接:https://arxiv.org/abs/2605.23797
作者:Bo Peng,Jie Lu,Guangquan Zhang,Zhen Fang
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:machine learning models, identifying unexpected inputs, Aiming at identifying, OOD detection, OOD
备注: KDD 2026
点击查看摘要
Abstract:Aiming at identifying unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities between ID labels and negative labels, i.e., those semantically different from ID labels. Due to the unavailability of target OOD labels, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negatives labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that the debiased negative mining can be naturally converted into Monte-Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments empirically manifest that our method establishes a new state-of-the-art in a variety of OOD detection setups. Code is publicly available at \href{this https URL}{\textcolor{red}{here}}.
18. 【2605.23790】Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model
链接:https://arxiv.org/abs/2605.23790
作者:Romaric Mazna,Jean Martinet,Sai Deepesh Pokala
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Saliency, Event-based Saliency, extensively studied, images and videos, event-based
备注:
点击查看摘要
Abstract:Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.
19. 【2605.23777】Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset
链接:https://arxiv.org/abs/2605.23777
作者:FB Pena,D Crabi,Sandro C Izidoro,Érick O Rodrigues,G Bernardes
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:manual procedure performed, performed by gemologists, manual procedure, procedure performed, stone
备注:
点击查看摘要
Abstract:The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones, where those are visually inspected by specialists that decide which one of the available reference stone is the most similar to the inspected stone. This procedure is very subjective as different specialists may end up with different grading choices. This work proposes a complete framework that entails the image acquisition and goes up to the final stone categorization. The proposal is able to automate the entire process apart from including the stone in the created chamber for the image acquisition. It discards the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% of accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the used dataset that contains 192 images of emerald stones along with their extracted and pre-processed features.
20. 【2605.23775】A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques
链接:https://arxiv.org/abs/2605.23775
作者:João VC Mazzochin,Giovani Bernardes Vitor,Gustavo Tiecker,Elioenai MF Diniz,Gilson A Oliveira,Marcelo Trentin,Érick O Rodrigues
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:wood traffic monitoring, precise wood log, Generative Adversarial Networks, wood log counting, Conditional Generative Adversarial
备注:
点击查看摘要
Abstract:This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713s per image on an NVIDIA T4 GPU, making it suitable for realtime applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology's success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors.
21. 【2605.23771】PhotoFlow: Agentic 3D Virtual Photography Missions
链接:https://arxiv.org/abs/2605.23771
作者:Jiarui Guo,Haojia Wei,Yiming Zhang,Yifei Liu,Yuning Gong,Hongjie Zhang,Xue Yang,Zhihang Zhong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
关键词:preselected camera pose, enter a prepared, reference image, infer a suitable, language intent
备注:
点击查看摘要
Abstract:Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
22. 【2605.23747】Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox
链接:https://arxiv.org/abs/2605.23747
作者:Allan Kazakov,Duygu Cakir,Hilal Kurt İrfanoğlu,Yavuz İrfanoğlu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:physical surface properties, requiring physicochemical understanding, physicochemical understanding distinct, Dense Material Segmentation, Apple Dense Material
备注:
点击查看摘要
Abstract:Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical "Generalization Paradox": while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.
23. 【2605.23719】Weierstrass Positional Encoding for Vision Transformers
链接:https://arxiv.org/abs/2605.23719
作者:Zhihang Xin,Rui Wang,Xitong Hu,Xiaojun Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Vision Transformers, achieved remarkable success, learnable one-dimensional positional, Transformers have achieved, computer vision
备注:
点击查看摘要
Abstract:Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs' ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.
24. 【2605.23699】CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
链接:https://arxiv.org/abs/2605.23699
作者:León Begiristain,Olaf Dünkel,Adam Kortylewski
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:exploit superficial visual, superficial visual correlations, generalizable world models, systems learn underlying, learn underlying causal
备注: 27 pages, 12 figures
点击查看摘要
Abstract:Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors - viewpoint, scene, object category, and object appearance - while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes for different interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions. The dataset and code are available at our project page.
25. 【2605.23672】RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video
链接:https://arxiv.org/abs/2605.23672
作者:Chenyu Wu,Wanhua Li,Zhu-Tian Chen,Hanspeter Pfister
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly challenging task, short-term complex deformations, long-term smooth transformations, Reconstructing dynamic, challenging task
备注:
点击查看摘要
Abstract:Reconstructing dynamic 3D scenes from monocular videos is a fundamental yet highly challenging task, as real-world motions often involve both long-term smooth transformations and short-term complex deformations. Existing methods either struggle to maintain temporal consistency or fail to capture high-frequency dynamics due to limited motion modeling capacity. In this work, we present Rigid-aware 4D Gaussian Splatting (RiGS), which simultaneously captures motions across multiple temporal scales. Specifically, RiGS introduces three types of Gaussian primitives: static, rigid, and transient, which represent static backgrounds, long-term low-frequency motions, and short-term high-frequency dynamics, respectively. An object-wise dynamic mask is proposed to aggregate long-range spatiotemporal motion information and guide the decomposition of static and dynamic regions. To jointly model motion across scales, rigid Gaussians are allowed to transition into transient Gaussians based on their temporal duration, and both are optimized under scene flow guidance, providing dense 3D motion supervision. Extensive experiments demonstrate that RiGS achieves state-of-the-art performance on novel view synthesis benchmarks. Code is available at \hyperlink{this https URL}{this https URL}.
26. 【2605.23656】Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models
链接:https://arxiv.org/abs/2605.23656
作者:Maxim Henry,Adrien Deliège,Sébastien Piérard,Marc Van Droogenbroeck
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:substantial computational resources, requires substantial computational, scratch requires substantial, requires substantial, Training
备注: 22 pages, 3 figures, 4 tables, and 34 references
点击查看摘要
Abstract:Training high-capacity vision models from scratch requires substantial computational resources. To improve training efficiency of a wide target model, existing growth methods often assume the availability of narrower models, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by coupling in a parameter-free block-diagonal way narrower, independently trained models in a recursive way. This allows a flexible allocation of the training budget available across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol shows a much better efficiency than models trained from scratch with the standard protocol, yielding 30% FLOPs reduction at similar test accuracies. It also achieves higher performances at same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.
27. 【2605.23655】CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
链接:https://arxiv.org/abs/2605.23655
作者:Liupeng Li,Haoqian Kang,Zhenyu Lu,Jinpeng Wang,Bin Chen,Ke Chen,Yaowei Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:large language models, multimodal large language, image perception presents, language models, perception presents
备注: Accepted by ICML 2026. 22 pages, 12 figures, 7 tables
点击查看摘要
Abstract:High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at this https URL.
28. 【2605.23653】ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction
链接:https://arxiv.org/abs/2605.23653
作者:Roi Papo,Idan Smoller,Shlomi Laufer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Timely and transparent, current assessment remains, assessment remains dependent, effective surgical training, limiting scalability
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R2 = 0.74). These results demonstrate that combining weakly-supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.
29. 【2605.23634】DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection
链接:https://arxiv.org/abs/2605.23634
作者:Yingjun Xiao(a),Xi Chen(b),Gang Fang(c),Siyuan Chen(b) ((a) School of Artificial Intelligence, Guangzhou University, Guangzhou, China, (b) School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China, (c) Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:future incremental learning, Open-world object detection, strong OWOD detectors, Open-world object, incremental learning
备注:
点击查看摘要
Abstract:Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2605.23634 [cs.CV]
(or
arXiv:2605.23634v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2605.23634
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
30. 【2605.23629】DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
链接:https://arxiv.org/abs/2605.23629
作者:Jiazhen Pan,Weixiang Shen,Jun Li,Julian Canisius,Felix Bitzer,Paula Roßmüller,Jiancheng Yang,Virginie Kreutzinger,Daniel Rueckert,Benedikt Wiestler
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fully specified vignette, single prediction, diagnosis, evidence, Medical
备注: 41 pages
点击查看摘要
Abstract:Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.
31. 【2605.23610】EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
链接:https://arxiv.org/abs/2605.23610
作者:Jente Vandersanden,Matheus Gadelha,Chun-Hao P. Huang,Hyeonho Jeong,Yulia Gryaditskaya
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:video generation requires, generation requires maintaining, Multi-shot video generation, shot-specific text prompts, video generation
备注:
点击查看摘要
Abstract:Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.
32. 【2605.23602】GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
链接:https://arxiv.org/abs/2605.23602
作者:Beibei Lin,Xiao Cao,Jingyuan Guo,Robby T. Tan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:effectively render high-quality, methods effectively render, Vision Foundation Model, effectively render, views
备注: Accepted by CVPR Findings 2026
点击查看摘要
Abstract:Existing 3DGS methods effectively render high-quality novel views in clear-day scenes. However, they struggle with night scenes, particularly in glow regions, due to the lack of structural features such as textures and edges, which are key cues for splatting-based reconstruction. To address this problem, we leverage a diffusion model and a Vision Foundation Model (VFM) to compensate for missing structural cues. Our method consists of two key novel ideas: semantic feature generation and novel-view semantic learning. First, semantic feature generation produces high-quality semantic features as implicit structural cues for novel views. Specifically, a diffusion model synthesizes novel views with unknown camera poses from training views, while a VFM evaluates their quality. Once high-quality novel views are identified, the VFM extracts robust features to construct the semantic feature bank. Second, novel-view semantic learning enables 3DGS to optimize rendered novel views without requiring ground truth. It achieves this by extracting semantic features from a rendered novel view, searching the feature bank for the most similar features, and minimizing their distance. This process enforces implicit structural constraints, ensuring semantically coherent, artifact-free rendered views. Extensive experiments demonstrate the effectiveness of our GlowGS in generating semantically accurate 3D views, showing significant improvements over existing methods.
33. 【2605.23595】Cost-Effective Model Evaluation with Meta-Learning
链接:https://arxiv.org/abs/2605.23595
作者:Trinh Pham,Viet Huynh,Hongzhi Yin,Quoc Viet Hung Nguyen,Thanh Tam Nguyen
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF)
关键词:newly released models, growth of machine, machine learning, learning has produced, produced an ever-expanding
备注:
点击查看摘要
Abstract:The rapid growth of machine learning has produced an ever-expanding ecosystem of models, making it increasingly challenging to verify the reliability of newly released models on unseen, unlabeled data. Conventional evaluation pipelines depend on expensive annotation, repeated fine-tuning, or narrow assumptions that fail to transfer across model families. We present MetaEvaluator, a cost-effective, model-agnostic framework for rapid, label-free assessment of unseen models spanning diverse architectures and modalities. MetaEvaluator leverages meta-learning over a pool of reference models to obtain a transferable initialization, enabling accurate evaluation of new models while amortizing cost across the pool and removing the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework capable of evaluating new models on entirely unlabeled datasets. Extensive experiments show that MetaEvaluator produces stable and accurate performance estimates at substantially reduced cost compared to conventional approaches, making scalable benchmarking of emerging models on unlabeled data practical.
34. 【2605.23580】Calibration-Informative Region Selection for Online LiDAR--Camera Calibration in Agricultural Environments
链接:https://arxiv.org/abs/2605.23580
作者:Rajitha de Silva,Grzegorz Cielniak
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:calibration requires identifying, multi-modal calibration requires, noise or ambiguity, requires identifying, constrain the extrinsic
备注: Accepted to ICRA 2026 Workshop on Agricultural Robotics
点击查看摘要
Abstract:Reliable multi-modal calibration requires identifying which observations truly constrain the extrinsic parameters and which ones mainly add noise or ambiguity. In this paper, we propose a support-map-driven approach to multi-modal calibration that decouples four functional blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. We instantiate this formulation for online LiDAR--camera calibration using MDPCalib, a target-less LiDAR--camera calibration method based on motion and deep point correspondences, and CMRNext, a dense LiDAR--camera matching model that predicts optical-flow-like image-plane residuals. The key contribution is a dense calibration support map that aggregates cross-modal agreement over aligned observations and highlights where calibration evidence is consistently reliable. Across the Bacchus Long-Term (BLT) dataset and KITTI, we show that calibration evidence is spatially and semantically non-uniform, indicating that some semantic regions provide stronger cues for calibration than others. On KITTI, support-guided refinement improves the calibration performance with better translation accuracy while rotational gains remain limited.
35. 【2605.23559】PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA
链接:https://arxiv.org/abs/2605.23559
作者:Chunze Yang,Qidong Liu,Wenjie Zhao,Yue Tang,Jiusong Ge,Di Zhang,Jiashuai Liu,Lei Wu,Junbo Lu,Ni Zhang,Xian Wu,Zeyu Gao,Chen Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Whole-slide image visual, free-form clinical query, strict inspection budget, Whole-slide image, visual question answering
备注:
点击查看摘要
Abstract:Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher this http URL code is available online.
36. 【2605.23555】Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos
链接:https://arxiv.org/abs/2605.23555
作者:Gangjian Zhang,Jian Shu,Sicheng Yu,Wenhao Shen,Yu Feng,Hao Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:photorealistic and animatable, monocular videos, paper addresses, addresses the challenge, challenge of reconstructing
备注:
点击查看摘要
Abstract:This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.
37. 【2605.23531】PixIE: Prompted Pixel-Space Low-Light Image Enhancement
链接:https://arxiv.org/abs/2605.23531
作者:Ruirui Lin,Guoxi Huang,David Bull,Nantheera Anantrasirichai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Low-light images exhibit, images exhibit severe, Low-light images, exhibit severe noise, contrast loss
备注:
点击查看摘要
Abstract:Low-light images exhibit severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically-prompted by a vision foundation model. PixIE first performs a cross-scale denoising to suppress noise and preserve structure, then refines details with DINO-Prompted Pixel Blocks (DPPB) that inject intermediate DINOv3 features via patch-conditioned, spatially continuous per-pixel modulation. We introduce a Spatial-Channel Compaction (SCC), which folds features into a compact spatial grid and compresses in the channel dimension, so pixel-attention is computed efficiently with bounded cost across scales. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves the average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further demonstrate that PixIE recovers sharper details and more stable textures, resulting in improved reconstruction fidelity and perceptual quality.
38. 【2605.23523】ComPose: When to Trust Hands for Object Pose Tracking
链接:https://arxiv.org/abs/2605.23523
作者:Jisu Shin,Junoh Lee,JunGyu Lee,Inhwan Bae,Dohyeon Lee,Hokyun Im,Youngwoon Lee,Hae-Gon Jeon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:key component, component for embodied, object, Reconstructing, hand
备注: 22 pages, 10 figures
点击查看摘要
Abstract:Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a \textit{complementary cue} for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce the temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.
39. 【2605.23522】Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
链接:https://arxiv.org/abs/2605.23522
作者:Jade Zou,Tao Huang,Weijie Kong,Junzhe Li,Yue Wu,Qi Tian,Jiangfeng Xiong,Jianwei Zhang,Liefeng Bo,Zhao Zhong
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Ordinary Differential Equation, Stochastic Differential Equation, Differential Equation, reverse-time Ordinary Differential, improve prompt alignment
备注:
点击查看摘要
Abstract:Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.
40. 【2605.23518】VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
链接:https://arxiv.org/abs/2605.23518
作者:Zhizhou Chen,Shanyan Guan,Zhanxin Gao,En Ci,Yanhao Ge,Wei Li,Zhenyu Zhang,Jian Yang,Ying Tai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modeling high-frequency texture, Directly editing, valuable but underexplored, primarily due, lack of high-quality
备注:
点击查看摘要
Abstract:Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution ($\geq$4096 $\times$ 4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.
41. 【2605.23508】DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
链接:https://arxiv.org/abs/2605.23508
作者:Chuanzhi Xu,Huiqi Liang,Bang Shi,Huiming Zhang,Yifan Xiao,Guangcheng Lin,Haodong Chen,Qiang Qu,Zhicheng Lu,Weidong Cai
类目:Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
关键词:requires high-fidelity synthesis, extended time spans, coherent narrative structure, generation requires high-fidelity, high-fidelity synthesis
备注: 45 pages, 19 figures
点击查看摘要
Abstract:Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.
42. 【2605.23507】MDS-DETR: DETR with Masked Duplicate Suppressor
链接:https://arxiv.org/abs/2605.23507
作者:Chanho Lee,Seunghee Koh,Yunho Jeon,Junmo Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:matching strategy suffers, object detector, DEtection TRansformer, low recall, strategy suffers
备注: code is available at [this https URL](https://github.com/DChoLee/MDS-DETR)
点击查看摘要
Abstract:The DEtection TRansformer (DETR) is a powerful end-to-end object detector, yet its one-to-one matching strategy suffers from slow convergence and low recall. A common approach to address this issue is to use one-to-many label assignment to provide more positive samples. However, existing methods that use one-to-many matching as an auxiliary objective lead to increased training costs, with their auxiliary decoders discarded during inference. To address this limitation, we propose MDS-DETR, which leverages both one-to-one and one-to-many supervision within a single decoder. Specifically, we introduce a Masked Duplicate Suppressor (MDS) that injects asymmetry into self-attention via confidence-based causal masking. MDS filters out the duplicates generated by the one-to-many supervised layer, enables explainable, duplicate-free predictions in a fully end-to-end framework. MDS-DETR outperforms existing one-to-many DETR variants such as MS-DETR, this http URL and Relation-DETR, without relying on any additional queries or auxiliary decoders. Under a 12-epoch training schedule on MS COCO with a ResNet-50 backbone, MDS-DETR achieves a +2.8 mAP improvement over Deformable-DETR with only a 5\% increase in training time, and outperforms the state-of-the-art this http URL by +0.3 mAP while being even 20\% faster in training. Our code and models are available at \href{this https URL}{this https URL}.
43. 【2605.23500】B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
链接:https://arxiv.org/abs/2605.23500
作者:Mario Markov,Stefan Maria Ailuro,Mohammad Mahdi,Luc Van Gool,Danda Pani Paudel(INSAIT, Sofia University "St. Kliment Ohridski")
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:underpinning pixel-level scene, pixel-level scene understanding, medical image analysis, computer vision, underpinning pixel-level
备注:
点击查看摘要
Abstract:Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.
44. 【2605.23482】Multimodal Distribution Matching for Vision-Language Dataset Distillation
链接:https://arxiv.org/abs/2605.23482
作者:Jongoh Jeong,Hoyong Kwon,Minseok Kim,Kuk-Jin Yoon
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:preserving downstream performance, Dataset distillation compresses, compresses large training, distillation compresses large, large training sets
备注: Accepted for publication at CVPR 2026. Project Page: [this https URL](https://andyj1.github.io/mdm)
点击查看摘要
Abstract:Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
45. 【2605.23478】PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
链接:https://arxiv.org/abs/2605.23478
作者:Yu Luo,Xiaogang Zhu,Shan Zeng,Wei Xiang,Thomas Francis Bishop,Zhiyong Wang,Kun Hu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:global food security, Accurate crop yield, Accurate crop, food security, crucial for sustainable
备注: Accepted by CVPR2026
点击查看摘要
Abstract:Accurate crop yield prediction is crucial for sustainable agriculture and global food security. While existing methods are predominantly developed for single-crop prediction, they often struggle to generalize across diverse crop types, without addressing the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling their responses with temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. And the CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.
46. 【2605.23472】Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks
链接:https://arxiv.org/abs/2605.23472
作者:Mehdi Gharbage,Céline Teulière,Pierre Bouges,Thierry Chateau
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recently shown strong, shown strong transfer, strong transfer capabilities, recently shown, shown strong
备注: Accepted to the CVPR 2026 Workshop on Vision Foundation Models for Industrial Inspection (VISION'26)
点击查看摘要
Abstract:Vision foundation models pretrained on web-scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web-data and often require fine-grained dense prediction, raising the question of whether modern self-supervised pretraining can improve over the conventional transfer-learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet-50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface-defect inspection and X-ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X-ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.
47. 【2605.23458】One-Forcing: Towards Stable One-Step Autoregressive Video Generation
链接:https://arxiv.org/abs/2605.23458
作者:Jiaqi Feng,Justin Cui,Yuanhao Ban,Cho-Jui Hsieh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:substantially improved real-time, improved real-time interactive, Recent advances, real-time interactive video, advances have substantially
备注: Work in Progress. Project Page: [this https URL](https://aurora-edu.github.io/one-forcing/) , Code: [this https URL](https://github.com/Aurora-edu/One-Forcing)
点击查看摘要
Abstract:Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.
48. 【2605.23451】Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention
链接:https://arxiv.org/abs/2605.23451
作者:Bingtian Qiao,Yue Shi,Yingjie Zhou,Yong Guo,Guangtao Zhai,Jiezhang Cao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unknown real-world degradations, Real-world image super-resolution, image super-resolution aims, recover high-quality images, real-world degradations
备注:
点击查看摘要
Abstract:Real-world image super-resolution aims to recover high-quality images from complex and unknown real-world degradations. However, existing generative Real-ISR methods largely inherit the dense latent representations and quadratic-cost global modeling paradigm developed for high-resolution image synthesis, causing computation, memory usage, and inference latency to scale unfavorably with resolution and thus limiting practical deployment. We argue that the key bottleneck lies not in insufficient restoration priors, but in excessive token redundancy and costly token interactions during high-resolution restoration. Motivated by this observation, we revisit Real-ISR from the perspectives of compact latent representation and linear-complexity modeling, and propose SANA-SR, an efficient one-step restoration framework. Specifically, SANA-SR employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, we introduce a linear-attention DiT with LoRA fine-tuning, enabling efficient high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets demonstrate that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods, while restoring clearer and more realistic textures. Moreover, after pruning, the deployed model runs in 0.019s with 407.95G MACs and 344M parameters, highlighting its strong potential for practical mobile deployment.
49. 【2605.23449】Commutator-Induced Uncertainty in VAEs
链接:https://arxiv.org/abs/2605.23449
作者:Tahereh Dehdarirad,Michael Felsberg,Gabriel Eilertsen,Ziliang Xiong
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
关键词:Variational autoencoders, represent non-commutative structure, non-commutative structure, Lie Group VAE, struggle to represent
备注:
点击查看摘要
Abstract:Variational autoencoders (VAEs) often struggle to represent non-commutative structure in learned latent spaces. Symmetry-aware VAEs commonly address this issue by enforcing commutativity through algebraic regularization, which is appropriate for commutative transformation groups but can suppress meaningful non-commutative structure when it is intrinsic to the data. We argue that non-commutativity should instead be explicitly diagnosed and reflected in reconstruction behavior. We introduce a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty while separating discrete generative factors from continuous geometric transformations. In a first phase, the model is trained without structural constraints while algebraic non-commutativity is measured through finite Baker-Campbell-Hausdorff deviations and decoder order sensitivity is measured through reconstruction order-swap tests. These diagnostics reveal a scale mismatch between latent non-commutativity and reconstruction behavior under unconstrained training. In a second phase, we introduce a deformation-stability constraint with a data-driven calibration constant that aligns decoder sensitivity with algebraic non-commutativity. We evaluate the framework on dSprites, 3DShapes, 3DCars, and CelebA against generic and symmetry-aware baselines, including beta-VAE, CLG-VAE, and CFASL. Across synthetic benchmarks, the method improves reconstruction quality and yields decoder-level behavior more consistent with latent non-commutative structure. Qualitative analyses show clearer order-dependent latent compositions and more stable reconstructions. On CelebA, the model yields more faithful reconstructions and factor-specific latent traversals than CFASL, while also exhibiting meaningful order-dependent interactions between learned latent directions.
50. 【2605.23445】DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
链接:https://arxiv.org/abs/2605.23445
作者:Jie Hu,Zixiang Gao,Yutong He,Kun Yuan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable success, incurs prohibitive computational, prohibitive computational cost, computational cost due, Diffusion transformers
备注: ICML 2026; 17 pages, 8 figures;
点击查看摘要
Abstract:Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1$\times$ end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at this https URL.
51. 【2605.23428】FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis
链接:https://arxiv.org/abs/2605.23428
作者:Kakia Panagidi,Stathes Hadjieftymiadis
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:wireless sensor multimedia, IoT-based camera networks, modern multimedia systems, sensor multimedia systems, modern multimedia
备注:
点击查看摘要
Abstract:In modern multimedia systems, efficient video processing is critical, especially in resource-constrained environments such as IoT-based camera networks, autonomous platforms, and wireless sensor multimedia systems. A key bottleneck in video compression and understanding is block motion estimation (ME), a process that remains computationally expensive despite the development of fast search techniques. This work introduces an Optimal Stopping Theory (OST) algorithm for block motion estimation based on the assessment of spatiotemporal differences within and across video frames. It also proposes a semantic-aware motion estimation framework that integrates Foundation Models (FMs) with the OST-based decision process. By leveraging pretrained visual models such as Vision Transformers (ViT) and the Segment Anything Model (SAM), the framework extracts semantic attention scores that indicate the importance of motion within specific spatial regions. These scores are fused with traditional distortion-based metrics, such as the Sum of Absolute Differences (SAD), to guide a hybrid stopping criterion that jointly considers motion magnitude and semantic relevance. The resulting adaptive algorithm stops early in redundant regions while continuing the search in areas where motion is semantically significant. Experiments compare the proposed solution with widely used approaches from the literature on benchmark and multimodal video datasets. The proposed method achieves a significant reduction in computation with minimal accuracy loss and improved semantic coverage. The results highlight the benefits of bridging low-level motion analysis with high-level semantic reasoning, offering a promising direction for efficient multimodal video understanding in next-generation smart systems.
52. 【2605.23411】Sample-wise Targeted Adversarial Attacks on Test-time Adaptation
链接:https://arxiv.org/abs/2605.23411
作者:Phuc Duc Nguyen,Quang Duc Nguyen
类目:Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Test-time adaptation, effectively counters distribution, counters distribution shifts, unlabeled test stream, effectively counters
备注: 32 pages, 17 figures
点击查看摘要
Abstract:Test-time adaptation (TTA) effectively counters distribution shifts but exposes models to adversarial manipulation via the unlabeled test stream. Existing class-wise targeted attacks remain impractical for stealthy exploitation in this setting: since TTA operates on batches, forcing a subset of samples toward a target label unintentionally pulls similar benign samples along, resulting in a conspicuously high frequency of the target label that is easy to detect. To capture a more realistic threat, we introduce a sample-wise targeted attack. Unlike prior approaches, the attacker aims to misclassify only inputs carrying an attacker-chosen trigger, while preserving the global label distribution of benign queries to evade detection. To achieve this, we propose a meta-learning-based attack with a novel priority-aware gradient alignment strategy that explicitly prioritizes attack success. The strategy formulates the gradient update as an ellipsoidal trust-region problem, mitigating the misalignment between attack success and distributional stealth, while providing theoretical guarantees for effective optimization of the attack objective in the presence of gradient misalignment. Extensive experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across TTA protocols demonstrate that our method achieves high targeted success rates while maintaining a label distribution that is consistent with the no-attack baseline, making it difficult to detect in unlabeled TTA deployment scenarios. Furthermore, we demonstrate that our attack shows strong robustness against existing defenses.
53. 【2605.23410】What Linear Probes Miss: Multi-View Probing for Weight-Space Learning
链接:https://arxiv.org/abs/2605.23410
作者:Eunwoo Heo,Kyeongkook Seo,Jaejun Yoo
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:open-source model repositories, documentation or metadata, explosive growth, growth of open-source, repositories has created
备注: Accepted at ICML 2026. Code: [this https URL](https://github.com/AI-hew-math/MVProbe) ; Project page: [this https URL](https://ai-hew-math.github.io/MVProbe/)
点击查看摘要
Abstract:The explosive growth of open-source model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight-space learning offers a pathway to identify and analyze these models directly from their parameters, processing full-scale weights is computationally prohibitive. Probing-based methods have emerged as a lightweight alternative, extracting permutation-equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single-view design: they capture first-order structures but fail to encode the rich, higher-order correlation patterns inherent in row-column interactions. To bridge this gap, we introduce MVProbe, a multi-perspective probing framework that synthesizes first-order signals with interaction-aware (Gram-based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state-of-the-art ProbeX across diverse architectures, including discriminative backbones (ResNet, SupViT, MAE, DINO) and large-scale generative LoRA adapters (Stable Diffusion LoRA).
54. 【2605.23409】Online Hand Gesture Recognition Using 3D Convolutional Neural Networks
链接:https://arxiv.org/abs/2605.23409
作者:Yinghao Qin,Tijana Timotijevic
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:human computer interaction, real-time video stream, people perform gestures, dynamic hand gestures, hand gesture recognition
备注: Master's dissertation work written in Autumn 2020
点击查看摘要
Abstract:In human computer interaction, real-time detection and classification of dynamic hand gestures is challenging as: 1) the system must run in a real-time video stream and there is no noticeable lag in response after performing a gesture; 2) there is a large difference in how people perform gestures, making recognition more difficult. In this paper, an online hand gesture recognition system is proposed, which is able to localize gestures in real-time video stream and recognize what these gestures are. To improve the robustness of the system, the sliding window approach is used to refine results from multiple windows. All of the models in my project are trained on Jester database, achieving 98+% accuracy for detector and 90+% accuracy for classifier. For the overall performance of the system, the best group can respond within three seconds and reach 37.5% Levenshtein accuracy on the homemade dataset. The project codes used in this work are publicly available.
55. 【2605.23406】RS2AD-LiDAR: End-to-End Autonomous Driving LiDAR Data Generation from Roadside Sensor Observations
链接:https://arxiv.org/abs/2605.23406
作者:Runyi Huang,Ni Ding,Ruidan Xing,Yuheng Shi,Lei He,Keqiang Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving solutions, autonomous driving technology, fine-grained control commands, directly process multimodal, process multimodal sensory
备注:
点击查看摘要
Abstract:End-to-end autonomous driving solutions, which directly process multimodal sensory data and output fine-grained control commands, have gradually become a mainstream direction with the development of autonomous driving technology. However, current methods in this category rely on single-vehicle data collection for model training and optimization, which suffers from high acquisition and annotation costs, scarcity of valuable scenarios, and data silos. To address these challenges, we propose RS2AD-LiDAR, a novel framework for reconstructing and generating vehicle-mounted LiDAR data from roadside sensor observations. Since no public dataset currently provides highly overlapping perception coverage between roadside and vehicle-mounted LiDAR sensors, which is essential for studying roadside-to-vehicle data generation, we constructed a dedicated dataset named R2V-LiDAR which is used solely for evaluation in this work. Specifically, our method transforms roadside LiDAR point clouds into the vehicle-mounted LiDAR coordinate system, and synthesizes high-fidelity vehicle-mounted data via virtual LiDAR modeling and point cloud resampling techniques. To the best of our knowledge, this is the first approach to reconstruct vehicle-mounted LiDAR data from roadside sensor inputs. Extensive experimental comparisons demonstrate the semantic similarity between the generated data and real data. Furthermore, object detection experiments show that incorporating the generated data into real data for model training improves both Bird's Eye View (BEV) and 3D detection accuracy, thereby validating the effectiveness of the proposed method.
56. 【2605.23397】Joint Target-Less Intrinsic and Extrinsic Camera-LiDAR Calibration using Deep Point Correspondences
链接:https://arxiv.org/abs/2605.23397
作者:Simon Bultmann,Daniele Cattaneo,Abhinav Valada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:robust multi-modal perception, perception in robotics, prerequisite for robust, robust multi-modal, multi-modal perception
备注: presented at 2nd German Robotics Conference (GRC)
点击查看摘要
Abstract:Accurate camera-LiDAR calibration is a prerequisite for robust multi-modal perception in robotics. Recent target-less approaches based on deep point correspondences achieve remarkable performance for extrinsic calibration but assume rectified images with known intrinsics. In this work, we overcome this limitation and present the first fully target-less pipeline that jointly estimates camera intrinsics (pinhole model with radial-tangential distortion) and camera-LiDAR extrinsics with deep pixel-point correspondences. Our approach extends deep correspondence-based calibration by (i) automatic intrinsic initialization via structure-from-motion, (ii) generalizing camera-LiDAR matching to raw images with unknown intrinsics including distortion, and (iii) tightly coupling correspondence estimation with joint nonlinear optimization over both intrinsics and extrinsics. We evaluate our method on the KITTI dataset with unseen camera-LiDAR pairs and demonstrate that joint calibration achieves improved extrinsic accuracy while additionally recovering accurate intrinsics.
57. 【2605.23381】VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
链接:https://arxiv.org/abs/2605.23381
作者:Junwen Tan,Jinglin Liang,Hongyuan Chen,Shuangping Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:slow inference speeds, achieved remarkable performance, rectified flow models, inference speeds, rectified flow
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input, leading to reduced output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional stability for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model's state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22 times and achieves an LPIPS of 0.069 on Qwen-Image, outperforming the best baseline with a 52.2% reduction.
58. 【2605.23355】Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization
链接:https://arxiv.org/abs/2605.23355
作者:Tianyu Wang(1),Junjie Wu(1 and 2),Jingquan Gao(1),Shishuo Li(1) ((1) School of Economics and Management, Beihang University, Beijing 100191, China (2) Key Laboratory of Data Intelligence and Management, Beihang University, Ministry of Industry and Information Technology, Beijing 100191, China)
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:remain underexplored due, generic video understanding, Temporal Action Localization, professional badminton videos, remain underexplored
备注: 11 pages, 11figures
点击查看摘要
Abstract:Temporal Action Localization (TAL) has been extensively studied in generic video understanding, while fine-grained sports scenarios, such as professional badminton, remain underexplored due to their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which consists of 31 matches with 29 fine-grained stroke categories, covering 2104 rallies and 27597 annotated actions. To effectively capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which enables efficient modeling of spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches, capturing temporal dynamics as well as vertical and horizontal spatial variations. The design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark demonstrate that the proposed method achieves state-of-the-art performance while introducing only a marginal increase in computational and parameter cost. These results validate the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.
59. 【2605.23345】SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
链接:https://arxiv.org/abs/2605.23345
作者:Zizhao Tong,Hongfeng Lai,Zeqing Wang,Zhaohu Xing,Kexu Cheng,Haoran Xu,Zhao Pu,Shangwen Zhu,Ruili Feng,Jian Zhao,Yan Zhang,Hao Tang,Yeying Jin,Ling Shao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Interactive world models, resolve high-frequency overlapping, high-frequency overlapping control, Interactive world, disrupting unaffected regions
备注: Project page: [this https URL](https://z2tong.github.io/SCOPE/) . Code is available at [this https URL](https://github.com/z2tong/SCOPE)
点击查看摘要
Abstract:Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
60. 【2605.23344】CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs
链接:https://arxiv.org/abs/2605.23344
作者:Xiaoyi Huang,Kejia Zhang,Zhiming Luo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:multimodal reasoning capabilities, language priors dominate, priors dominate insufficient, Large Vision-Language Models, shown strong multimodal
备注:
点击查看摘要
Abstract:Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
61. 【2605.23327】GFSR: Geometric Fidelity and Spatial Refinement for Reliable Lane Detection
链接:https://arxiv.org/abs/2605.23327
作者:Tiancheng Wang,Zhaolu Ding,Richeng Xu,Tianhui Zheng,Hui Liu,Hanyu Xuan,Zhiliang Wu,Guanghui Yue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:driver assistance systems, crucial perception task, advanced driver assistance, Lane detection stands, assistance systems
备注: Submitted to IEEE Transactions on Intelligent Transportation Systems (under review). 12 pages, 6 figures
点击查看摘要
Abstract:Lane detection stands as a crucial perception task in autonomous driving and advanced driver assistance systems. However, existing methods still degrade in complex real scenarios due to two major limitations. First, classification confidence only characterizes the categorical existence of lane candidates and has no strong correlation with geometric quality. If threshold filtering and NMS are conducted merely based on this confidence, the model tends to retain lane priors with high confidence while eliminating those with lower confidence but superior geometric representation. Secondly, existing regression modules weaken correlations among sampling points, hindering fine-grained optimization of distant, high-curvature and complex-topology lanes and causing underfitting. To address these issues, we propose Geometric Fidelity and Spatial Refinement (GFSR), a framework consisting of LaneIoU-guided Confidence Calibration (LCC) and Adaptive Gated Location Refinement (AGLR). Specifically, LCC adopts LaneIoU as soft supervision to explicitly estimate geometric fidelity of lane priors, which is further fused with classification confidence to construct the collaborative reliability index (CRI). This index guides threshold filtering and NMS, effectively retaining lane priors with high classification confidence and favorable geometric quality. Meanwhile, cooperating with regression heads in each refinement stage, AGLR predicts sampling point lateral offsets and adopts a gating mechanism to adaptively regulate correction magnitude, strengthen inter-point correlations and boost model adaptability as well as robustness toward complex lane scenarios. Extensive experiments on CULane and CurveLanes demonstrate that our GFSR achieves state-of-the-art performance on CULane, with F1@50 and F1@75 scores of 81.46% and 65.01%, and reaches 87.35% F1@50 on CurveLanes.
62. 【2605.23324】Enhancing Blood Cells Classification using Hybrid Quantum Neural Networks
链接:https://arxiv.org/abs/2605.23324
作者:Guilherme Cruz,Nouhaila Innan,Alberto Marchisio,Gabriel Falcao,Muhammad Shafique
类目:Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
关键词:challenge conventional deep, conventional deep learning, Blood Cell Images, microscopic blood cells, deep learning models
备注: 11 pages, 13 figures
点击查看摘要
Abstract:Accurate classification of microscopic blood cells is still a critical task in medical image analysis, where subtle variations and limited data can challenge conventional deep learning models. As such, we investigate in this work the potential of Hybrid Quantum-Classical Neural Networks (HQNNs) to enhance feature representation and improve classification performance in this domain. We propose a modular architecture combining a pre-trained ResNet-50 backbone with a low-dimensional latent bottleneck and a variational quantum circuit, enabling a direct comparison between quantum-enhanced and purely classical transformation mechanisms. To isolate the contribution of the quantum component, we evaluate three architectures: a HQNN model, a Classical Matched Model with an additional nonlinear transformation layer of comparable capacity, and a baseline model without an intermediate transformation stage. Experiments conducted on two publicly available blood cell datasets, namely the Blood Cell Images dataset and the PBC dataset, demonstrate that HQNNs consistently achieve superior or more balanced performance across evaluation metrics. In the Blood Cell Images Dataset, the proposed approach improves macro F1-score by up to 3.7% compared to classical baselines, while improving the F1-score from 98.54% to 98.69% in the more challenging 8-class scenario with near-saturated performance. Additional evaluation on IBM quantum hardware shows that the model remains robust under noise, with only a modest performance degradation relative to simulated results. These results indicate that quantum feature transformations can enhance discriminative representations, particularly in challenging classification scenarios, and highlight the practical potential of HQNN models for medical imaging tasks.
63. 【2605.23304】General Hazard Detection
链接:https://arxiv.org/abs/2605.23304
作者:Stephanie Ng,CP Lim,SueJen Looi,Hendrik Zurlinden,David Nguyen,Lei Wei,Saeid Nahavandi,Hailing Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:cognitive-level logical reasoning, typically defined, defined through cognitive-level, cognitive-level logical, Hazard
备注: 20 pages, 7 figures and 4 tables
点击查看摘要
Abstract:Hazard, as an abstract concept, is typically defined through cognitive-level logical reasoning rather than concrete examples. In contrast, existing hazard detection systems rely on predefined hazard categories and require intensive collection of labelled examples within detection or classification architectures. This approach faces three fundamental challenges when addressing abstract safety concepts: (1) noisy and sparse training data, (2) dynamically evolving definitions that change across contexts and time, and (3) limited generalisation to unseen or novel scenarios. To address these limitations, we present the CompliVision dataset, the first general-purpose hazard dataset designed for rule-based compliance assessment, along with a baseline framework for hazard evaluation. Our key innovation is decoupling the hazard concept from image-based examples by expressing safety requirements through language-based rules. We ground our approach in authoritative domain regulations and ISO standards to define diverse hazard concepts across multiple domains. The CompliVision dataset comprises 3,006 images spanning traffic, construction, and warehouse environments, with each image annotated for compliance against specific safety rules, accompanied by natural language explanations highlighting the supporting visual evidence. To achieve robust generalisation, we develop an active learning framework to more effectively guide and refine vision-language models in assessing hazard compliance. While state-of-the-art VLMs demonstrate strong capabilities, they struggle with the fine-grained, context-dependent interpretation required for accurate safety assessment. We proposed a general hazard detection framework to address this limitation which combines LLaVA-based visual reasoning with with human-in-the-loop feedback.
64. 【2605.23288】Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition
链接:https://arxiv.org/abs/2605.23288
作者:Yerim So,Jiyeong Kim,Jiwon Yoon,Dongbo Min
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent Open-Vocabulary Action, methods typically aggregate, computing text alignment, typically aggregate visual, aggregate visual features
备注:
点击查看摘要
Abstract:Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities. SimVA constructs a spatio-temporal similarity volume over local video tokens and action classes, and employs class sampling to ensure similarity aggregation scalable to large vocabularies. The similarity volume is refined by spatial aggregation, which contextualizes local similarity patterns to improve intra-frame consistency. Motion-aware modulation further injects inter-frame variation cues, highlighting dynamically changing regions. Mamba-based temporal aggregation then models the evolution of class-conditioned similarity patterns across frames. By maintaining dense visual-text correspondence, SimVA effectively transfers CLIP to video action recognition, achieving competitive performance across zero-shot, few-shot, and base-to-novel benchmarks.
65. 【2605.23287】LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images
链接:https://arxiv.org/abs/2605.23287
作者:Yilong Liu,Wanhua Li,Chen Zhu-Tian,Hanspeter Pfister
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Language Gaussian Splatting, Language Gaussian, Gaussian Splatting, Gaussian primitives enriched, unposed multi-view images
备注: CVPRF 2026
点击查看摘要
Abstract:We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts the geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enriched the RealEstate10k dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information, while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. Demo is available at this https URL.
66. 【2605.23281】DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
链接:https://arxiv.org/abs/2605.23281
作者:Jie Zhu,Girish Chandar Ganesan,Xiaoming Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diverse camera settings, achieved strong progress, Monocular metric depth, metric depth estimation, universal-camera modeling
备注:
点击查看摘要
Abstract:Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose \textbf{\ours}, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that \ours consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.
67. 【2605.23274】U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025
链接:https://arxiv.org/abs/2605.23274
作者:Duc-Nhuan Le,Hoang-Phuc Nguyen,Thanh-Duy Lam,Minh-Nhut Dang,Minh-Hoang Le
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Clip-based Event Search, Event Search Engine, Retrieving events, Unified Clip-based Event, multimodal event retrieval
备注: Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)
点击查看摘要
Abstract:Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method using JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by Recurrent Neural Network, generating detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
68. 【2605.23271】EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
链接:https://arxiv.org/abs/2605.23271
作者:Songlin Yang,Haobin Zhong,Ruilin Zhang,Xiaotong Zhao,Shuai Li,Kai Zheng,Xuyi Yang,Zhe Wang,Zhenchen Tang,Yang Li,Bohai Gu,Zhengwei Peng,Yidan Huang,Mengzhou Luo,Yihang Bo,Dalu Feng,Yujia Zhang,Juntao Ma,Ruiqi Wang,Lvmin Zhang,Yuwei Guo,Frank Guan,Maneesh Agrawala,Hongbo Fu,Alan Zhao,Anyi Rao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:professional-grade cinematic synthesis, generative video foundation, rapid evolution, evolution of generative, propelled the field
备注:
点击查看摘要
Abstract:The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community transitions towards Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate ''whether it is right'' (basic prompt-following) while fundamentally neglecting ''whether it is good'' (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational ''rightness'' metrics, but also significantly expands the criteria to ''goodness'' and broaden the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agent.
69. 【2605.23270】ChainFlow-VLA: Causal Flow Planning with Vision-Language Models
链接:https://arxiv.org/abs/2605.23270
作者:Xiyang Wang,Xinlin Wang,Tingguang Zhou,Gong Chen,Xingtai Gui,Zhi Xu,Xiaolei Wu,Feiyang Tan,Hangning Zhou,Mu Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:autonomous driving systems, temporal causal reasoning, autonomous driving, driving systems, systems are fundamentally
备注:
点击查看摘要
Abstract:Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which unifies causal generation and global refinement within a unified probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA achieves robust planning in ambiguous and long-tail scenarios, achieving a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, matching human-level performance (94.8). Code will be available at this https URL.
70. 【2605.23264】Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution
链接:https://arxiv.org/abs/2605.23264
作者:Hongbo Wang,Huaibo Huang,Pin Wang,Jinhua Hao,Chao Zhou,Ran He
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:compromise faithful restoration, natural image manifold, intrinsic natural image, fundamental spectral misalignment, Image Super-Resolution
备注: Accepted to ICML 2026
点击查看摘要
Abstract:Generative priors in Image Super-Resolution (SR) often compromise faithful restoration, we attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. Driving this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.
71. 【2605.23257】urning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation
链接:https://arxiv.org/abs/2605.23257
作者:Zixuan Hu,Xuantuo Huang,Yancheng Li,Yichun Hu,Shengyong Xu,Ling-Yu Duan
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:non-stationary environment shifts, environment shifts poses, Navigating under non-stationary, agent deployed, non-stationary environment
备注: Accepted by ICML 2026
点击查看摘要
Abstract:Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we propose Inter-Domain BridgE with Historical Assets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing.
72. 【2605.23254】CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels
链接:https://arxiv.org/abs/2605.23254
作者:Mengke Li,Haiquan Ling,Lihao Chen,Yang Lu,Yiqun Zhang,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:data is frequently, frequently hindered, compound challenge, Learning, Learning from real-world
备注: poster in ICML 2026
点击查看摘要
Abstract:Learning from real-world data is frequently hindered by the compound challenge of long-tailed class distributions and noisy annotations. Existing methods partially address these issues but typically ignore the non-uniform impact of label noise across classes, resulting in ineffective correction for tail classes and over-regularization for head classes. To address this issue, we propose Class-Adaptive Rectification with Experts (CARE), a parameter-efficient framework that leverages three complementary supervision sources from vision-language models (VLM): observed noisy labels, VLM text embeddings, and visual features. CARE introduces a class-adaptive expert consensus mechanism that enforces stricter agreement for tail classes and more permissive agreement for head classes based on class frequency. By aggregating high-confidence predictions across these sources, CARE filters unreliable signals and recalibrates class distributions, yielding more reliable rectification under long-tailed distributions. Extensive experiments on both synthetic and real-world benchmarks demonstrate that CARE consistently outperforms state-of-the-art methods, achieving up to 3.0\% performance gains. The source code is available at this https URL.
73. 【2605.23245】SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion
链接:https://arxiv.org/abs/2605.23245
作者:Xinyu Chen,Yuyi Qian,Jiang Lin,Shenyi Wang,Gao Wang,Zhiqiu Zhang,Jizhi Zhang,Mingjie Wang,Qiang Tang,Qian Wang,Song Wu,Zili Yi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:simple content placement, insertion requires ensuring, requires ensuring spatio-temporal, ensuring spatio-temporal coherence, object insertion requires
备注: Accepted by ICME2026
点击查看摘要
Abstract:Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present \textit{SimInsert}, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate our efficacy: SimInsert surpasses state-of-the-art methods with an 18.8\% gain in PSNR, 20.1\% in SSIM, and a 44.1\% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.
74. 【2605.23237】StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes
链接:https://arxiv.org/abs/2605.23237
作者:Yangzhi Cui,Feng Qiao,Nathan Jacobs
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:stereo geometry estimation, determine binocular geometry, synthesis require paired, require paired data, condition-controlled view synthesis
备注:
点击查看摘要
Abstract:Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at this https URL
75. 【2605.23231】Beyond Normal References: Discriminative Few-Shot Anomaly Detection
链接:https://arxiv.org/abs/2605.23231
作者:Huan Wang,Jun Shen,Jun Yan,Guansong Pang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:few-shot anomaly detection, practical few-shot anomaly, termed discriminative FSAD, intrinsic deviation, FSAD
备注: 31 pages
点击查看摘要
Abstract:This paper considers a practical few-shot anomaly detection (FSAD) setting, termed discriminative FSAD, where a limited number of both normal and anomalous examples are available as references during inference. Existing FSAD methods rely on normal-only references through normality matching, ignoring the discriminative clues in anomalous references, while directly fitting both references can overfit to the seen anomalies. We introduce IDEAL, an intrinsic deviation learning framework that leverages both reference types to learn intrinsic deviation patterns characterizing generalizable abnormality as deviations from normality. IDEAL decomposes the learning process into two novel components: 1) a Normal Variation Eraser to suppress nuisance normal variations that may lead to noisy deviations from normality, thereby highlighting anomaly-relevant deviation representations; 2) an Intrinsic Deviation Encoder to decompose these denoised deviation representations into intrinsic deviation vectors capturing the most discriminative orthogonal deviation directions. At inference, IDEAL scores query-to-normal deviations preserved after projection onto the learned intrinsic deviation vectors, enabling generalization for both seen and unseen anomalies. Extensive experiments on eight real-world datasets show that IDEAL generalizes effectively to unseen anomalies and consistently outperforms existing state-of-the-art FSAD methods. Code and data will be available at \href{this https URL}{this https URL}.
76. 【2605.23216】CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
链接:https://arxiv.org/abs/2605.23216
作者:Mingfang Zhang,Jingjing Pan,Ashutosh Kumar,Rajat Saini,Mustafa Erdogan,Hsuan-Kung Yang,Caixin Kang,Yifei Huang,Yoichi Sato,Quan Kong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significant challenge, challenge for Vision-Language, surface-level perception, deeper understanding, causal
备注: CVPR 2026
点击查看摘要
Abstract:Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.
77. 【2605.23203】Lipschitz Optimization for Formal Verification of Homographies
链接:https://arxiv.org/abs/2605.23203
作者:Jean-Guillaume Durand,Panagiotis Kouvaros,Maxime Gariel,Alessio Lomuscio
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:regulated industries requires, industries requires formal, formal robustness guarantees, requires formal robustness, regulated industries
备注: 18 pages, 13 figures, 6 tables, to be published at CVPR 2026
点击查看摘要
Abstract:The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploy many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at this https URL .
78. 【2605.23192】Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing
链接:https://arxiv.org/abs/2605.23192
作者:Lin Liu,Zhihan Xiao,Haohang Xu,Rong Cong,Zhibo Zhang,Xiaopeng Zhang,Qi Tian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling diverse object-level, natural language instructions, recently achieved remarkable, achieved remarkable progress, diverse object-level manipulations
备注:
点击查看摘要
Abstract:Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness for avoiding truncated observations, cycle-consistent tracking stability for measuring physical reliability, and vision-language-based attribute visibility for ensuring semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness and high-quality performance of our method.
79. 【2605.23187】IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction
链接:https://arxiv.org/abs/2605.23187
作者:Lin Qian,Shijie Li,Sihao Lin,Xuan Zhang,Bangya Liu,Yanran Li,Hujun Yin
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Existing object navigation, Existing object, microwave or chair, percent, Existing
备注: preprint
点击查看摘要
Abstract:Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
80. 【2605.23178】Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes
链接:https://arxiv.org/abs/2605.23178
作者:Wenxuan Peng,Bharath Hariharan,Hadar Averbuch-Elor
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generate semantically diverse, poorly grounded interactions, compositionally accurate multi-person, recent progress, repetitive layouts
备注: Accepted to SIGGRAPH Conference Papers 2026. 22 pages, 12 figures. Project page: [this https URL](https://cornell-vailab.github.io/PeopleComposer/)
点击查看摘要
Abstract:Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.
81. 【2605.23176】DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
链接:https://arxiv.org/abs/2605.23176
作者:Hao Vo,Khoa Vo,Phu Loc Nguyen,Sieu Tran,Duc Minh Nguyen,Ngo Xuan Cuong,Gladys Gawugah,Sreevenkata Anjani Tishita Godavarthi,Chase Rainwater,Nghi D. Q. Bui,Anh Nguyen,Duy Minh Ho Nguyen,Ngan Le
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:coherent scene representation, integrate multi-view observations, maintain object continuity, Cognitive Scene Construction, requires an agent
备注:
点击查看摘要
Abstract:Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
82. 【2605.23174】LQ-rPPG: A Label-Quantized Coarse-to-Fine Learning Framework for Remote Physiological Measurement
链接:https://arxiv.org/abs/2605.23174
作者:Jun Seong Lee,Samyeul Noh,Changki Sung,Hyun Myung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:daily health monitoring, Remote photoplethysmography, remote healthcare, enables non-contact measurement, learning-based rPPG methods
备注:
点击查看摘要
Abstract:Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, offering strong potential for remote healthcare and daily health monitoring. Driven by this potential, various deep learning-based rPPG methods have been proposed to improve rPPG estimation. However, previous deep learning-based rPPG methods have paid little attention to the quality of training labels and their impact on model learning. Contact-based PPG signals used as training labels often contain noise and variability caused by motion artifacts, inconsistent sensor contact, and morphological distortions. Such label inconsistency can lead models to overfit to the label noise and variability and consequently degrade generalization performance. To address this issue, we propose LQ-rPPG, a label-quantized coarse-to-fine learning framework for robust rPPG estimation. LQ-rPPG consists of a label quantization module and a coarse-to-fine rPPG estimation model. The label quantization module transforms continuous PPG signals into multi-bit quantized pseudo labels with reduced noise and variability. The coarse-to-fine estimation model progressively refines rPPG signals under hierarchical supervision guided by the multi-bit pseudo labels. This design alleviates overfitting to label-specific variations and enables the model to learn structured and consistent representations. As a result, LQ-rPPG achieves robust and generalizable rPPG estimation even under challenging conditions. Experiments on multiple benchmark datasets demonstrate that LQ-rPPG achieves strong performance in both intra- and cross-dataset evaluations, while reducing parameters and multiply-accumulate operations by 88% and 29%, respectively, and increasing throughput by 191%. The code is available at this https URL.
83. 【2605.23160】Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping
链接:https://arxiv.org/abs/2605.23160
作者:Nitin Vegesna,Avideh Zakhor
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:present Semantic-Aware Guided, Semantic-Aware Guided Exploration, preserves coverage-oriented behavior, reprioritize frontier selection, Semantic-Aware Guided
备注: 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)
点击查看摘要
Abstract:We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.
84. 【2605.23144】SLIP-RS: Structured-Attribute Language-Image Pre-Training for Remote Sensing Object Detection
链接:https://arxiv.org/abs/2605.23144
作者:Chenxu Wang,Yuxuan Li,Yunheng Li,Xiang Li,Jingyuan Xia,Qibin Hou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Monolithic Label Learning, Existing language-image pre-training, inherent data scarcity, Monolithic Label, domain inherent data
备注:
点击查看摘要
Abstract:Existing language-image pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain's inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for remote sensing. Code: this https URL.
85. 【2605.23141】VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
链接:https://arxiv.org/abs/2605.23141
作者:Zhaonan Li,Kyle R. Chickering,Bangzheng Li,Jacob Dineen,Xiao Ye,Zhikun Xu,Shijie Lu,Yuxi Huang,Ming Shen,Bach Nguyen,Jaya Adithya Pavuluri,Mau Son Nguyen,Sanika Chavan,Ngoc Minh Thu Le,Muhao Chen,Ben Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:manipulate concept-level properties, visual concept learning, concept learning, preserve and manipulate, manipulate concept-level
备注: Accepted to the Workshop on Visual Concepts at CVPR 2026 as a non-archival report
点击查看摘要
Abstract:A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: images $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when $D$ is directly shown, and degrades sharply as transformation depth increases, while human performance remains near the ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at this https URL.
86. 【2605.23118】Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking
链接:https://arxiv.org/abs/2605.23118
作者:Yannick Kirchhoff,Maximilian Rokuss,Daniel Philipp Mertens,David Füller,Benjamin Hamm,Andreas Schreyer,Oliver Ritter,Klaus Maier-Hein
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:oncological response assessment, Tracking tumor lesions, response assessment, serial CT scans, oncological response
备注: Accepted at MICCAI 2026
点击查看摘要
Abstract:Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally-informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, proving essential for exploiting longitudinal context, improving performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both fully automatic and the proposed verified tracking setting offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at this https URL
87. 【2605.23116】CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection
链接:https://arxiv.org/abs/2605.23116
作者:Hyeongmuk Lim,Youngbum Hur
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:strong domain dependency, methods typically rely, Video Anomaly Detection, Existing Video Anomaly, Anomaly Detection
备注: Accepted to ICPR 2026
点击查看摘要
Abstract:Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: this https URL
88. 【2605.23113】Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
链接:https://arxiv.org/abs/2605.23113
作者:Jiayu Xiong,Jing Wang,Qi Zhang,Wanlong Wang,Jun Xue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Audio-visual deepfake localization, Audio-visual deepfake, deepfake localization demands, temporal evidence, localization demands interval-level
备注: Accepted by CVPR2026
点击查看摘要
Abstract:Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With the Schrödinger Bridge (SB), IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer, avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3%~10%, and yields improved high-precision localization, particularly for single-sided forgeries.
89. 【2605.23070】Flow Mismatching: Unsupervised Anomaly Detection via Velocity Discrepancies in Flow Matching Models
链接:https://arxiv.org/abs/2605.23070
作者:Shengzhe Chen,Mehrdad Moradi,Kamran Paynabar,Hao Yan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:propose Flow Mismatching, avoids reconstruction-based paradigms, Flow Mismatching, unsupervised anomaly detection, anomaly detection method
备注:
点击查看摘要
Abstract:We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal generative dynamics, with the geometric velocity toward the target, which includes any anomalous content. Anomalies induce strong local disagreement between these velocities. Aggregating the mismatch over different time steps and multiple paths yields pixel-wise heatmaps and image-level scores without test-time optimization, feature memories, or additional calibration. Our analysis shows that the population mismatch decomposes into an irreducible denoising term and a Fisher-divergence term between the test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation and explains the effectiveness of robust path aggregation. Extensive experiments on MVTec-AD and VisA demonstrate superior performance compared with SOTA reconstruction-based and recent flow matching-based approaches.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2605.23070 [cs.CV]
(or
arXiv:2605.23070v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2605.23070
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
90. 【2605.23068】RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering
链接:https://arxiv.org/abs/2605.23068
作者:Chengyi Zhang,Zi Ye,Ziyang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Reliable visual understanding, minimally invasive surgery, clinicians pose language-like, pose language-like questions, degraded views caused
备注:
点击查看摘要
Abstract:Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefacts, and the presence of anatomical structures and surgical instruments, often under degraded views caused by occlusion, smoke, bleeding, and specular highlights. We present \textbf{RoboSurg-VQA}, a segmentation-aware visual question answering (VQA) benchmark built by repurposing public surgical segmentation datasets under a shared schema. Each frame is paired with a fixed set of clinically motivated questions spanning procedure context, anatomy (including region), imaging modality/view, surgical artefacts, image quality, and basic visibility and spatial attributes, with closed answer sets to enable consistent evaluation. To scale annotation, we generate candidate answers via constrained prompting with automatic validity and consistency checks, followed by human auditing to improve plausibility and label consistency. We report benchmark statistics, sanity baselines, and common evaluation challenges under challenging surgical conditions. The code will be available on this https URL.
91. 【2605.23065】Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering
链接:https://arxiv.org/abs/2605.23065
作者:Yury Belousov,Brian Pulfer,Vitaliy Kinakh,Slava Voloshynovskiy
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Vision foundation models, Vision foundation, frozen backbones, point of failure, downstream tasks
备注: Paper accepted at the IEEE International Conference on Image Processing (ICIP 2026)
点击查看摘要
Abstract:Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, retrieval, captioning, visual question answering), two model families (DINOv2, PaliGemma), and three attacks of increasing strength (PGD, MI-FGSM, SIA), as well as an adaptive attacker using a straight-through estimator. Our results show that Floyd-Steinberg dithering at intermediate quantization levels, especially when combined with post-processing blur, exceeds or matches all tested baselines, including diffusion-based denoising, with substantially less degradation on clean inputs.
92. 【2605.23064】Millimeter-wave Imaging for Anthropometric Body Measurement
链接:https://arxiv.org/abs/2605.23064
作者:Miriam Senne,Benjamin D. Killeen,Christoph Baur,Nassir Navab,Azade Farshad
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:manual tape measures, clinically informative biomarkers, including measures, tape measures, hip ratio
备注:
点击查看摘要
Abstract:Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist to hip ratio, limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enabling quick full-body acquisition. In this work, we present a new optimization-based framework to recover 3D human shape and extract a comprehensive set of anthropometric measurements from volumetric mmWave data. Our method introduces a weighted registration pipeline that fits a parametric body model (SMPL) directly to the noisy mmWave point cloud. The core of our contribution is a vertex-weighting strategy that modulates a Chamfer energy function for reliable surface alignment and noise elimination. We further stabilize the fit by incorporating a foot-ground plane constraint and pose priors, optimizing directly for the SMPL parameters. Together, these components enable a fast, privacy preserving workflow that delivers high fidelity body shape and measurements through clothing without cameras or disrobing and with minimal cooperation, supporting frequent risk oriented assessments in clinics and care facilities for patients of all ages and mobility levels.
93. 【2605.23045】he TIME Machine: On The Power of Motion for Efficient Perception
链接:https://arxiv.org/abs/2605.23045
作者:Mantas Skackauskas,Xinyue Hao,Laura Sevilla-Lara
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:video models, Video, recent years, tremendous progress, progress in recent
备注:
点击查看摘要
Abstract:Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware as well as more scalable.
94. 【2605.23028】RADAR: Relative Angular Divergence Across Representations
链接:https://arxiv.org/abs/2605.23028
作者:Xavier Cadet,Mateusz Nowak,Peter Chin
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Machine learning methods, learning methods rely, Machine learning, learning methods, methods rely
备注: 27 pages; 8 figures; 10 tables
点击查看摘要
Abstract:Machine learning methods rely on data. However, gathering suitable data can be challenging due to availability constraints, cost, or the need for domain expertise. Expanding datasets with additional sources is a common response to limited data, yet this practice does not always improve downstream performance and can sometimes lead to a loss of performance, known as negative transfer. We propose RADAR, a simple, geometrically grounded metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-wise evolution of representations by measuring angular alignments and relative changes in distance along layer-to-layer displacement trajectories, and by comparing empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric across multiple modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across several settings, RADAR provides competitive predictive performance relative to existing transferability metrics on several vision and text benchmarks, with particularly strong results when domain transitions are smooth or cleanly separated. Our ablations further suggest that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space, with different modalities favoring different topological formulations.
95. 【2605.22997】Scene Reconstruction as Mapping Priors for 3D Detection
链接:https://arxiv.org/abs/2605.22997
作者:Yang Fu,Yuliang Zou,Hao Xiang,Xin Huang,Yijing Bai,Chen Song,Weijing Shi,Govind Thattai,Dragomir Anguelov,Mingxing Tan,Yingwei Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving, critical for motion, motion planning, planning but remains, remains an under-utilized
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks such as 3D object detection. Maps can provide robust structural priors of the static environment, helping resolve ambiguities and correct for sensor data sparsity or noise, especially for distant objects or under adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for efficient, large-scale deployment. In this paper, we propose a scalable solution to systematically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Priors Augmented 3D Detection (MPA3D) framework to effectively integrate mapping priors with different sensor modalities. Extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of scalable reconstructed scene priors for enhancing 3D detection.
96. 【2605.22996】CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
链接:https://arxiv.org/abs/2605.22996
作者:Adil Meric,Lin Geng Foo,Mert Kiray,Benjamin Busam,Rishabh Dabral,Christian Theobalt
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generates realistic interactive, realistic interactive dynamics, Multi Modal Diffusion, single binary mask, Modal Diffusion Transformer
备注:
点击查看摘要
Abstract:We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for the motion generation. Therefore, we propose a novel way to determine "Motion Layers" operating in the attention space of MMDiT. We fine-tune the model by using Low-Rank Adaptation (LoRA) to the Motion Layers, without requiring any architecture change in the MMDiT. This selective adaptation enables our method to focus on motion-critical components, yielding reduced computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism. Project page: this http URL.
97. 【2605.22962】GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction
链接:https://arxiv.org/abs/2605.22962
作者:Iba Baig,Kevin Li,Yanbin Xu,Seiji Cattelain,Marie Hallo,Hayato Ono,Sho Tsuji,Ming Bo Cai
类目:Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE); Neurons and Cognition (q-bio.NC)
关键词:child-caregiver interactions enable, interactions enable investigation, child-caregiver interactions, interactions enable, enable investigation
备注: submitted to IEEE International Conference on Development and Learning (ICDL), 2026
点击查看摘要
Abstract:Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annotation of gaze target categories, and categorization of participants' poses and hand actions. This toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. Such improvement is critical in supporting large-scale and longitudinal investigations of attentional dynamics and naturalistic behavior in human early development.
98. 【2605.22942】Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection
链接:https://arxiv.org/abs/2605.22942
作者:Borja Carrillo-Perez(Arquimea Research Center)
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:DETR-based fusion transformer, fusion transformer baseline, data association challenge, challenge baseline decoder, report presents
备注: 5 pages, 3 figures. Technical report for the MaCVi 2026 Vision-to-Chart Data Association Challenge at the CVPR 2026 Workshop; 2nd place submission. Code: [this https URL](https://github.com/bcarrpe/macvi26-visionmap-querymlp)
点击查看摘要
Abstract:This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels. Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy's waterline contact point in the image from chart measurements and IMU orientation data. The predicted pixel coordinates are appended to the baseline decoder query vector, providing a direct spatial prior per buoy and reducing the geometric reasoning burden on the transformer decoder. On the challenge leaderboard, the presented approach achieves an Overall score of 0.7386, with F1 = 0.8055 and mIoU = 0.6718, on the held-out test set, placing second among all submissions.
99. 【2605.22907】VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
链接:https://arxiv.org/abs/2605.22907
作者:Haichen He,Jiayi Zhou,Sifeng Shang,Yihan Hu,Yuanhan Zhang,Kaiyang Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Real-world long video, massive temporal spans, Real-world long, perform continuous tracking, long video understanding
备注:
点击查看摘要
Abstract:Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load constitutes the fundamental bottleneck in long video understanding. While existing benchmarks have driven progress by scaling up video duration, their evaluation tasks often require comprehending only short and isolated video segments, falling short of capturing the challenge of ultra-long-context reasoning. To measure this cognitive load, we emphasize continuous certificate length, defined as the video length a human must continuously watch to definitively answer a given question. Driven by this metric, we introduce VideoOdyssey, a benchmark specifically designed for ultra-long-context and omni-modal video understanding. VideoOdyssey is characterized by three key features: 1) Extreme video duration and diversity: spanning 11 domains and 54 subcategories with an average video duration of 109 minutes; 2) Comprehensive evaluation scenarios: offering two subsets to address different research focuses, i.e., VideoOdyssey-V for probing the limits of visual understanding in MLLMs, and VideoOdyssey-AV for evaluating synchronized audio-visual understanding for omni-modal models; 3) Ultra-long and multi-level continuous certificates: extending the average continuous certificate to 16 minutes for VideoOdyssey-V and 12.8 minutes for VideoOdyssey-AV. Crucially, we design 5 granular levels from seconds to hours, providing a comprehensive diagnostic tool to evaluate models across varying context lengths and cognitive loads. Extensive evaluations show that bottlenecks of current MLLMs extend beyond simple retrieval to include struggles with continuous reasoning across varying context lengths, fine-grained perception, and non-verbal omni-modal understanding.
100. 【2605.22904】Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations
链接:https://arxiv.org/abs/2605.22904
作者:Safwen Naimi,Wassim Bouachir,Guillaume-Alexandre Bilodeau,Brian Mishara
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enable timely intervention, suicide prevention efforts, supporting suicide prevention, monitoring human behavior, Suicide Risk Assessment
备注: 9 pages, 6 figures, 1 table. Accpted for Publication in IJCAI 26
点击查看摘要
Abstract:Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.
101. 【2605.22903】Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
链接:https://arxiv.org/abs/2605.22903
作者:Zixuan Lan,Luzhe Sun,Matthew R. Walter,Jiawei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:reflect grounded visual, grounded visual understanding, reflect grounded, reflect reliance, implicitly assumed
备注: Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. accepted version
点击查看摘要
Abstract:Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, spanning global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence that standard accuracy should have suggested. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. We further complement a representation-level analysis, which shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.
102. 【2605.22890】Extending Deep Event Visual Odometry with Sparse Point-Cloud Export
链接:https://arxiv.org/abs/2605.22890
作者:Alireza Safdari,Sajad Ashraf
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:challenging lighting conditions, lighting conditions due, Deep Event Visual, Event Visual Odometry, high temporal resolution
备注: 9 Pages, 4 figures, 5 tabel
点击查看摘要
Abstract:Event cameras are well suited for visual odometry under high-speed motion and challenging lighting conditions due to their low latency, high temporal resolution, and high dynamic range. Deep Event Visual Odometry (DEVO) demonstrated that monocular event-only odometry can achieve strong performance by combining sparse patch tracking, learned patch selection, recurrent correspondence refinement, and differentiable bundle adjustment. In this project, we extend DEVO with a sparse point-cloud export pipeline. Rather than modifying the core odometry formulation, our approach exposes the internal 3D structure already estimated by DEVO and converts it into an explicit point-cloud representation for visualization and further processing. In addition, we implement a practical workflow for data export, format conversion, and point-cloud cleanup. The resulting system preserves the original visual odometry pipeline while enabling sparse geometric scene output. Experiments on the BOARD SLOW sequence show that the exported sparse cloud is locally consistent with EMVS reconstructions, achieving high precision at a 5 cm threshold, while also highlighting the expected limitations in density, completeness, and sensitivity to accumulated odometry noise.
103. 【2605.22882】GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
链接:https://arxiv.org/abs/2605.22882
作者:Kaichen Zhou,Yuzhen Chen,Fangneng Zhan,Hang Hua,Grace Chen,Xinhai Chang,Ao Qu,Yilun Du,Zhuang Liu,Paul Pu Liang,Mengyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:preserve consistent point-level, consistent point-level motion, generate realistic futures, single instruction, motion over time
备注: Robotic World Model, Video Generative Model
点击查看摘要
Abstract:Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at the project page: this https URL.
104. 【2605.22872】MedExpMem: Adapting Experience Memory for Differential Diagnosis
链接:https://arxiv.org/abs/2605.22872
作者:Qianhan Feng,Zhongzhen Huang,Yakun Zhu,Yannian Gu,Winnie Chiu Wing Chu,Xiaofan Zhang,Qi Dou
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:differentiate confusable conditions, Experienced physicians develop, Experienced physicians, confusable conditions, ability to differentiate
备注: MICCAI 2026 Early Accept. Submission Version
点击查看摘要
Abstract:Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes them as pairwise differential notes encoding key discriminators, actionable decision rules and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements, maximum 7.0%, across diverse models and scales. Analytical experiments validate experience quality and robustness, demonstrating MedExpMem as a competitive method addresses medical adaptation needs beyond the reach of parameteric learning.
105. 【2605.23323】Efficient Learned Image Compression without Entropy Coding
链接:https://arxiv.org/abs/2605.23323
作者:Hao Cao,Wenqi Guo,Zhijin Qin,Jungong Han
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:learned image compression, Free Learned Image, learned image, Entropy coding, typical learned image
备注: Accepted by ICML 2026
点击查看摘要
Abstract:Entropy coding is widely used in typical learned image compression (LIC) that converts latents into a compact bitstream. However, entropy coding is typically sequential and becomes the coding latency bottleneck. To overcome it, we present Entropy-Coding Free Learned Image Compression (EF-LIC), a multi-rate framework that generates compact representation by removing statistical and correlation redundancy with low coding latency. First, we introduce unconstrained vector quantization and prove that its index distribution approaches the maximum-entropy bound, yielding minimal statistical redundancy. Second, we propose a context-conditioned autoregressive transform that directly reparameterizes the latents to reduce inter-dependency. Theoretical analysis shows that EF-LIC can remove correlation redundancy as effectively as typical LIC with entropy coding, leading to comparable compression performance. Experiments show EF-LIC achieves up to 67.86% bitrate reduction over MS-ILLM on Kodak with LPIPS. Ablation studies further show EF-LIC matches the compression performance of its entropy-coding based variant while achieving over $3\times$ faster encoding and $5\times$ faster decoding.
106. 【2605.23282】Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring
链接:https://arxiv.org/abs/2605.23282
作者:Shaoqing Duan,Haofei Song,Xintian Mao,Qingli Li,Yan Wang
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Discontinuous Galerkin Neural, Galerkin Neural Operator, pathological microscopy remains, microscopy remains challenging, remains challenging due
备注: 17 pages, 9 figures. Accepted by ICML 2026
点击查看摘要
Abstract:Defocus deblurring in pathological microscopy remains challenging due to the spatially varying and locally discontinuous nature of optical blur induced by a position-dependent integral imaging process. Existing deep learning methods, constrained by shift-invariance assumptions and limited interpretability, are not well suited to such heterogeneous blur patterns. Neural operators provide a principled alternative by modeling defocus formation directly as an integral operator, offering a new perspective on defocus deblurring. However, most existing neural operator architectures for low-level vision rely on globally parameterized kernels that assume smoothness and stationarity, limiting their ability to model heterogeneous and locally discontinuous blur patterns. To address this limitation, we propose the Discontinuous Galerkin Neural Operator (DGNO), which parameterizes the integral kernel using a discontinuous Galerkin formulation with element-local volume operators and interface numerical fluxes. DGNO provides a principled combination of locality, heterogeneity modeling, and global coherence while preserving the underlying physics of optical image formation. Extensive and insightful experiments demonstrate that DGNO surpasses state-of-the-arts, delivering sharper reconstructions, robust handling of spatially varying blur, and scalable high-resolution performance. The code will be released at this https URL.
Comments:
17 pages, 9 figures. Accepted by ICML 2026
Subjects:
Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2605.23282 [eess.IV]
(or
arXiv:2605.23282v1 [eess.IV] for this version)
https://doi.org/10.48550/arXiv.2605.23282
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Shaoqing Duan [view email] [v1]
Fri, 22 May 2026 06:50:26 UTC (4,679 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring, by Shaoqing Duan and 3 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context:
eess.IV
prev
|
next
new
|
recent
| 2026-05
Change to browse by:
cs
cs.CV
cs.LG
eess
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked="checked"class=“labs-tab-input”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
107. 【2605.23183】GMENet: Generative Mixture of Experts Network for Multi-Center Glioma Diagnosis with Incomplete Imaging Sequences
链接:https://arxiv.org/abs/2605.23183
作者:Pengfei Song,Fangjin Liu,Wenwen Zeng,Yonghuang Wu,Chengqian Zhao,Feiyu Yin,Xuan Xie,Jinhua Yu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Contemporary glioma diagnosis, guide clinical decision-making, diagnosis integrates molecular, Contemporary glioma, integrates molecular features
备注: IJCAI Accept
点击查看摘要
Abstract:Contemporary glioma diagnosis integrates molecular features with histopathology to guide clinical decision-making. However, in clinical settings, divergent imaging protocols result in incomplete MRI sequences, leading to two primary challenges: forcing existing frameworks to discard a large portion of clinical data during training and consequently limiting their clinical applicability. To address these limitations, we propose GMENet, a Generative Mixture of Experts Network for multi-center glioma diagnosis with incomplete imaging sequences. Firstly, we design a Cross-attention-based Gated Generation Module that synthesizes missing sequence features from available sequences via cross-attention and dynamic gating mechanisms, incorporating a cycle-consistency loss to preserve semantic integrity. Secondly, we introduce a Dynamically Weighted Experts Fusion Module that performs mixture-of-experts interaction and confidence-aware fusion over original and synthesized dual-sequence features for multi-task prediction. We evaluate GMENet on a multi-center cohort of 1,241 subjects from four in-house datasets and two public repositories. Experiments show that GMENet expands clinically usable training data by 97\%, relative to complete-sequence-only data. Furthermore, it consistently outperforms state-of-the-art methods trained on complete data, demonstrating improved robustness under cross-center distribution shifts.
108. 【2605.23137】STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding
链接:https://arxiv.org/abs/2605.23137
作者:Jiahe Meng,Weiming Zeng,Yueyang Li,Bo Chai,Hongjie Yan,Zhiguo Zhang,Wai Ting Siok,Nizhuan Wang
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:visual decoding remains, highly structured vision, decoding remains challenging, remains challenging due, making direct cross-modal
备注:
点击查看摘要
Abstract:Electroencephalography (EEG) visual decoding remains challenging due to the modality gap between low-SNR neural signals and highly structured vision--language spaces, making direct cross-modal alignment unstable. To address this, we propose STAMBRIDGE, a versatile two-stage framework that sequentially tackles feature conditioning and cross-modal alignment. First, we introduce a Spectral-Temporal Amplitude-aware Modulation (STAM) to extract well-conditioned EEG representations. By replacing hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions, STAM explicitly preserves frequency-aware transients while reducing the risk of time-domain ringing artifacts. Building upon these robust neural features, we further introduce a model-agnostic Mid-Feature Semantic Bridge (MFSB) that constructs a regularized intermediate space through directed cross-modal interactions, enabling staged distillation and more stable semantic alignment. Experiments on the THINGS-EEG benchmark show competitive 200-way zero-shot retrieval performance, with 34.50\% Top-1 and 65.95\% Top-5 accuracy. In addition, embeddings learned by STAMBRIDGE produce semantically coherent image reconstructions with a diffusion model, demonstrating robust EEG-to-vision semantic alignment. The code is available at: this https URL.
109. 【2605.23094】Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025
链接:https://arxiv.org/abs/2605.23094
作者:José Rafael Noriega Cedeño
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:small medical-image datasets, downstream task performance, improve downstream task, medical-image datasets, task performance
备注: 18 pages, 16 figures
点击查看摘要
Abstract:Generative augmentation is often proposed as a remedy for small medical-image datasets, but synthetic images are only useful when they improve downstream task performance. "Augmentation" here means synthetic supplementation: GAN-generated samples added to the real training pool, not geometric or photometric transforms of existing images. Twelve class-plane StyleGAN2-ADA generators were trained on constrained BRISC 2025 partitions to test whether their output, with or without InceptionV3 feature-space filtering, improves held-out tumour classification across three classifier families: a random forest (RF) on InceptionV3 features, a compact two-headed convolutional neural network (CNN), and MobileViTV2, a mobile hybrid convolutional-transformer. Each was evaluated at 1:1 and 1:2 real-to-synthetic ratios. An independent GPT-5.5 blind test placed gated real-versus-synthetic discrimination at 57.73% (95% CI: 54.48--60.92%) on the model-legible subset -- modestly above chance. The RF classifier did not benefit from the synthetic MRIs. The CNN showed consistent mean gains that did not survive Holm correction. MobileViTV2 showed the clearest benefit: filtered 1:1 augmentation improved tumour classification accuracy by 1.02% absolute (95% CI: 0.54--1.54%; Holm-corrected p = 0.0104). A secondary efficiency analysis found that every augmented CNN condition selected its checkpoint 42--64% earlier than baseline, while compute-matched MobileViTV2 runs reached selection after 50--67% fewer real-data epochs. Overall, augmentation utility was found to be architecture- and ratio-dependent, not guaranteed by visual fidelity alone.

