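This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
690 papers were updated today, including:
- Natural language processing: 97
- Information retrieval: 22
- Computer vision: 125
Natural Language Processing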
1. 【2606.27377】DanceOPD: On-Policy Generative Field Distillation
Link: https://arxiv.org/abs/2606.27377
Authors: Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Modern image generation, Modern image, unifies diverse capabilities, demands a single, unifies diverse
Comments: Technical Report; 39 pages, 13 figures, 9 tables; project page at https://danceopd.github.io/
Abstract:Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
2. 【2606.27347】Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline
Link: https://arxiv.org/abs/2606.27347
Authors: Kirill Solovev, Jana Lasser
Categories: Computation and Language (cs.CL)
Keywords: capture public resources, political elites organise, comparative politics, elites organise, organise into rent-seeking
Comments: 34 pages, 17 figures
Abstract:Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and adversarial ties at scale has historically required intensive manual coding, while automated text-as-data methods have largely been limited to simple co-occurrence. Recent large language model (LLM) approaches offer a path forward but often rely on proprietary APIs, lack cross-lingual capability, and struggle with scalable entity resolution. We present a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. It combines span-based named-entity recognition (NER) with a three-stage linking cascade mapping mentions to language-independent Wikidata identifiers; a high-throughput, ontology-constrained mixture-of-experts model then uses guided decoding to extract directed, signed relationships grounded in a domain ontology. A full-coverage spot-check against a 3491-relation gold standard shows high textual correctness (68.2% strict to 93.7% lenient). Two large-scale case studies validate the pipeline against the public record. In Austria, it reconstructs a political party's complete lifecycle, dating internal fractures and tracking personnel into successor factions and court convictions. In a Polish corpus, it uncovers the overlapping economic and governance networks of state-enterprise patronage, alongside the structurally balanced, signed conflict network of the polarized Civic Platform (Platforma Obywatelska, PO)--Law and Justice (Prawo i Sprawiedliwość, PiS) duopoly. By bridging raw multilingual text and structured relational data, our framework provides a robust, replicable foundation for cross-national empirical computational social science.
3. 【2606.27330】Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
Link: https://arxiv.org/abs/2606.27330
Authors: Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Multimodal web agents, operating repetitive GUI, repetitive GUI tasks, Multimodal web, repetitive GUI
Comments: Accepted to ACL 2026 Main
Abstract:Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and privacy preserving compared with commercial large models, they suffer from weak planning and limited cross website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle and high levels. Our analysis reveals that mastering low level atomic skills does not guarantee high level planning competence, while high level task training yields stronger OOD generalization. Experiments on real world benchmarks demonstrate PEEU's superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high level tasks and leveraging experiences are crucial for the OOD planning abilities of small MLLMs.
4. 【2606.27316】LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank
Link: https://arxiv.org/abs/2606.27316
Authors: Serhii Hamotskyi, Akash Kumar Gautam, Christian Hänig
Categories: Computation and Language (cs.CL)
Keywords: German Central Bank, Central Bank, German Central, Named Entity Recognition, securities as collateral
Comments:
Abstract:Verifying the eligibility of securities as collateral is a key responsibility of the German Central Bank. However, manually verifying these assets against legal and financial criteria within lengthy, semi-structured, and often bilingual prospectuses is a resource-intensive task. While previous efforts utilized traditional Named Entity Recognition (NER) for information extraction, these methods can struggle with OCR noise, linguistic variance, and rigid span-based constraints, and the need for manually annotated training data for each relevant annotation type. In this paper, we present the first case study applying Large Language Models (LLMs) to the eligibility examination process, shifting the paradigm toward a generative Information Extraction pipeline. Our approach decomposes the task into extraction, normalization, and interpretation, allowing for greater flexibility in handling noisy text and interleaved German-English content. We further introduce a value-based evaluation methodology using LLM-as-a-judge, which offers a more semantic assessment than location-based metrics. Our results demonstrate that LLM-based systems achieve high precision (up to 91%) in document-level eligibility, exhibiting a conservative operating profile that minimizes false acceptance.
5. 【2606.27314】Beyond Surface Forms: A Comprehensive, Mechanism-Oriented Taxonomy of Indirect Linguistic Encoding for LLM-Based Coded Language Detection
Link: https://arxiv.org/abs/2606.27314
Authors: Hamid Reza Firoozfar, Mohammadsadegh Abolhasani, Reza Mousavi, Paul Jen-Hwa Hu
Categories: Computation and Language (cs.CL)
Keywords: users routinely invent, routinely invent indirect, invent indirect linguistic, indirect linguistic expressions, camouflage sensitive meanings
Comments: Submitted for review in ARR for EMNLP 2026
Abstract:To avoid moderation and surveillance on social media, some users routinely invent indirect linguistic expressions (ILE) that camouflage sensitive meanings. Such expressions surface as algospeak, euphemisms, and adversarial obfuscation, depending on intent and context, and they involve recurring encoding mechanisms. We propose a comprehensive, mechanism-oriented taxonomy of ILE that abstracts away from communicative goals and instead categorizes the underlying operations through which meaning is encoded and recovered. We evaluate the taxonomy by incorporating it into LLM prompts and comparing it with four existing taxonomies and a no-taxonomy baseline, using 2,000 manually annotated TikTok and Bluesky posts. The proposed taxonomy attains the strongest document- and span-level performance across the three LLMs, achieving an improvement of 4.7% in accuracy and 5.4% in F1 over the best-performing benchmark. The empirical results reveal the importance of a comprehensive, mechanism-oriented taxonomy as a stable scaffold for detecting emerging coded language and a useful input to content moderation. Disclaimer: This paper contains content that may be profane, vulgar, or offensive.
6. 【2606.27306】Multilingual Reasoning Cascades Need More Context
Link: https://arxiv.org/abs/2606.27306
Authors: Arnav Mazumder, Dengjia Zhang, Shuyue Stella Li, Yulia Tsvetkov, Niyati Bafna
Categories: Computation and Language (cs.CL)
Keywords: translate the query, translate the answer, answer back, reasoning translate, Translation cascades
Comments:
Abstract:Translation cascades for reasoning translate the query from another language to English, reason in English, and translate the answer back to the original language. This is a competitive approach to multilingual reasoning, but structurally lossy, since each stage discards information later stages may need, including cues for cultural grounding, register, and disambiguation. We examine the benefits of a simple and training-free intervention: a context-aware translation cascade, which additionally provides the original question, the English translated question, and the reasoning trace to the context of the final translation module. We evaluate gains across nine multilingual benchmarks including various task types, three backbone models, and 285 high-, mid-, and low-resource languages, and demonstrate strong gains for open-ended generation across models and resource regimes. We show that the original language question carries most of the beneficial context. Our study emphasizes the need to better design information flow in machine translation cascades for mitigating error propagation, and provides a simple and actionable default strategy: preserve the original user question until the end of the pipeline.
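The information flow described in this abstract can be pictured with a minimal sketch. Everything named here is hypothetical: `context_aware_cascade`, the three callables, and the toy stand-ins are illustrative assumptions, not the authors' code or any real translation API. The only point shown is that the final translation step receives the original question, the English question, and the reasoning trace as extra context.

```python
from typing import Callable, Dict, Tuple


def context_aware_cascade(
    question: str,
    lang: str,
    translate_to_english: Callable[[str, str], str],
    reason_in_english: Callable[[str], Tuple[str, str]],
    translate_back: Callable[[str, str, Dict[str, str]], str],
) -> str:
    """Run a translate -> reason -> translate-back cascade (hypothetical sketch).

    Unlike a plain cascade, the last stage is handed the context that
    earlier stages would otherwise discard.
    """
    english_question = translate_to_english(question, lang)
    trace, english_answer = reason_in_english(english_question)
    context = {
        "original_question": question,          # preserved until the end
        "english_question": english_question,
        "reasoning_trace": trace,
    }
    return translate_back(english_answer, lang, context)


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_translate_to_english(text: str, lang: str) -> str:
    return f"[en] {text}"


def toy_reason(english_question: str) -> Tuple[str, str]:
    return ("toy reasoning trace", "toy answer")


def toy_translate_back(answer: str, lang: str, context: Dict[str, str]) -> str:
    return f"[{lang}] {answer} (context keys: {sorted(context)})"
```

In a real system each callable would wrap a model call; the toy versions only make the data flow visible.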
7. 【2606.27275】How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
Link: https://arxiv.org/abs/2606.27275
Authors: Maria Levchenko
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Keywords: remains poorly understood, Large language models, Large language, poorly understood, digital library workflows
Comments: The 22nd Conference on Information and Research Science Connecting to Digital and Library Science
Abstract:Large language models (LLMs) are increasingly critical to digital library workflows, yet their ability to process historical language remains poorly understood. Historical difficulty is typically treated as a monolithic barrier, conflating orthographic variation, linguistic distance, and pretraining exposure. In this paper, we propose a diagnostic framework that decomposes this difficulty into four distinct dimensions: tokenization cost, predictive uncertainty (surprisal), semantic robustness, and context sensitivity. We evaluate this framework on three datasets spanning three centuries: (1) a newly curated corpus of 17th-century Italian texts (1610-1689) digitized from original page images; (2) canonical 19th-century Italian "I Promessi Sposi" serving as a high-exposure control; and (3) 18th-century Russian civil print books as a contrastive orthographic stress test. Our results reveal a distinct dissociation between encoding cost and comprehension. While Russian and early modern Italian incur comparable tokenization penalties (25-30% inflation), their predictive difficulty diverges sharply. 17th-century Italian is on average 2.4 times more surprising than its modern equivalent - with academic prose reaching 3.2 times - whereas Russian shows only a modest increase. But predictive uncertainty does not imply representational degradation: embedding similarity remains robust ( 0.85) across all datasets, confirming that models can represent historical meaning even when generation is unstable. Finally, we demonstrate that a minimal temporal context prompt reduces historical surprisal by approximately 60%, offering a simple, model-agnostic mitigation. These findings suggest that while historical text imposes a consistent encoding tax, digital libraries can safely deploy LLMs for semantic retrieval tasks, provided that generative applications are carefully adapted.
8. 【2606.27242】The Geometry of Updates: Fisher Alignment at Vocabulary Scale
Link: https://arxiv.org/abs/2606.27242
Authors: John Sweeney
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Keywords: shared vocabularies arises, scientific string domains, candidate corpora share, Training-free source selection, genomic sequences
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026), PMLR 306. 64 pages total (main paper plus appendix), 4 figures, 29 tables
Abstract:Training-free source selection for LLM families with shared vocabularies arises in scientific string domains such as SMILES, protein, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction targets. This creates an activation-dark regime: representation-similarity metrics can be uninformative without assumptions about label-conditioned error geometry, while classical update-geometry metrics are computationally prohibitive at vocabulary scale. We show that, in a shared-output head setting, representation metrics (e.g., CKA) are non-identifiable for transfer; models can share identical representations yet have orthogonal head updates. The key identity is that head Fisher alignment is exactly a cosine between kernel mean embeddings in the joint activation-error space, exposing activation, error, and coupling factors rather than requiring a materialized Fisher matrix. FisherSketch estimates this cosine directly in a single streaming pass, making K=128,256 head Fisher alignment practical with a 16 KB task signature (m=4096) and a 192 KB per-task streaming state, small enough to store next to a model hash, but encoding transfer-relevant update structure. Beyond source selection, the same signatures and marginals provide a diagnostic instrument for studying whether LLM task similarity is driven by activations, errors, or their coupling; shared-parameter and internal-layer validations, together with Llama-3.1-8B verbalizer-shift experiments, show that FisherSketch remains informative when activation similarity cannot distinguish tasks.
9. 【2606.27237】LMs as Task-Specific Knowledge Bases: An Interpretability Analysis
Link: https://arxiv.org/abs/2606.27237
Authors: Amit Elhelo, Amir Globerson, Mor Geva
Categories: Computation and Language (cs.CL)
Keywords: capture large amounts, Language models, capture large, motivating the view, large amounts
Comments:
Abstract:Language models (LMs) capture large amounts of factual knowledge applicable to a wide range of tasks, motivating the view of their parameters as a knowledge base. An important property of knowledge bases is that different queries for the same fact return consistent results, drawing on a single source of truth. We investigate whether LMs satisfy this property through behavioral and mechanistic analyses. Our results suggest that they encode knowledge in a task-specific manner. Behaviorally, facts acquired on one task frequently fail to co-emerge on others during training. Parameter localization experiments suggest a mechanistic explanation, revealing distinct parameter subsets underlying different tasks for the same fact. Finally, we show that chain-of-thought reasoning draws part of its effectiveness from engaging task-specific parameters beyond those tied to the evaluation task. Our findings suggest that what the model knows and how it is asked are intertwined in parameter space, undermining the "knowledge base" analogy and carrying implications for the reliability and controllability of factual knowledge in LMs.
10. 【2606.27233】Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts
Link: https://arxiv.org/abs/2606.27233
Authors: Zhengyuan Liu, Stella Xin Yin, Min-Yen Kan, Nancy F. Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: collaborative problem-solving contexts, problem-solving contexts, present a conceptual, analyzing dialogue, emerging dynamics
Comments:
Abstract:We present a conceptual framework for analyzing dialogue in collaborative problem-solving contexts, with an emphasis on the emerging dynamics of human-AI and multi-agent collaboration. As intelligent systems become active agents capable of autonomous reasoning and strategic cooperation, understanding the dialogic interaction during collaborative problem solving is increasingly important for optimizing and evaluating such partnerships. Our framework addresses key limitations in current analytical approaches through a hierarchical two-layer coding scheme that integrates cognitive and non-cognitive problem solving with metacognitive regulatory mechanisms. We demonstrate its effectiveness and generalizability across nine datasets spanning multiple domains, and provide insights into how humans and agents coordinate their knowledge, skills, and efforts to solve complex problems, showing in particular that metacognitive regulation can be an essential discriminator of deeper collaboration.
11. 【2606.27229】CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention
Link: https://arxiv.org/abs/2606.27229
Authors: Sayak Dutta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Keywords: order to remember, models must forget, forget in order, Recurrent models, Recurrent
Comments: 27 pages, 2 figures, multiple tables. Submitted to arXiv. Primary category: cs.LG; cross-list: cs.CL
Abstract:Recurrent models must forget in order to remember, yet the state of the art decides what to erase without consulting what is stored -- the gate sees only the arriving token, not the memory it is about to modify. This memory-blind gating is one of three coupled defects in the leading delta-rule architecture (GDN-2): the value-axis erase mask wastes parameters at the scale of the value projection, and -- as we prove -- mathematically prevents the WY-form triangular chunk solver that makes recurrent training competitive with Transformers. We introduce CARVE (Content-Aware Recurrent with Value Efficiency), which resolves all three problems through one principle: erase only on the key axis. This is provably necessary and sufficient for the WY-form solver to remain valid. Within it, CARVE reuses the recurrent output tensor -- already written to GPU memory -- as a free content signal for the erase gate, and replaces the per-value write-gate projection with a single scalar per head. At initialisation CARVE is bit-identical to GDN-2; any quality difference emerges from what the content gate learns. At 1.3B parameters trained on 100B tokens, CARVE achieves WikiText perplexity 15.72 (minus 0.18 vs. GDN-2, a 4.5-sigma effect), leads every recurrent baseline on nine common-sense reasoning benchmarks, and sets state of the art on every RULER retrieval probe -- at 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems cover memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality.
12. 【2606.27228】Compositionality and the lexicon in evolutionary semantics
Link: https://arxiv.org/abs/2606.27228
Authors: Fausto Carcassi
Categories: Computation and Language (cs.CL)
Keywords: interpretable lexical parts, recursively composing lexical, sentence meanings arise, fixed signal structures, Formal semantics
Comments:
Abstract:Formal semantics has shown that sentence meanings arise by recursively composing lexical meanings, yet much of the literature on semantic universals models either lexicons with fixed signal structures or holistic composition without interpretable lexical parts. We introduce a framework that integrates this fundamental insight of formal semantics in evolutionary modeling, by allowing lexical meanings and a composition function to co-evolve under pressures for conceptual simplicity and communicative accuracy. We apply this framework to the evolution of quantificational meaning. Analyzing the Pareto frontier, we find that the most well-known semantic universal, conservativity, emerges as an efficient system-wide abstraction. The account is sensitive to syntactic structure and helps reconcile tensions between empirical evidence on quantifier learnability and prior evolutionary models. More broadly, the results demonstrate that the picture of sentential meaning developed in formal semantics can be productively combined with evolutionary modeling. The framework offers a template for studying universals that involve global compression within a grammatical category, semantic specialization of syntactic arguments, and the co-evolution of lexical and compositional meaning.
13. 【2606.27226】Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement
Link: https://arxiv.org/abs/2606.27226
Authors: Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, Sambit Sahu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Evaluating LLM outputs, lexical metrics correlate, metrics correlate poorly, bottleneck in NLP, produce opaque scores
Comments: Accepted to the Second Workshop on Compositional Learning at ICML 2026, Seoul, South Korea
Abstract:Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BINEVAL better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BINEVAL provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.
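As a rough illustration of the decomposition idea, the sketch below aggregates yes/no verdicts into per-dimension scores. The abstract does not specify how BINEVAL aggregates or calibrates its verdicts, so the plain averaging used here, the function name `aggregate_binary_verdicts`, and the example dimensions are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_binary_verdicts(
    verdicts: Iterable[Tuple[str, str, bool]],
) -> Tuple[Dict[str, float], float]:
    """Turn (dimension, question, passed) verdicts into scores.

    Assumption: each dimension score is the fraction of questions
    answered "yes", and the overall score is the unweighted mean of
    the dimension scores. The paper's actual scheme may differ.
    """
    by_dimension: Dict[str, List[float]] = defaultdict(list)
    for dimension, _question, passed in verdicts:
        by_dimension[dimension].append(1.0 if passed else 0.0)

    scores = {dim: sum(vals) / len(vals) for dim, vals in by_dimension.items()}
    overall = sum(scores.values()) / len(scores) if scores else 0.0
    return scores, overall
```

Because every question keeps its own verdict, a low dimension score can be traced back to the specific questions that failed, which is the inspectability the abstract emphasizes.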
14. 【2606.27210】Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes
Link: https://arxiv.org/abs/2606.27210
Authors: Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, Yftah Ziser
Categories: Computation and Language (cs.CL)
Keywords: final label, AIMS, safety, intent, safety classifiers
Comments:
Abstract:We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and harm label. We use AIMS to evaluate intent-aware training across supervised fine-tuning, preference learning, reasoning distillation, and reinforcement learning. Despite its size, AIMS enables competitive safety classifiers across training regimes: DPO from model-generated intent errors improves over SFT, and intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs. Most notably, directly rewarding intent faithfulness with GRPO yields the strongest average performance across five external safety benchmarks, while our intent-aware models form the inference latency-F1 Pareto frontier. These results show that faithful intent modeling is a compact, high-quality supervision signal for more robust safety classifiers.
15. 【2606.27206】Syntactic Belief Update as the Driver of Garden Path Processing Difficulty
Link: https://arxiv.org/abs/2606.27206
Authors: Alan Zhou, Miloš Stanojević, John T. Hale
Categories: Computation and Language (cs.CL)
Keywords: Garden path sentences, sentence prefix leads, path sentences present, Garden path, sentence processing difficulty
Comments:
Abstract:Garden path sentences present a processing difficulty for humans -- the sentence prefix leads the listener towards one interpretation, until the listener hears a critical word that shows that the initial interpretation was wrong. Lexical surprisal, a measure that usually predicts sentence processing difficulty quite well, fails to provide good predictions for garden path sentences. We propose an alternative that actively predicts a probability distribution over syntactic trees (its syntactic belief) and updates that distribution after each new word. If a processor is led down a garden path, syntactic beliefs will be wrong and will require a large update at the critical word. The magnitude of the update is measured with a generalized Rényi divergence. Crucially, this metric is dependent on lexical items, but is fully independent of the probability of lexical items. This Syntactic Belief Update provides a better fit to the human reading time data on garden path sentences. This suggests a new research direction examining purely non-lexical alternatives to surprisal for psycholinguistics.
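For readers unfamiliar with the measure, the standard Rényi divergence of order $\alpha$ between two distributions over syntactic trees $t$ is given below. The abstract refers to a generalized form whose exact definition it does not state, so this is only the textbook version; reading $P$ as the belief after the new word and $Q$ as the belief before it is likewise an assumption.

```latex
D_{\alpha}(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1}
  \log \sum_{t} P(t)^{\alpha}\, Q(t)^{1-\alpha},
  \qquad \alpha > 0,\ \alpha \neq 1,
\qquad
\lim_{\alpha \to 1} D_{\alpha}(P \,\|\, Q)
  \;=\; \sum_{t} P(t) \log \frac{P(t)}{Q(t)} .
```

The limit $\alpha \to 1$ recovers the Kullback-Leibler divergence, so a large value at the critical word corresponds to a large revision of the syntactic belief.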
16. 【2606.27199】Forecasting With LLMs: Improved Generalization Through Feature Steering
Link: https://arxiv.org/abs/2606.27199
Authors: Humzah Merchant, Bradford Levy
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Successful forecasting involves, involves identifying patterns, forecasting involves identifying, Successful forecasting, future observations
Comments:
Abstract:Successful forecasting involves identifying patterns between historical and future states of the world which generalize to future observations. We apply LLMs to a variety of forecasting tasks and inspect their internal states using sparse autoencoders to understand whether they appear to rely on time-specific pieces of knowledge versus generalizable patterns. Our analyses identify features associated with both time-aware reasoning and look-ahead-biased reasoning. We then apply the LLMs to an entirely different domain and intervene on these features. We find that amplifying time-awareness features substantially reduces look-ahead bias on forecasting prompts while preserving general reasoning performance. In contrast, steering the candidate look-ahead-bias features does not produce an effect. These results suggest that interpretable temporal features can be used to causally shift LLMs toward more historically grounded reasoning.
17. 【2606.27187】HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
Link: https://arxiv.org/abs/2606.27187
Authors: Jiajun Wu, Haoyu Kang, Yining Sun, Jiacheng Hou, Heng Zhang, Danyang Zhang, Zhenjun Zhao, Haochi Zhang, Leixin Sun, Eric Hanchen Jiang, Yushan Li, Ruiyu Li, Mengkai Huang, Yan Gao, Xu Zhang, Guancheng Wan
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: automated content moderation, sparking growing interest, recently shown immense, shown immense potential, Large vision-language models
Comments:
Abstract:Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 2) Explanatory rationales are completely absent. Current frameworks measure exclusively whether a model flags a video correctly rather than explaining why, turning evaluation into a black box where models can succeed through superficial shortcuts. To address these problems, we present HarmVideoBench, a multi-layered diagnostic benchmark comprising 1,379 videos paired with 4,137 multiple-choice questions. HarmVideoBench benchmarks three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning, aiming to evaluate models' deep understanding beyond surface cues with carefully balanced and curated samples. We evaluate 19 leading models on HarmVideoBench to assess their multidimensional understanding of harmful videos. Moreover, we introduce BCR, a benchmark-aligned method that predicts reasoning boundaries and dynamically retrieves context only when needed. Experimental results show that BCR substantially improves the base model's performance in harmful video understanding, raising the macro average from 61.7 percent to a state-of-the-art 84.4 percent.
18. 【2606.27103】The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
Link: https://arxiv.org/abs/2606.27103
Authors: Bella Fascendini, Kathryn McGregor, Max D. Gupta, Thomas L. Griffiths
Categories: Computation and Language (cs.CL)
Keywords: reasoning, riddles, LLMs, riddle, Riddle riddles
Comments:
Abstract:Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks; however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretations. Identifying correct answers requires looking past the structure of each question and flexibly applying different reasoning strategies based on the content. If LLMs respond to surface features, such as form, a riddle-like structure should cause models to use an inventive reasoning strategy even when a literal interpretation suffices. Alternatively, if LLMs reason based on content, they should flexibly switch strategies when appropriate. Across two experiments with nine state-of-the-art LLMs and 100 human participants, we show humans and LLMs fail on this paradigm in opposite directions. LLMs were far more accurate on genuine riddles than on riddle riddles (84.9% vs. 50.7%), whereas humans showed the reverse effect (50.5% vs. 80.5%). Error analysis shows that 90.8% of LLM errors on riddle riddles (the condition where they show diminished performance) were due to inappropriate use of inventive reasoning, while only 57.6% of human errors on genuine riddles were due to overextending literal reasoning. Thus, while both groups make mistakes, reasoning mistakes are made more often by LLMs than by humans. Overall, LLMs' strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection, and without stimuli designed to elicit this contrast, it becomes easy to conflate LLM-generated outputs that look like reasoning with genuine reasoning.
19. 【2606.27069】Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning
Link: https://arxiv.org/abs/2606.27069
Authors: Stanisław Sójka,Felix Steffek,Matthias Grabmair
Categories: Computation and Language (cs.CL)
Keywords: objective case facts, disentangle objective case, Gated Multi-Task Learning, disentangle objective, objective case
Comments: 17 pages (8 pages main text), 5 figures, 9 tables. Accepted to the AI for Law Workshop at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea
Abstract:Legal outcome prediction must disentangle objective case facts from adjudicative context. Merit-based rulings rely on factual evidence while technical disposals may hinge on judicial discretion. We propose a Judge-Aware Gated Multi-Task Learning architecture that explicitly models this distinction. We introduce a fine-grained outcome taxonomy to supervise the encoder, enforcing a structural regularization that disentangles distinct semantic pathways. This granular legal curriculum enables our Gated Fusion mechanism to dynamically modulate reliance on judge identity. We evaluate our approach on 13,937 UK Employment Tribunal decisions. We benchmark our design against supervised fine-tuning (SFT) of a Gemma-4 26B-A4B backbone, in which judge identity and the taxonomy are injected as prompt tokens or autoregressive output targets. The two contextual signals compose only weakly when forced through a single autoregressive channel. In contrast, coupling a LoRA-adapted Gemma-4 encoder with our gated architecture defines a new state of the art on this benchmark while requiring an order of magnitude fewer trainable parameters than the generative SFT baselines, with gains concentrated on the most ambiguous and rarest outcome classes. Beyond accuracy, the architecture is interpretable; learned judge embeddings and calibration profiles localize the cases where adjudicative context drives the prediction. These results indicate that, for identity-conditioned classification of legal outcomes, the choice of conditioning interface dominates scale: differentiable structured composition yields more accurate, more parameter-efficient models than prompt-based composition over a substantially larger backbone.
20. 【2606.27047】NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models
Link: https://arxiv.org/abs/2606.27047
Authors: Henry Shaowu Yuchi,Michal Kucer,Benjamin H. Sims,Selma Peterson,Emily Taylor
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, demonstrated strong performance, significant challenge, demonstrated strong
Comments:
Abstract:Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and verbal. NuclearQAv2 is constructed using a hybrid pipeline that combines expert-authored questions, existing datasets, and LLM-assisted generation from domain-specific technical corpora. By leveraging structured prompting for both automated question generation and response evaluation, the proposed framework enables scalable benchmark construction and evaluation. We evaluate a diverse set of LLMs using NuclearQAv2 and observe substantial performance differences across task types. While the models generally perform well on factual questions, quantitative reasoning and conceptual understanding remain considerably more challenging. These results highlight the importance of multi-faceted evaluation frameworks and establish NuclearQAv2 as a scalable benchmark for assessing LLM capabilities in technical domains.
21. 【2606.27025】Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization
Link: https://arxiv.org/abs/2606.27025
Authors: Zhenhua Xu,Dongsheng Chen,Jian Li,Yitong Lin,Zhebo Wang,Jiafu Wu,Yizhang Jin,Chengjie Wang,Meng Han,Yabiao Wang
Categories: Computation and Language (cs.CL)
Keywords: Building general-purpose role-playing, Building general-purpose, profile remains challenging, natural-language profile remains, general-purpose role-playing agents
Comments:
Abstract:Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps -- Interaction Perception, Psychological Empathy, and Logical Construction -- so that the model thinks dynamically from the profile rather than merely mimicking surface patterns. While structured reasoning provides a foundation, it alone is insufficient; reinforcement learning is essential to further align the model with character fidelity. However, we observe that under LLM-based reward models, both generic phrases that hack the reward model and genuinely role-specific phrases receive identical gradient signals -- this hacking accumulates over training, misleading the model into treating both as equally optimal choices. To address this, we propose Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically -- amplifying role-specific tokens under positive advantage while attenuating them under negative advantage. Experiments on CoSER, CharacterBench, and CharacterEval demonstrate that Psy-CoT outperforms existing role-playing CoT methods, and RAPO consistently surpasses GRPO across multiple model scales.
22. 【2606.27023】Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
Link: https://arxiv.org/abs/2606.27023
Authors: Eren Senoglu,Federico Toschi,Nicolo Brunello,Andrea Sassella,Mark James Carman
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical Visual Question, Visual Question Answering, Multimodal large language, produce overconfident outputs, existing verbalized confidence
Comments:
Abstract:Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier style calibration term, an anchor regularizer that prevents confidence collapse toward extreme values, a contrastive image text alignment term, and a KL based model stabilization term. The alignment signal is derived from a $2 \times 2$ factorial perturbation design that crosses image presence with text integrity, probing the reliance of the model on visual modality input versus language priors. Finally, a top K KL divergence regularizer is used to protect the answering ability of the model during finetuning. Across three Medical VQA benchmarks and two architectures (MedGemma 4B IT and Qwen2 VL 7B Instruct), our method reduces calibration error by 60% or more, and improves discrimination by 26% or more, while preserving predictive accuracy. On average across benchmarks, the technique outperforms prompting based, sampling based, and training based approaches, and ablation experiments confirm that each component of the loss function is indeed necessary for improving the calibration. All code for the experiments is publicly available.
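To make the Brier-style calibration term in the abstract above concrete, here is a minimal sketch that scores verbalized confidences against 0/1 correctness. The function names, the binned calibration-error measure, and the toy data are assumptions for illustration only, not the authors' code; the anchor, alignment, and KL terms of the composite loss are not shown.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and 0/1 correctness."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned calibration error: weighted |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): a model that is right half the time.
correct = [1, 0, 1, 0]
overconfident = [0.95, 0.95, 0.95, 0.95]
calibrated = [0.5, 0.5, 0.5, 0.5]

print(brier_score(overconfident, correct), brier_score(calibrated, correct))
print(expected_calibration_error(overconfident, correct))
```

On this toy data the overconfident model is penalized more heavily by both measures, which is the behavior a calibration objective is meant to discourage.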
23. 【2606.27019】MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment
Link: https://arxiv.org/abs/2606.27019
Authors: Sander Land
Categories: Computation and Language (cs.CL)
Keywords: edit vocabularies, heavy and complex, makes it straightforward, straightforward to edit, comparatively heavy
Comments:
Abstract:The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list representation but simplifies training using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This removes the suffix array, the forward-backward pass, and the iterative prune loop, leaving a procedure that requires little beyond tokenizer inference itself. By making token count the primary objective and using a Unigram score only as a tiebreak, MinGram keeps the compression of pure token-count methods while retaining much of the morphological alignment and downstream quality of probabilistic ones. Across six languages, MinGram compresses better than both BPE and standard Unigram, and a compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment. In controlled downstream language-model training, Unigram-family tokenizers, with MinGram among the best, consistently beat BPE in bits-per-byte.
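The idea of making token count the primary objective with a Unigram score as tiebreak can be sketched as a small dynamic program. This is only an illustrative inference routine under assumed inputs (a hypothetical `vocab` mapping tokens to log-probabilities); it is not MinGram's implementation and does not cover the training steps (seed vocabulary, Hard EM, pruning) described in the abstract.

```python
import math
from typing import Dict, List, Tuple


def min_token_segment(text: str, vocab: Dict[str, float], max_len: int = 16) -> List[str]:
    """Segment `text` into the fewest vocabulary tokens.

    Ties in token count are broken by the higher total log-probability,
    i.e. the lower summed negative log-probability.
    """
    n = len(text)
    unreachable: Tuple[float, float] = (math.inf, math.inf)
    # best[i] = (token count, negative log-prob) of the best split of text[:i]
    best: List[Tuple[float, float]] = [unreachable] * (n + 1)
    back: List[int] = [-1] * (n + 1)
    best[0] = (0, 0.0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in vocab or best[start] == unreachable:
                continue
            count, cost = best[start]
            candidate = (count + 1, cost - vocab[piece])
            # Tuple comparison: count first, then cost as the tiebreak.
            if candidate < best[end]:
                best[end] = candidate
                back[end] = start
    if best[n] == unreachable:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens: List[str] = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


# Hypothetical toy vocabulary with log-probabilities.
toy_vocab = {"a": -1.0, "b": -1.0, "c": -1.0, "ab": -2.0, "bc": -3.0}
print(min_token_segment("abc", toy_vocab))
```

Here both `ab|c` and `a|bc` use two tokens, so the score decides between them; in a Hard EM loop, token counts gathered from such paths would then be used to re-estimate the scores.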
24. 【2606.26987】Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs
Link: https://arxiv.org/abs/2606.26987
Authors: Sinie van der Ben,Raphaël Baur,Yannick Metz,Mennatallah El-Assady
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Recent work identified, Claude Sonnet, causally influence behavior, human psychological structure, mirroring human psychological
Comments:
Abstract:Recent work identified emotion vectors in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1-valence correlations of $r = 0.76$ and $r = 0.83$, approaching the $r = 0.81$ reported for Claude Sonnet 4.5. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B-it, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B-Instruct-2509 exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2-arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated ones ($r \leq 0.21$), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.
25. 【2606.26986】ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models
Link: https://arxiv.org/abs/2606.26986
Authors: Xin Lin,Liang Zhang,Guoqi Ma,Hongyao Tu,Jinsong Su
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Open Relation Extraction, extract unseen relations, Large Language Models, Open Relation, Relation
Comments:
Abstract:Open Relation Extraction (OpenRE) requires a model to extract unseen relations between head and tail entities from unstructured text for real-world applications. The core challenge of OpenRE lies in achieving reliable generalization to unseen relation types. Current OpenRE approaches either employ clustering techniques, which cannot generate relation labels and suffer from poor generalization, or rely on direct relation label generation via Large Language Models (LLMs), which lack sufficient discriminative capacity to distinguish easily confused relations. To address these limitations, we propose Reasoning-guided progressive OpenRE (ReaORE), a framework for performing relation extraction through coarse-to-fine relation reasoning. Specifically, ReaORE consists of two key stages: (i) relation filtering, which reasons over multiple aspects to understand relations and instances, yielding an initial relation set, and further supplements and filters relations via embedding-based similarity to ensure the target relation is included; (ii) relation prediction, which aims to predict the target relations from the above set via fine-grained comparative reasoning to better distinguish easily confused relations. Extensive experiments on two widely used OpenRE datasets demonstrate that ReaORE outperforms existing baselines.
26. 【2606.26982】Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions
Link: https://arxiv.org/abs/2606.26982
Authors: Abla Bedoui,Ashley L. Greene,Mohammed Cherkaoui
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: mental health support, health support tools, Large language models, sensitive conversational applications, Large language
Comments:
Abstract:Large language models (LLMs) are increasingly being integrated into mental health support tools and other psychologically sensitive conversational applications. In such settings, behavioral stability and consistency are important for trustworthy human-AI interaction. However, semantically similar concerns can be presented through different contextual framings, potentially eliciting different model responses. Such framing-sensitive variability may challenge user expectations regarding system behavior and complicate the assessment of AI reliability. While prior studies have primarily examined such effects at the behavioral level, less is known about how framing-related variation is reflected in the internal representations of aligned language models. In this work, we investigate these effects using controlled matched prompts spanning multiple contextual framing conditions across several instruction-tuned model families. Across architectures, framing systematically alters interpretive response tendencies. Layer-wise probing analyses show that behavior-associated information remains decodable throughout transformer depth, with architecture-dependent variation in decoding strength. Moreover, held-out framing probes remained consistently above chance across architectures despite strong lexical baselines. Activation steering experiments further suggest that framing-associated representational directions can partially modulate downstream behavioral outcomes. Finally, these findings indicate that robustness to contextual variation may represent an important consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental-health-oriented interactions.
27. 【2606.26969】Einstein World Models
Link: https://arxiv.org/abs/2606.26969
Authors: Munachiso Samuel Nwadike,Zangir Iklassov,Ali Mekky,Zayd M. Kawakibi Zuhri,Kentaro Inui
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: direct experience, Einstein World Models, intelligence require, require the ability, phenomena beyond direct
Comments: 12 pages (9 without references), 2 figures, 1 algorithm
Abstract:Does intelligence require the ability to reason about phenomena beyond direct experience? It is natural to suspect that some complex thought cannot be captured through language alone. Of particular concern to this work, however, is whether visualising counterfactual events can complement language as a mechanism for complex thought. We ask whether LLMs can be trained to utilise such visualisation mechanisms in a way that benefits their reasoning abilities. Motivated by this question, we propose Einstein World Models. EWMs are a blueprint for LLM-based reasoning systems that place visual-temporal rollouts inside the reasoning trace, allowing them to reason in ways that text alone may not support well. In an EWM, the LLM calls a world-module (not to be confused with a world model) to produce short rollouts of scenes under consideration. The returned rollout is treated not as the answer, but as an inspectable hypothesis that can support later reasoning. Einstein World Models extend the capability of LLMs for tool calling (such as web search or code execution) into the domain of visual thought experiments.
28. 【2606.26968】RedVox: Safety and Fairness Gaps in Speech Models Across Languages
Link: https://arxiv.org/abs/2606.26968
Authors: Beatrice Savoldi,Sara Papi,Wafa Aissa,Matteo Negri,Luisa Bentivogli
Categories: Computation and Language (cs.CL)
Keywords: Speech-capable models, increasingly deployed, deployed in real-world, real-world applications, Speech-capable
Comments:
Abstract:Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting practices across state-of-the-art speech model releases, finding that only 8% document any multilingual analysis. To address this gap, we introduce RedVox, a multilingual safety and fairness benchmark for audio and speech built on real voices, covering unsafe and unfair stereotypical requests across five languages (English, French, Italian, Spanish, and German). Evaluating eight state-of-the-art models, we find that vulnerabilities persist even under non-adversarial conditions, worsen in non-English languages, and are amplified when the request comes from a spoken input. Finally, by surveying the participants who contributed to RedVox, we document the unique personal and privacy challenges of collecting speech data with human participants, pointing to broader sociotechnical challenges in naturalistic speech safety research.
29. 【2606.26963】Term-Centric Hierarchy Induction from Heterogeneous Corpora
Link: https://arxiv.org/abs/2606.26963
Authors: Elena Senger,Yuri Campbell,Jan-Peter Bergmann,Rob van der Goot,Barbara Plank
Categories: Computation and Language (cs.CL)
Keywords: Organizing knowledge, crucial for tasks, Organizing, exploratory domain mapping, Abstract
Comments:
Abstract:Organizing knowledge from diverse text sources into interpretable hierarchies is crucial for tasks such as policy analysis, innovation monitoring, and exploratory domain mapping. Existing taxonomy induction methods typically rely on document-level representations that capture entire documents rather than the specific domain concepts relevant for knowledge organization, limiting their ability to generalize across heterogeneous sources. We propose a term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora that scales to massive document collections. Our approach maps documents from diverse sources into a shared representation space using automatic term extraction, enabling robust cross-source alignment. Based on these representations, we construct interpretable hierarchies that integrate domain priors with data-driven clustering. Experiments on a novel English and German multi-source benchmark of over one million documents demonstrate that our method improves cross-source coherence and hierarchy quality over text- and summary-based baselines. A case study on German regional innovation analysis further demonstrates its practical utility for technology landscape mapping.
30. 【2606.26936】Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries
Link: https://arxiv.org/abs/2606.26936
Authors: Prarabdh Shukla,Ritik,Suhas Rao,Arpit Agarwal,Arjun Bhagoji
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: elicit actionable responses, average Jane, elicit actionable, actionable responses, non-expert malicious actors
Comments:
Abstract:With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit framework. This allows efficient online learning of the optimal jailbreak from a large choice set via noisy exploration on a small number of queries, with subsequent application of the learnt policy on an exploitation set. For the latter, we curate FrankensteinBench, a safety benchmark of 11,279 malicious queries drawn from manual curation over 7 existing benchmarks, along with automated enhancement and generation. Each query is categorized as simple or complex by the technical expertise required to craft it. Our findings confirm the concern. Our bandit-based attack achieves success rates as high as 97% on average over 15 SoTA open-weight LLMs. Moreover, adding complexity to queries raises the attack success rate by up to 26% on average across models -- making it an effective, automatable prompting strategy.
31. 【2606.26923】GAVEL: Grounded Caption Error Verification and Localization
Link: https://arxiv.org/abs/2606.26923
Authors: Zixian Gao,Atsushi Hashimoto,Kuniaki Saito
Categories: Computation and Language (cs.CL)
Keywords: inconsistent outputs, properly aligned, Grounded Caption Error, produce hallucinated, hallucinated or inconsistent
Comments: conference
Abstract:Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy and localizing its visual evidence. We introduce GAVEL (Grounded Caption Error Verification and Localization), a task that jointly addresses verification, explanation, and localization for image-text pairs. To support systematic evaluation, we also present a corresponding dataset and benchmark. We further train a supervised baseline on the human-annotated training split to assess whether GAVEL provides learnable supervision for these abilities. Experiments show that even strong closed-source models struggle on GAVEL, while the supervised baseline yields consistent improvements across grounding and explanation metrics.
32. 【2606.26901】SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages
Link: https://arxiv.org/abs/2606.26901
Authors: Subham Kumar,Prakrithi Shivaprakash,Abhishek Manoharan,Astut Kurariya,Diptadhi Mukherjee,Prabhat Chand,Pratima Murthy,Koustav Rudra,Lekhansh Shukla,Animesh Mukherjee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Automatic Speech Recognition, remains largely unknown, demographically diverse Indian, diverse Indian healthcare, Indian healthcare context
Comments:
Abstract:Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we first conduct a systematic audit of ASR performance on real-world psychiatric interview data spanning Kannada, Hindi and Indian English, comparing eight state-of-the-art models including IndicWhisper, WhisperLargeV3, Sarvam, GoogleS2T, Gemma3n, OmniLingual, Vaani, and Gemini. Our results reveal substantial variability across models and languages, with some systems performing competitively in Indian English but failing in regional speech. We further fine-tune two of the best-performing open-source models, i.e., Gemma3n and OmniLingual, using various methods. With this, we uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings, which are further mitigated by fairness-aware fine-tuning. To this end, we propose SamaVaani, a unified debiasing technique that simultaneously improves ASR performance and fairness across demographic groups.
33. 【2606.26880】Heterogeneous Neural Predictivity from Language Models During Naturalistic Comprehension
Link: https://arxiv.org/abs/2606.26880
Authors: Xiao Jia
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: representations provide structured, naturalistic language stimuli, Language-model representations provide, provide structured, high-dimensional annotations
Comments:
Abstract:Language-model representations provide structured, high-dimensional annotations of naturalistic language stimuli and can serve as informative neural predictors during comprehension. We analyzed locked derived data from Brain Treebank, MEG-MASC, and Podcast ECoG with eight frozen language models, blocked encoding models, and matched temporal, nuisance, and representation-capacity controls. Positive held-out prediction and gains over low-level baselines were widespread in source-level summaries. Across Brain Treebank and Podcast ECoG, 67 of 432 evaluable rows met a controlled predictive-only criterion, and model-side feature ablations changed prediction scores in most evaluable source rows. Brain-derived, timing-linked, acoustic, and implanted-signal controls confirmed component-level sensitivity of the analysis pipeline. These findings show that language-model-derived quantities can annotate neural activity during natural speech and text comprehension. Participant-level matched-control advantages were localized rather than uniform, response-profile and feature-specificity contrasts bounded representational or computational interpretations, and complete co-indexed integrated interpretation will require future jointly indexed coverage. Together, the analyses identify language-model features as useful neural predictors and separate predictive usefulness from claims about shared neural organization or language-processing computations.
34. 【2606.26875】Information-Aware KV Cache Compression for Long Reasoning
Link: https://arxiv.org/abs/2606.26875
Authors: Jushi Kai,Zhuiri Xiao,Alexandra Birch,Zhouhan Lin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, language models, size of key-value, capability has advanced, advanced rapidly
Comments:
Abstract:Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention weights to estimate token importance. While attention effectively captures contextual relevance, it overlooks complementary information-theoretic signals related to predictive uncertainty and token informativeness. In this paper, we revisit token importance from a forward-looking perspective and introduce Forward Influence, a metric that measures how compressed tokens affect future contexts. Our analysis reveals that tokens selected by attention scores mainly influence nearby contexts, whereas tokens associated with high predictive uncertainty exhibit substantially stronger influence on distant future contexts. Based on the observation, we propose InfoKV, an entropy-aware KV cache compression framework that incorporates information-theoretic signals. It combines token-level predictive uncertainty with layer-wise representation evolution and integrates the resulting entropy scores with attention scores during reasoning. Experiments on long-context reasoning benchmarks with Llama-3.1, Llama-3.2, and DeepSeek-R1 demonstrate that InfoKV consistently outperforms existing attention-based KV compression methods in both long prefilling and decoding scenarios.
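As a rough illustration of integrating entropy scores with attention scores when choosing which cached tokens to keep, the sketch below mixes the two signals with a single weight. The min-max normalization, the convex mix controlled by `alpha`, and all function names are assumptions made for this example; the abstract does not specify InfoKV's actual scoring rule, and the layer-wise representation signal is not modeled here.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def _min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_kv_to_keep(
    attn_scores: Sequence[float],
    entropies: Sequence[float],
    budget: int,
    alpha: float = 0.5,
) -> List[int]:
    """Return sorted positions of the `budget` tokens with the highest mixed score.

    alpha = 0 uses attention only; alpha = 1 uses entropy only (assumed mix).
    """
    if len(attn_scores) != len(entropies):
        raise ValueError("score lists must have equal length")
    attn = _min_max(attn_scores)
    ent = _min_max(entropies)
    mixed = [(1 - alpha) * a + alpha * e for a, e in zip(attn, ent)]
    ranked = sorted(range(len(mixed)), key=lambda i: mixed[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical per-token scores for a 4-token context.
attn = [0.9, 0.1, 0.2, 0.8]
ent = [0.0, 2.0, 0.1, 0.3]
print(select_kv_to_keep(attn, ent, budget=2, alpha=0.0))
print(select_kv_to_keep(attn, ent, budget=2, alpha=1.0))
```

With attention alone the high-entropy token at position 1 is evicted, while weighting entropy keeps it, which mirrors the abstract's point that the two signals favor different tokens.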
35. 【2606.26861】Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT
Link: https://arxiv.org/abs/2606.26861
Authors: Jinghan Wang,Yanjun Chen,Wei Zhang,Xiaotong Huang,Tianchen Liu,Gaoliang Peng
Categories: Computation and Language (cs.CL)
Keywords: Deploying large language, Internet of Things, behavior remains unpredictable, devices demands extreme, cross-architecture behavior remains
Comments: This work has been submitted to the IEEE Internet of Things Journal for possible publication
Abstract:Deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices demands extreme compression, yet existing structured pruning methods collapse at high compression ratios due to one-shot importance estimation, and their cross-architecture behavior remains unpredictable. This article presents a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages to re-estimate component importance. An information-theoretic analysis motivates this ordering, and the Structural Independence Assumption (SIA) is formalized as a checkable condition predicting whether per-component pruning criteria are reliable for a given architecture: Multi-Head Attention (MHA)+GELU designs satisfy the SIA, whereas Grouped Query Attention (GQA)+SwiGLU designs violate it. On bearing fault diagnosis spanning 88M to 6.25B-parameter models, the framework extends achievable compression to 13.8 times on MHA+GELU architectures with 83.82% accuracy (+3.70 percentage points (pp) over the strongest baseline), while exposing a ~74pp accuracy collapse on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%, demonstrating viability for IIoT edge inference.
36. 【2606.26859】AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems
Link: https://arxiv.org/abs/2606.26859
Authors: Changxin Lao, Fei Pan, Guozhuang Ma, Han Li, Huihuang Lin, Jijun Shi, Kangzhi Zhao, Kun Gai, Mo Zhou, Qinqin Zhou, Quan Chen, Ruochen Yang, Shifu Bie, Shuang Yang, Shuo Yang, Wenhao Li, Wentao Xie, Xiao Lv, Xuming Wang, Yijun Wang, Yiming Chen, Yusheng Huang, Zhongyuan Wang, Zibo Zhao, Zijie Zhuang, Baoning Xia, Chao Liu, Chaoyi Ma, Chubo He, Dawei Cong, Feng Jiang, Gang Wang, Guilin Xia, Hanwen Xu, Jiahong Xie, Jiahui Qiao, Jian Liang, Jiangfan Yue, Jing Wang, Jinghan Yang, Jinghui Jia, Kan Qin, Lei Wang, Ming Li, Peilin Song, Pengbo Xu, Qiang Luo, Ruiming Tang, Shiyang Liu, Shuxian Jin, Tao Wang, Tao Zhang, Xiang Gao, Xianghan Li, Yingsong Luo, Yiwen Ning, Yongcheng Liu, Yuan Guo, Zhaojie Liu, Zhenkai Cui
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: transition remains blocked, Recommendation algorithm iteration, attribute online results, structural execution bottleneck, engineer-bound process
Comments: Authors are listed alphabetically by their first name
Abstract:Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge. We present AgentX, a production-deployed multi-agent system that fundamentally restructures this production function. AgentX operates as a self-evolving development engine: it autonomously generates, implements, evaluates, and learns from recommendation experiments at a scale and pace that no manual workflow can sustain. The system orchestrates four tightly coupled stages in a closed loop. A Brainstorm Agent synthesizes evidence from historical experiments, system architecture, data analysis, and external research into ranked, executable proposals. A Developing Agent translates each proposal into production-ready code through repository-grounded generation and multi-dimensional reliability verification. An Evaluation Agent conducts safe online rollout with guardrail-vetoed A/B judgment, converting both successes and failures into structured knowledge assets. A Harness Evolution layer (SGPO) then distills execution trajectories into semantic-gradient updates that continuously sharpen the agents themselves -- making the system not merely automated, but self-improving.
37. 【2606.26819】FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following
Link: https://arxiv.org/abs/2606.26819
Authors: Zhihang Xie, Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
Categories: Computation and Language (cs.CL)
Keywords: shared task, Instruction Following shared, paper describes, describes our submission, IWSLT
Comments:
Abstract:This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLMs are developed for both short-form and long-form speech instruction following under constrained settings. For the short track, strong performance is achieved on MCIF, with a SIFS score of 2.0708. For the long track, three speech segmentation methods are explored, and the HIFS score is introduced to account for unstable long-form generation. Experimental results show that fixed 30-second segmentation provides the most robust long-form performance, achieving the highest HIFS score of 2.0663. Further analysis shows that hallucination mainly manifests as repetitive insertions in generated outputs, substantially affecting ASR and SSUM, while short-form capabilities are largely retained after long-form extension.
38. 【2606.26807】KARLA: Knowledge-base Augmented Retrieval for Language Models
Link: https://arxiv.org/abs/2606.26807
Authors: Francois Crespin, Fabian M. Suchanek (IP Paris, LTCI), Nils Holzenberger
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: LLM output, LLM, knowledge base, automatically pull, knowledge
Comments:
Abstract: We propose a new method that allows an LLM to automatically pull in factual knowledge from a knowledge base during token generation. This means that (1) factual knowledge in the LLM output can be updated without retraining the LLM, (2) facts in the LLM output can be traced to the knowledge base for transparency and explainability, and (3) smaller models can achieve the same factual accuracy as larger models. Our core idea is to train the model to produce special tokens that trigger a query to the knowledge base. Our experiments show that our method improves factual grounding in both short and long-form generation, and allows factual revisions to take effect through KB edits rather than parameter updates.
39. 【2606.26803】From Vajrayana Tara to Bengali Baul: A Computational Study of Lexical Transmission Across Buddhist, Shakta, and Vaishnava Traditions in Bengal
Link: https://arxiv.org/abs/2606.26803
Authors: Joy Bose
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: encompassing Buddhist Vajrayana, Shakta Tantra tradition, Shakta Tantra, computational corpus study, Shakta Kali texts
Comments: 9 pages, 2 figures, 4 tables. Code and corpus: [this https URL](https://github.com/joyboseroy/bengal-dharma-corpus) Dataset: [this https URL](https://huggingface.co/datasets/joyboseroy/bengal-dharma-corpus)
Abstract:We present a computational corpus study of vocabulary relationships across eight tradition layers of Bengali and Sanskrit devotional literature spanning the 8th to 19th centuries, encompassing Buddhist Vajrayana, Shakta Tantra, Vaishnava, and Baul traditions. Using a corpus of 75 texts and TF-IDF character n-gram vectorization with cosine similarity analysis, we address the historically argued but previously unquantified claim that Buddhist Vajrayana vocabulary survived the collapse of the Pala monasteries and was absorbed into the Shakta Tantra tradition of Bengal. The central finding is a specificity result: the Gitagovinda (Vaishnava Sanskrit, 12th century) has zero cosine similarity to Shakta Kali texts, while Bridge Tara texts (Buddhist-Shakta transitional, same century, same language) have cosine similarity 0.54 to Shakta Kali. This 8.5-fold contrast between two Sanskrit traditions from the same century demonstrates that the Buddhist-Shakta vocabulary overlap is not a generic property of Sanskrit devotional literature but is specific to the Buddhist-Shakta transmission chain. Three Brihannilatantra Tara texts show Shakta-to-Buddhist vocabulary ratios of 2.0 to 4.0, constituting measurable evidence of lexical transition within that chain. Ramprasad Sen's 18th-century Bengali Kali songs preserve Buddhist vocabulary residue including 56 occurrences of Tara alongside 103 occurrences of Kali. The Vaishnava Bengali tradition contributes a parallel chain to modern Baul vocabulary (similarity 0.29), slightly weaker than the Buddhist Sahajiya chain via Charyapada (0.31). These results provide the first quantitative multi-tradition corroboration of historically argued Buddhist-Shakta syncretism in Bengal.
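The measurement pipeline named in the abstract, TF-IDF over character n-grams followed by cosine similarity, can be sketched in plain Python. The n-gram length of 3, the smoothed IDF formula, and the toy strings are assumptions for illustration; the abstract does not state the paper's settings, and the strings are not drawn from its corpus.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """One sparse TF-IDF dict per document, using a smoothed IDF."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    idf = {g: math.log((1 + total) / (1 + k)) + 1.0 for g, k in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy strings standing in for texts (hypothetical, not the paper's data).
docs = ["tara tarini", "tara kali", "govinda"]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # the two strings share trigrams
sim_unrelated = cosine(vecs[0], vecs[2])  # no trigram in common
```

Two texts with no character n-gram in common get a similarity of exactly zero, which is the kind of contrast the abstract reports between the Gitagovinda and the Shakta Kali texts.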
40. 【2606.26790】OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
Link: https://arxiv.org/abs/2606.26790
Authors: Shuo Yang, Jinyang Wu, Zhengxi Lu, Yuhao Shen, Fan Zhang, Lang Feng, Shuai Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Categories: Computation and Language (cs.CL)
Keywords: Outcome-based reinforcement learning, sparse trajectory-level rewards, Outcome-based reinforcement, stable optimization backbone
Comments:
Abstract: Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when critical decisions are identified and falls back to episode-level skills as default guidance otherwise. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both original and skill-augmented contexts. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop and Search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at this https URL.
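The re-scoring step can be sketched as follows. The abstract only says the log-probability shift is "combined with" the outcome advantage; the additive form, the weight `beta`, and the function name below are assumptions, not the paper's formulation.

```python
def token_advantages(logp_original, logp_skill, outcome_advantage, beta=0.1):
    """Combine a trajectory-level outcome advantage with a per-token term
    given by the log-probability shift between two contexts.

    logp_original / logp_skill: log-probabilities of the same sampled tokens
    under the original and the skill-augmented context.
    beta is a hypothetical weight on the self-distillation term.
    """
    if len(logp_original) != len(logp_skill):
        raise ValueError("both contexts must score the same tokens")
    shifts = [s - o for o, s in zip(logp_original, logp_skill)]
    return [outcome_advantage + beta * d for d in shifts]

# Toy log-probabilities for a three-token response.
advantages = token_advantages(
    logp_original=[-2.0, -1.0, -0.5],
    logp_skill=[-1.0, -1.0, -1.5],
    outcome_advantage=1.0,
    beta=0.5,
)
```

Tokens that become more likely once the skill is in context receive a larger advantage than the shared outcome signal, and tokens that become less likely receive a smaller one.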
41. 【2606.26787】AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing
Link: https://arxiv.org/abs/2606.26787
Authors: Chennan Ma, Yanning Zhang, Siqi Hong, Xiuchong Wang, Fei Xiao, Keping Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: cumulative Gross Merchandise, Return on Investment, Gross Merchandise, Traditional dynamic pricing, Traditional dynamic
Comments: Accepted by KDD 2026 Applied Data Science Track (Oral presentation)
Abstract:Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross Merchandise Value (GMV), Return on Investment (ROI) and milestone achievement. We propose AIGP, a novel framework that leverages a Large Language Model (LLM) prompted with domain knowledge, structured data and textual context to make interpretable, knowledge-aware pricing decisions. For efficient deployment while maintaining high-quality outputs, we employ supervised fine-tuning for knowledge distillation. Central to AIGP is the Long-Term Value Estimator (LTVE), trained via offline reinforcement learning on historical data, which serves as a reward model to score candidate pricing actions and select preference pairs for Direct Preference Optimization (DPO), thereby aligning the pricing policy with long-term business objectives. Extensive offline evaluations and large-scale online A/B tests on Tao Factory demonstrate that AIGP achieves significant improvements: +13.21% in GMV, +7.59% in ROI, and +8.20% in milestone achievement rate over 14 days compared to the production baseline, while simultaneously providing interpretable and transparent pricing rationales.
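Using a value estimator to pick preference pairs for DPO can be sketched as below. The best-versus-worst pairing rule, the `min_margin` filter, and the toy scoring function are assumptions for illustration; the abstract does not describe how AIGP forms its pairs.

```python
def select_preference_pair(candidates, score_fn, min_margin=0.0):
    """Return (chosen, rejected) as the best- and worst-scored candidates,
    or None when the score gap does not exceed min_margin."""
    if len(candidates) < 2:
        return None
    scored = sorted(candidates, key=score_fn)
    rejected, chosen = scored[0], scored[-1]
    if score_fn(chosen) - score_fn(rejected) <= min_margin:
        return None
    return chosen, rejected

# Hypothetical stand-in for a learned long-term value estimator:
# a toy score that peaks at a price of 10.
def toy_value(price):
    return -(price - 10.0) ** 2

pair = select_preference_pair([8.0, 10.0, 15.0], toy_value, min_margin=1.0)
```

A margin filter of this kind drops pairs whose candidates the estimator cannot clearly separate, which keeps near-ties out of the preference data.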
42. 【2606.26783】Reproducibility Study of "AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models"
Link: https://arxiv.org/abs/2606.26783
Authors: Ananth K S, Arya Hariharan
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Fang, AlphaEdit, editing, editing methods, knowledge editing methods
Comments: 21 pages, 2 figures
Abstract: Fang et al. (2025) introduced a null-space constrained projection, named AlphaEdit, for locate-then-edit knowledge editing methods, theoretically guaranteeing that edits do not disrupt previously preserved knowledge, and reported substantial gains over existing editing methods on LLaMA3, GPT2-XL, and GPT-J. In this work, we present a reproducibility study of AlphaEdit, reproducing its reported results under the original experimental setup and extending the evaluation along three axes: new model architectures, additional downstream benchmarks, and substantially longer sequential editing horizons. We successfully reproduce AlphaEdit's reported metrics across the original models, though we identify a discrepancy in the reported fluency and consistency metric. Extending AlphaEdit to newer model families, we find that its advantage does not generalize uniformly, which we trace to architectural assumptions in the locate-then-edit paradigm that are violated by these newer models. We further stress-test AlphaEdit's central sequential-editing claim by extending the number of edits well beyond those evaluated in the original paper, and find that performance, which is stable at the originally reported scale, degrades as edits reach a much higher count, indicating that the null-space projection's protection against catastrophic forgetting is bounded rather than unconditional. Finally, we extend evaluation of edited models on three extra benchmarks, namely, BoolQ, HellaSwag, and XSTest, and we find that large-scale sequential editing degrades both general downstream task competence and safety-relevant refusal behavior. Our results confirm that AlphaEdit performs as reported within its original scope, while showing that its core theoretical guarantees are sensitive to model architecture and editing scale in ways that have practical implications for its deployment.
43. 【2606.26775】Evaluation Pitfalls and Challenges in Multimedia Event Extraction
Link: https://arxiv.org/abs/2606.26775
Authors: Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: comprehensive event understanding, Multimedia event extraction, event extraction aims, jointly identify events, text and images
Comments: Accepted to ACL 2026
Abstract:Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding. While recent work reports steady and substantial progress, the reliability and comparability of these results critically depend on consistent and rigorous evaluation. In this work, we present the first systematic analysis of evaluation pitfalls in multimedia event extraction and identify three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings. We demonstrate, through a series of controlled experiments under a strict evaluation framework, that minor evaluation choices can cause large performance variations and lead to overestimation of a model's ability to ground real-world events across modalities. Our findings highlight the need for comparable evaluation standards and encourage a shift toward more rigorous evaluation in multimedia event extraction.
44. 【2606.26753】ConvMemory v3: A Validity Context Layer for Conversational Memory via Target-Conditioned Relation Verification
Link: https://arxiv.org/abs/2606.26753
Authors: Taiheng Pan
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Conversational memory retrieval, memory retrieval optimizes, retrieval optimizes relevance, Conversational memory, retrieved memory
Comments: 22 pages, 3 figures
Abstract:Conversational memory retrieval optimizes relevance, yet a retrieved memory can be relevant and simultaneously outdated: a later turn updates, corrects, or supersedes it. ConvMemory v3 adds a validity context layer that detects and surfaces this update evidence through target-conditioned relation verification, sitting after the v1/v2 retrieval path. The core mechanism is a dual-evidence gate that conditions a relation judgment on the specific target proposition, scoring a (target, source) pair through the product of a MiniLM slot head and a DeBERTa-v3 slot head and gating it by conservative event/operation evidence. On a synthetic multi-hop validity benchmark the gate reaches 90.12% +/- 1.73 accuracy; through a real-data feedback loop that mines failure patterns but trains on synthetic pairs only, the verifier transfers to Memora role binding with zero target-side labels, reaching 98.8% +/- 0.9 group-all-correct. The deployed layer preserves retrieval by default: a context mode attaches structured validity metadata while keeping the candidate set and rank order fixed, and a query-conditioned demote mode is an explicit opt-in for dense current-state workloads, where it raises current-active H@1 from a never-demote baseline of 45.1% to 95.7% +/- 1.2 while protecting non-superseded memories at 99.4% recall. Six machine-verifiable safety contracts pin the layer's behavior. Multi-hop graph propagation is validated as a mechanism; fully automatic construction of strict prerequisite edges is characterized as a boundary, since strict necessity requires counterfactual world knowledge. This report extends ConvMemory v1 (arXiv:2605.28062) and v2 (arXiv:2606.10842).
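The dual-evidence gate, a product of two head scores that is then gated by separate evidence, can be sketched as below. The function names, the boolean form of the evidence flag, and the 0.5 decision threshold are assumptions; only the product-then-gate structure comes from the abstract.

```python
def dual_evidence_score(head_a_prob, head_b_prob, has_event_evidence):
    """Product of two head probabilities, zeroed unless the separate
    event/operation evidence flag is set."""
    for p in (head_a_prob, head_b_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("head outputs must be probabilities in [0, 1]")
    return head_a_prob * head_b_prob if has_event_evidence else 0.0

def is_superseded(head_a_prob, head_b_prob, has_event_evidence, threshold=0.5):
    """Flag a (target, source) pair as an update only when the gated
    product clears a hypothetical decision threshold."""
    score = dual_evidence_score(head_a_prob, head_b_prob, has_event_evidence)
    return score > threshold

both_confident = is_superseded(0.9, 0.9, True)
one_unsure = is_superseded(0.9, 0.5, True)
no_evidence = is_superseded(0.9, 0.9, False)
```

Because the scores are multiplied, a single confident head cannot carry the decision alone, and without event evidence the pair is never flagged, which makes the gate conservative by construction.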
45. 【2606.26749】Structure Before Collapse: Transient semantic geometry in next-token prediction
Link: https://arxiv.org/abs/2606.26749
Authors: Yize Zhao, Isabel Papadimitriou, Christos Thrampoulidis
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Neural Collapse predicts, balanced one-hot classification, one-hot classification pushes, classification pushes model, Neural Collapse
Comments:
Abstract: Neural Collapse predicts that balanced one-hot classification pushes model representations to be equally far from each other, a symmetric configuration that depends only on the output label and ignores any semantic similarity in the inputs. This creates a puzzle: next-token prediction language models are trained predominantly (as context length increases) with one-hot labels, since the same context is very unlikely to appear twice in training with different labels. However, they clearly learn latent structural features. That is, despite the one-hot training regime, a language model's contextual embeddings represent the fact that the next word in "Mary broke the ___" is likely to be filled by tokens in the latent classes of a) medium-sized, b) rigid, c) inanimate nouns. How does gradient descent find such categorical semantic structure when co-occurrence statistics collapse to one-hot sparsity, eliminating any shared next-tokens among different contexts? To investigate this tension we identify three synthetic controlled settings where inputs have latent semantic factors but are mapped to distinct one-hot labels. We find that semantic geometry emerges early in training, and that representations cluster by shared attributes despite receiving no explicit supervision to do so. This structure is transient: with sufficient capacity and time, the model eventually reaches the predicted symmetric state where all representations are equally separated. We study this phase transition through Gram matrix analysis and propose a preliminary modification to the commonly used unconstrained features model to capture the emergent semantic geometry.
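For reference, the "equally separated" end state is usually written as a simplex equiangular tight frame in the Neural Collapse literature. The notation below (K classes, unit-norm centered representations) is an assumption and does not appear in the abstract.

```latex
% Gram matrix of K unit-norm, centered representations \mu_1, \dots, \mu_K
% arranged as a simplex equiangular tight frame:
G_{ij} = \mu_i^\top \mu_j =
\begin{cases}
  1, & i = j, \\
  -\dfrac{1}{K-1}, & i \neq j,
\end{cases}
\qquad
\lVert \mu_i - \mu_j \rVert^2 = 2 - 2\,G_{ij} = \frac{2K}{K-1}
\quad \text{for all } i \neq j .
```

Every off-diagonal entry is the same, so the Gram matrix carries no information about which inputs are semantically related. Block structure in the off-diagonal entries during training is therefore one way a transient semantic geometry would show up in a Gram matrix analysis.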
46. 【2606.26744】HyperDFlash: MHC-Aligned Block Speculative Decoding with Gated Residual Reduction
Link: https://arxiv.org/abs/2606.26744
Authors: Luxi Lin, Shuang Peng, Rui Ma, Junhao Hua, Shuwei Fan, Zhengda Qin, Qiang Wang, Hongjian Sun, Fangmin Chen, Songwei Liu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: decoding framework tailored, architecture proposed, framework tailored, MHC residual streams, block-parallel speculative decoding
Comments:
Abstract:We present HyperDFlash, a block-parallel speculative decoding framework tailored to the novel multi-hyper-connection (MHC) architecture proposed by DeepSeek-V4. Despite the strong initial-token drafting performance of the native Multi-Token Prediction (MTP) module in DeepSeek-V4, its draft accuracy degrades sharply at later positions, as error accumulation from unverified intermediate tokens harms acceptance rates. Although the original DFlash method supports efficient one-pass block drafting, it cannot be seamlessly adapted to the MHC paradigm, since the multi-path residual stream of DeepSeek-V4 induces feature misalignment with conventional drafting designs. To resolve this mismatch, we propose two model-aligned optimizations for MHC residual streams. First, we adopt pre-collapse residual states as the exclusive conditioning signal, preserving multi-path structural information and aligning the drafter with the native prediction pathway of the target model. Second, we replace the heavy generic linear compressor with a lightweight gated residual reducer, whose parameters are inherited from the built-in hyper-connection head. This design yields input-aware path aggregation with three orders of magnitude fewer parameters while maintaining architectural alignment. We further enhance training via a targeted KL distillation loss applied to the LM-head, which regularizes predictions against the full target probability distribution and improves draft quality at early training stages. Experiments across math reasoning, code synthesis, and conversational benchmarks show that HyperDFlash consistently outperforms both the native MTP baseline and vanilla DFlash adaptation. It achieves substantial gains in average accepted draft length and decoding speedup, validating the effectiveness of MHC alignment, gated reduction, and targeted distillation for high-performance speculative decoding.
47. 【2606.26698】Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification
Link: https://arxiv.org/abs/2606.26698
Authors: Eleni Papadopulos, Firoj Alam, Giovanni Da San Martino
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: fast-paced information era, today fast-paced information, defined as defective, inevitably contribute, information era
Comments:
Abstract: In today's fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, fallacies often appear in nuanced forms that complicate automated classification. In this study, we investigate whether merging abstract logical structures with context-level linguistic cues proves beneficial for fallacy classification, developing a framework that inductively extracts such patterns from fallacious examples and their explanations using Large Language Models (LLMs). We evaluate the impact of these patterns across different LLMs and experimental zero- and one-shot configurations, showing statistically significant improvements over zero-shot baselines and outperforming competing approaches. Cross-dataset experiments validate generalization, establishing data-driven pattern extraction as an effective method for generating logical representations.
48. 【2606.26686】Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation
Link: https://arxiv.org/abs/2606.26686
Authors: Dongbin Na
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: recent guardrail methods, guardrail methods generate, issue a verdict, order to screen, screen a prompt
Comments: 9 pages, 6 figures, 3 tables. Project page: [this https URL](https://ndb796.github.io/LeanGuard); code and models: [this https URL](https://github.com/ndb796/LeanGuard)
Abstract: To screen a prompt or a response, recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed: a guardrail often needs to be light and fast, and it often runs on-device, for example on an embodied robot. In this paper, we ask whether a safety guardrail really needs to reason. To answer this question, we train a lightweight bidirectional encoder and a reasoning guard on the same corpus, and we then remove only the reasoning while we keep everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M label-only encoder reaches an average F1 of 82.90 ± 0.26 over public benchmarks. It matches a reasoning guard that is built on a much larger decoder, while it uses only a single forward pass over an input of at most 512 tokens. This is roughly a 100x reduction in inference compute. We further show that this label-only encoder stays robust under training-label noise and retains far more recall at a strict false-positive rate than the reasoning guard, so a heavier reasoning guard is not the more robust choice either. Our finding suggests that the current guardrail benchmarks may not be hard enough to reward reasoning, and that the necessity of CoT for moderation is still not proven. We release all source code and models including LeanGuard at this https URL.
49. 【2606.26654】SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context
Link: https://arxiv.org/abs/2606.26654
Authors: Qinkai Zhang, Yanyan Zhao, Xin Lu, Yulin Hu, Pengtao Han, Bing Qin
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Personalized language-model assistants, Personalized language-model, memory lens, explicitly stated, model recall preferences
Comments:
Abstract:Personalized language-model assistants are often evaluated through a memory lens: can a model recall preferences users have explicitly stated in dialogue? More comprehensive personalization demands a harder capability -- inferring what users care about from the multimodal traces they naturally leave behind. We introduce SocialPersona, a benchmark for evaluating whether multimodal large language models (MLLMs) can recover revealed preferences from longitudinal social-media timelines and use them in dialogue. Built from longitudinal timelines of 171 everyday, non-promotional social-media users, SocialPersona contains text, images, timestamps, and 2,597 human-verified preference tags across seven interest domains, separating stable interests from recent interests. It supports two tasks: constructing structured user profiles from multimodal context and generating responses aligned with inferred profiles. Experiments with proprietary and open-weight MLLMs show that models can identify broad interest domains, yet their performance drops on fine-grained and recent interests and degrades further when inferred profiles must be used to personalize dialogue. Together with evidence that text and images provide complementary preference signals, these results indicate that robust cross-modal, long-horizon user modeling remains a key challenge, and that SocialPersona can help measure and advance progress toward assistants that infer and act on revealed preferences.
50. 【2606.26650】CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs
Link: https://arxiv.org/abs/2606.26650
Authors: Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Anbang Yao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Cost-efficient and Accurate, Accurate Ternary Quantization, Accurate Ternary, Ternary Quantization, compressing and accelerating
Comments: This work is accepted to ICML 2026 as an oral. The project page: [this https URL](https://github.com/IntelChina-AI/BitTern)
Abstract: In this paper, we present CAT-Q, Cost-efficient and Accurate Ternary Quantization, for compressing and accelerating LLMs. Unlike existing state-of-the-art ternary quantization methods that rely on data-intensive and costly quantization-aware training to mitigate severe performance degradation, CAT-Q is a simple yet effective post-training quantization scheme that is readily applicable to LLMs with diverse architectures and model sizes. It has two key components, learnable modulation (LM) and softened ternarization (ST), which are coupled from an optimization perspective. LM leverages a composition of learnable factors to modulate the distribution of pre-trained high-precision weights and the ternary threshold, making them less sensitive to ternarization. ST further introduces a differentiable transition function to guide the ternarization process toward stable convergence. We show that, for pre-trained LLMs with 1.7B to 8B parameters, CAT-Q can efficiently quantize them into ternary models using only 512 calibration samples, while outperforming the seminal BitNet 1.58-bit v1 and v2 families (with 1.3B to 7B parameters) trained with 100B tokens, yielding about a 100,000x reduction in training tokens. Moreover, we show for the first time that CAT-Q can quantize much larger pre-trained LLMs having 14B to 235B parameters into leading ternary models within just 8 to 60 hours on 8 A100-80GB GPUs. Code is available at this https URL.
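As background for what "ternarization" means, here is a sketch of the classic fixed-threshold baseline, in which each weight becomes a shared scale times -1, 0, or +1. This is deliberately not CAT-Q: it has no learnable modulation and no softened transition function, and the 0.7 threshold factor is a common heuristic used here as an assumption.

```python
def ternarize(weights, threshold_factor=0.7):
    """Map weights to scale * {-1, 0, +1} with a fixed threshold.

    threshold = threshold_factor * mean(|w|); scale is the mean |w| over
    the weights whose magnitude exceeds the threshold.
    """
    if not weights:
        return [], 0.0
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_factor * mean_abs
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [scale * c for c in codes]

weights = [0.9, -0.05, 0.4, -1.1, 0.02, -0.6]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
```

The hard threshold makes this mapping non-differentiable, which is the kind of problem a softened, differentiable transition such as the ST component described in the abstract is meant to address.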
51. 【2606.26629】From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning
Link: https://arxiv.org/abs/2606.26629
Authors: Evan Ning, Wei Xue, Dong Lou, Yike Guo
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Elastic Weight Consolidation, large language models, Weight Consolidation, Elastic Weight, Weight-space regularization methods
Comments: 21 pages, 4 figures, 6 tables
Abstract: Weight-space regularization methods such as Elastic Weight Consolidation (EWC) are the standard approach to catastrophic forgetting in continual learning. However, these methods tend to underperform when applied to large language models. We argue that such underperformance can be partly explained by the "polysemantic" nature of large language models: per-weight importance estimates utilized by EWC-style regularization are too coarse and cannot isolate the knowledge that needs protection. In this paper, we propose regularizing instead in the model's activation space, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary. From the perspective of constrained optimization, we derive a new loss function that uses the SAE feature dictionary to explicitly balance stability and plasticity, and show that EWC is a special case in the one-sided weight-space penalty setting. Unlike replay-based methods that store or revisit examples from earlier tasks, our method requires no previous-task data after mask construction: current-task data is used to compute a compact SAE feature mask, and only this mask is retained for later training. Further, since the feature space has significantly lower dimensionality than the parameter space, the proposed method is more memory efficient. On the TRACE and MedCL continual learning benchmarks, the method achieves the strongest result among approaches without introducing task-specific architectural components, also surpassing traditional weight-space regularization methods like EWC. Beyond performance comparisons, we provide empirical evidence for the polysemanticity thesis: task-relevant representations are linearly separable in the SAE feature basis but indistinguishable from chance in the weight basis, and weight-space protection is nearly non-selective at the concept level.
52. 【2606.26618】Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean
Link: https://arxiv.org/abs/2606.26618
Authors: Phannet Pov, Sovandara Chhoun, Hyun Woo Park, Wan-Sup Cho, Saksonita Khoeurn
Categories: Computation and Language (cs.CL)
Keywords: Large pretrained, tokenizer-free TTS model, TTS model, tokenizer-free TTS, Large
Comments: 5 pages, 1 figure, 4 tables. IEEE conference format (IEEEtran)
点击查看摘要
Abstract:Large pretrained text-to-speech (TTS) models sound almost human for well-resourced languages, but much worse for languages that are rare in their training data. We study this quality gap for Khmer and Korean using VoxCPM2, a 2.4B-parameter, tokenizer-free TTS model that joins a MiniCPM-4 language-model backbone with a flow-matching diffusion decoder. We build one shared, language-tagged corpus of about 26 hours and adapt VoxCPM2 with a single Low-Rank Adaptation (LoRA) adapter, trained on both languages at once and added to both the language model and the decoder. The adapter is zero-initialized, so training starts exactly at the original (zero-shot) model. In native-speaker listening tests, the Khmer Mean Opinion Score (MOS) rises from 3.85 to 4.23 with the best adapter (rank 64), a highly significant gain (paired Wilcoxon test, p < 0.001), while training only 0.19 to 3.03 percent of the parameters. The automatic loss and the human ratings, however, disagree on the best rank: validation loss is lowest at rank 128, yet MOS peaks at rank 64. The same adapter brings no gain for Korean, a language the base model already handles well, and at a high rank it even degrades quality. Adaptation therefore helps mainly where the base model is genuinely weak.
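Editor's note: the abstract's remark that a zero-initialized adapter starts training exactly at the zero-shot model can be illustrated with a small sketch. This is generic LoRA on a single linear layer, not VoxCPM2's actual code; the function names, the `scale` parameter, and the toy shapes are assumptions made for illustration.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def init_lora(d_out, d_in, rank, seed=0):
    """Return (A, B): A is small random (rank x d_in), B is all zeros (d_out x rank)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]  # zero init, so B @ A == 0 at step 0
    return A, B


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x); W stays frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy frozen weight and input (illustrative values only).
W = [[1.0, 2.0], [3.0, 4.0]]
A, B = init_lora(d_out=2, d_in=2, rank=1)
x = [0.5, -1.0]
y_adapted = lora_forward(W, A, B, x)
y_base = matvec(W, x)
```

Because `B` starts at zero, the low-rank update contributes nothing before the first gradient step, so `y_adapted` equals `y_base`; any quality change measured later is attributable to training the adapter.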
53. 【2606.26571】Zero-shot Tweet-Level Stance Detection Enhanced by External Knowledge and Reflective Chain-of-Thought Reasoning
Link: https://arxiv.org/abs/2606.26571
Authors: Yiju Huang, Wenxian Wang, Lijun Zhou, Rui Tang, Xiao Lan, Tao Zhang, Haizhou Wang
Categories: Computation and Language (cs.CL)
Keywords: context sparsity inherent, primary challenges, mitigating the context, confronts two primary, context sparsity
Comments:
Abstract:Zero-shot tweet-level stance detection confronts two primary challenges: (1) mitigating the context sparsity inherent in short texts, and (2) establishing the relevance between implicit targets and textual content. While existing methods primarily focus on incorporating external knowledge, they neglect the intrinsic semantic cues embedded within key intra-textual entities. Furthermore, current models exhibit limited capability in determining the relevance of unseen targets to the given text, thereby struggling to differentiate between "neutral" and "irrelevant" stance labels. To address these issues, we first construct a four-class, multi-topic Japanese tweet dataset. To our knowledge, this is the first Japanese tweet-level dataset for stance detection. We then propose KIRP, a zero-shot stance detection framework. It integrates external knowledge with entity reorganization for data augmentation and employs prompt chaining for reasoning. Specifically, the framework incorporates knowledge graphs to supplement and reorganize key textual entities, while reflective Chain-of-Thought (CoT) reasoning extracts and validates implicit targets. To better distinguish "neutral" from "irrelevant" labels, we adopt stance-aware contrastive learning to capture discriminative features and design a three-layer iterative prototype network for fine-grained classification. Experimental results on SemEval-2016, WT-WT, and KIRP-D show that KIRP achieves state-of-the-art performance. KIRP obtains F1 scores of 84.05% (three-class) on SemEval-2016, and 84.99% and 79.18% (four-class) on WT-WT and KIRP-D, respectively.
54. 【2606.26566】Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models
Link: https://arxiv.org/abs/2606.26566
Authors: Abrar Alotaibi, Moataz Ahmed
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: large language models, largely disconnected tracks, diffusion-based input purification, input purification defenses, systems has matured
Comments:
Abstract:Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against vision-language models, and diffusion-based input purification defenses. Each has developed its own vocabulary, threat models, and benchmarks, with denoising diffusion models emerging as a shared generative mechanism whose recipes are now actively ported between communities. This survey performs an information-fusion exercise at the meta-research level: we integrate these four tracks into a single conceptual framework with a unified taxonomy, evaluation criteria, and research agenda, focusing on the LLM-side slice. We catalog fifty published papers across four scope areas (text/LLM, image classifier, vision-language model, defense), plus four diffusion-LLM-as-victim entries and ten non-diffusion baselines against which any new attack must be compared. We propose a six-class taxonomy of diffusion roles in adversarial pipelines, augmented by a threat-model axis recording attacker knowledge, query budget, and target accessibility, and apply a five-dimension framework (attack success rate, transferability, query budget, perplexity, defense-evasion) uniformly across modalities. The review adopts a dual attacker-defender perspective: alongside the attack catalog we cover four diffusion-based defenses that form the natural evaluation backdrop for new attacks. Our critical analysis identifies five recurring weaknesses of the current LLM-side literature, and we close with a research agenda of open questions and concrete experimental designs. The companion catalog and spreadsheet are released with the paper. We are explicit that this is a narrative review with quality assessment, not a PRISMA-compliant systematic review, and discuss the implications for replication.
55. 【2606.26560】Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
Link: https://arxiv.org/abs/2606.26560
Authors: Xiao Li, Chengruidong Zhang, Hao Luo, Xi Lin, Zekun Wang, Zihan Qiu, Yunfei Mao, Langshi Chen, Man Yuan, Minmin Sun, Huiqiang Jiang, Siqi Zhang, Rui Men, Wei Hu, Gong Cheng, Bo Zheng, Dayiheng Liu, Jingren Zhou
Categories: Computation and Language (cs.CL)
Keywords: linear attention improves, attention improves recurrent, Delta-rule linear attention, write, improves recurrent memory
Comments:
Abstract:Delta-rule linear attention improves recurrent memory updates by correcting what is already stored at the current write address before writing new content. However, the active correction is still anchored to that same write address. As a result, stale information stored at a different address cannot be actively removed before new content is written elsewhere. We propose Erase-then-Delta Attention (EDA), a memory update rule that decouples where to erase from where to write. The key insight is that recurrent memory models should not only correct the current write, but also selectively suppress outdated memory at an independently chosen address. Concretely, our method first applies a targeted erase step along a learned erase direction, and then performs the standard delta-style corrective write along the current write direction. This preserves the corrective behavior of delta-rule updates while expanding their memory-management capacity. Language-model pretraining experiments across dense 2.5B and MoE 25B-A2.8B model families show that EDA performs best in both settings. The gain persists after 80B-token long-context midtraining of the MoE models, where EDA also performs best in long-context evaluations from 4k to 128k contexts. A compact update analysis and memory-state probes suggest why: EDA keeps the delta-rule corrective write intact while allocating an additional cleanup path most strongly when passive decay is weak. These results suggest that recurrent memory models should decide not only what to write, but also what stale information to erase and where.
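Editor's note: the contrast between a plain delta-rule write and an erase-then-write update can be sketched on a tiny matrix memory. The delta rule below is the standard form `S <- S - beta * (S k - v) k^T`. The erase step `S <- S - alpha * (S e) e^T`, the scalar gates `alpha` and `beta`, and the unit-norm directions are assumptions chosen for illustration; the abstract does not give the paper's exact parameterization.

```python
def read(S, k):
    """Read from memory S (d_v x d_k) at address k: returns S @ k."""
    return [sum(s * kj for s, kj in zip(row, k)) for row in S]


def delta_write(S, k, v, beta=1.0):
    """Standard delta rule: correct what is stored at k toward v."""
    err = [p - t for p, t in zip(read(S, k), v)]
    return [[s - beta * e * kj for s, kj in zip(row, k)] for row, e in zip(S, err)]


def erase(S, e, alpha=1.0):
    """Assumed erase step: shrink whatever is stored along direction e."""
    stored = read(S, e)
    return [[s - alpha * c * ej for s, ej in zip(row, e)] for row, c in zip(S, stored)]


def erase_then_delta(S, e, k, v, alpha=1.0, beta=1.0):
    """Erase along e first, then do the usual corrective write along k."""
    return delta_write(erase(S, e, alpha), k, v, beta)


# Two orthogonal addresses and two values (toy numbers).
k1, k2 = [1.0, 0.0], [0.0, 1.0]
v1, v2 = [3.0, 4.0], [5.0, 6.0]
S = delta_write([[0.0, 0.0], [0.0, 0.0]], k1, v1)   # memory now holds v1 at k1

plain = delta_write(S, k2, v2)                      # write at k2 only
eda = erase_then_delta(S, e=k1, k=k2, v=v2)         # erase k1, then write at k2
```

In this toy case the plain delta write leaves the old value readable at `k1`, because its correction is tied to the write address `k2`. The erase-then-write variant clears `k1` while still storing `v2` at `k2`, which is the decoupling the abstract describes.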
56. 【2606.26547】Compiler-Driven Approximation Tuning for Hyperdimensional Computing
Link: https://arxiv.org/abs/2606.26547
Authors: Xavier Routh, Abdul Rafae Noor, Akash Kothari, Zheyu Li, Mahbod Afarin, Tajana Rosing, Vikram Adve
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
Keywords: Moore law reaches, accelerate machine learning, Moore law, machine learning workloads, economic limits
Comments:
Abstract:As Moore's law reaches its physical and economic limits, domain-specific approaches are increasingly employed to accelerate machine learning workloads. Hyperdimensional Computing (HDC) represents one such emerging paradigm, offering an alternative to conventional deep learning techniques. Rooted in cognitive models of computation, HDC is designed bottom-up with hardware efficiency as a first-class objective. HDC workloads map naturally to heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs, as well as emerging in-memory computing technologies such as Resistive RAM (ReRAM) and Phase-Change Memory (PCM). HDC algorithms are intrinsically tolerant to noise and approximation, enabling substantial performance gains with minimal accuracy loss. In this work, we introduce ApproxHDC, a framework for automated identification and application of domain-specific approximations in HDC workloads. ApproxHDC extends the HPVM-HDC compiler infrastructure to enable retargetable compilation across diverse hardware backends, including CPUs, GPUs, and simulated ReRAM and PCM-based accelerators. The space of possible approximations is exponentially large; ApproxHDC employs efficient search and analysis to navigate it and identify high-impact configurations spanning both software and hardware levels.
57. 【2606.26530】DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models
Link: https://arxiv.org/abs/2606.26530
Authors: Yuxuan Yang, Feiyang Li, Yile Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: predicting output grids, Reasoning Corpus, limited grid samples, require summarizing patterns, limited grid
Comments:
Abstract:The Abstraction and Reasoning Corpus (ARC; Chollet, 2019) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches have attempted to transform it into a text-based reasoning task. However, methods based on open-source models have generally yielded unsatisfactory results, while those relying on closed-source models are too costly. Current efforts mainly focus on data augmentation, constructing ARC-like data for more comprehensive supervised fine-tuning. In this work, we argue that solving ARC-like problems requires not only positive sample supervision but also the ability to improve model reasoning by distinguishing negative samples. To this end, we draw on the idea of preference alignment and propose DiARC, a method that constructs preference pairs to enable the model to distinguish between them. Specifically, we propose three ways to construct negative samples, including output-level visual transformations, DSL-level rule inversion, and task-specific rule editing. The resulting negative samples provide informative near-miss alternatives while keeping the observed demonstrations unchanged. Experimental results across multiple ARC-like benchmarks show that DiARC consistently improves performance over baseline models. The code is released at this https URL.
58. 【2606.26529】The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report
Link: https://arxiv.org/abs/2606.26529
Authors: Kwan Soo Shin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: told to find, accidents often arise, detects the hazards, model detects, model
Comments: 20 pages, 8 figures. Reproducibility deposit: [this https URL](https://doi.org/10.5281/zenodo.20826824)
Abstract:AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified. We show that conditioning a language or vision model on a narrow task suppresses its reporting of co-present, safety-critical signals it can otherwise report, a machine analogue of human inattentional blindness arising from a different mechanism. Across radiology and driving text scenarios and chest-radiograph vision tasks, suppression appeared in every model tested, did not diminish with scale, persisted in a reasoning model, and varied more by model family than by size, while the same models reported these signals at substantially higher rates when unconstrained. We name this dissociation the Inattentional Gap and argue that it decouples measured benchmark safety from real-world safety: a system can score near-perfectly on the hazards an evaluation specifies while remaining blind to those that cause harm.
59. 【2606.26522】Assessing Post-Reform Changes in Risk Disclosure Quality with a Multidimensional Text Analysis Approach
Link: https://arxiv.org/abs/2606.26522
Authors: Nobuhiro Aikawa, Mitsuo Yoshida
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Keywords: time remains challenging, provide crucial information, comprehensively evaluating, remains challenging, disclosures provide crucial
Comments: The 4th International Conference on Computational and Data Sciences in Economics and Finance (CDEF 2026)
Abstract:While corporate narrative disclosures provide crucial information to capital markets, comprehensively evaluating their qualitative changes over time remains challenging. Narrative text is inherently multidimensional, meaning that an improvement in one textual dimension often occurs alongside changes in others. To capture these underlying dynamics, we propose a longitudinal text analysis approach combining Japanese-language NLP metric extraction with paired testing, shift function analysis, and inter-metric correlation. Our framework extends prior indicator sets by incorporating a cross-section relevance indicator to measure topical alignment between risk disclosures and management strategies. Applying this approach to evaluate Japan's 2019 disclosure reforms, we analyze 19,770 firm-year observations over a 10-year period (FY2015-FY2024). The joint analysis reveals complex shifts in disclosure patterns that are frequently masked by conventional single-indicator methods. Specifically, we find that while disclosure volume increased substantially, it was accompanied by a decline in readability. Furthermore, although the overall information structure improved, specific descriptive quality stagnated, and the degree of adaptation varied across market segments.
60. 【2606.26511】Temporal Validity in Retrieval Memory: Eliminating Stale-Fact Errors for AI Agents over Evolving Knowledge
Link: https://arxiv.org/abs/2606.26511
Authors: Neeraj Yadav
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Keywords: Retrieval-augmented generation, access to accumulated, RAG, Retrieval-augmented, agents access
Comments: 21 pages, 5 tables. Code, prompts, and evaluation datasets included
Abstract:Retrieval-augmented generation (RAG) gives agents access to accumulated knowledge, but has no model of time. When a fact changes (e.g., a function is renamed or API restructured), RAG retrieves both the stale and current value with near-identical embedding similarity. The agent then either abstains or serves the superseded fact. We show this is a structural problem: on a calibrated dataset, cosine similarity distinguishes a contradicted fact from a duplicated one with AUROC 0.59 (near chance), as contradictions are often more embedding-similar to the original than rephrased duplicates. We present MemStrata, a retrieval memory maintaining temporal validity. It stores facts like RAG, preserving static recall, but when a fact's value is contradicted, a deterministic (subject, relation, object) supersession rule retires the stale value in a bi-temporal ledger - with no similarity threshold and no LLM call. Across six benchmarks run locally with a 7B model, MemStrata ties RAG on static knowledge and reaches 0.95-1.00 accuracy on evolving knowledge (where RAG reaches 0.20-0.47). The central result is the stale-fact-error rate: when required to answer, RAG serves superseded values 15-40% of the time; MemStrata drives this to ~0%, a failure class RAG cannot avoid. MemStrata achieves this at retrieval latency (~2.1s) versus ~16-18s for LLM-reranking baselines. We release the harness, datasets, and a marker-free evaluation protocol for memory under knowledge evolution.
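Editor's note: a deterministic (subject, relation, object) supersession rule over a ledger with two time axes can be sketched as follows. This is not MemStrata's implementation. The `Fact` and `Ledger` classes, the integer clock, and the assumption that each (subject, relation) pair holds a single current object are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int           # when the fact became true in the world
    valid_to: Optional[int]   # None while the fact is still current
    recorded_at: int          # when the ledger learned about it


class Ledger:
    """Toy ledger: stale values are retired, never deleted."""

    def __init__(self) -> None:
        self.rows: List[Fact] = []
        self.clock = 0  # transaction-time counter (assumed integer ticks)

    def assert_fact(self, subject: str, relation: str, obj: str, valid_from: int) -> Fact:
        self.clock += 1
        for row in self.rows:
            same_key = (row.subject, row.relation) == (subject, relation)
            if same_key and row.valid_to is None:
                if row.obj == obj:
                    return row              # duplicate: nothing to supersede
                row.valid_to = valid_from   # contradiction: retire the stale value
        fact = Fact(subject, relation, obj, valid_from, None, self.clock)
        self.rows.append(fact)
        return fact

    def current(self, subject: str, relation: str) -> Optional[str]:
        for row in self.rows:
            if (row.subject, row.relation) == (subject, relation) and row.valid_to is None:
                return row.obj
        return None

    def as_of(self, subject: str, relation: str, t: int) -> Optional[str]:
        for row in self.rows:
            if (row.subject, row.relation) != (subject, relation):
                continue
            if row.valid_from <= t and (row.valid_to is None or t < row.valid_to):
                return row.obj
        return None


ledger = Ledger()
ledger.assert_fact("utils.parse", "renamed_to", "parse_v1", valid_from=1)
ledger.assert_fact("utils.parse", "renamed_to", "parse_v2", valid_from=5)
```

The rule compares exact keys rather than embeddings, so it needs no similarity threshold and no model call, which matches the property the abstract emphasizes. Queries about the present return only `parse_v2`, while `as_of` can still recover the older value for historical questions.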
61. 【2606.26502】Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation
Link: https://arxiv.org/abs/2606.26502
Authors: Han-yu Wang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large reasoning models, longer on harder, Large reasoning, harder problems, LRM
Comments:
Abstract:Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same problem right; humans do the reverse, spending less time on the trials they get wrong. We separate two levels of deliberation: how response time tracks difficulty across items (registration), and, with item identity held fixed, whether an agent spends more on its own failures or successes (allocation). On a public matched human-LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment (registration) but diverge within items (allocation): every LRM shows a large wrong-vs-right effect (Cohen's d = 1.47-3.13 on H-ARC) while humans show the opposite sign. The comparison stays inside each agent's own scale; we never put seconds and tokens on one axis. The dissociation holds under item fixed effects, replicates across datasets, and is absent in a non-thinking baseline. We read the human pattern as engagement versus abandonment: people stay on items they expect to solve and give up on the rest. We read the LRM pattern as length driven by uncertainty: chains grow when the model is unsure, which is exactly when it tends to fail. Both policies produce the same cross-item correlation with difficulty, so they look aligned on the measure prior work has used; the divergence shows up only once item identity is fixed. Under resource-rational metareasoning, the split is between two stopping policies that share a difficulty signal but implement opposite control; trace length captures the signal and misses the control.
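Editor's note: the distinction between a cross-item comparison and a within-item comparison can be made concrete with Cohen's d. The sketch below uses the pooled-standard-deviation form of d and centers each trial on its item mean to hold item identity fixed. The abstract does not state the paper's exact estimator, so the centering step, the function names, and the toy token counts are assumptions.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def center_by_item(trials):
    """Subtract each item's mean cost so only within-item variation remains."""
    by_item = defaultdict(list)
    for item, _, cost in trials:
        by_item[item].append(cost)
    item_mean = {item: mean(costs) for item, costs in by_item.items()}
    return [(item, ok, cost - item_mean[item]) for item, ok, cost in trials]


def wrong_vs_right_d(trials):
    """Effect size of cost on wrong trials versus right trials."""
    wrong = [cost for _, ok, cost in trials if not ok]
    right = [cost for _, ok, cost in trials if ok]
    return cohens_d(wrong, right)


# Made-up trials: (item id, answered correctly, tokens spent).
trials = [
    ("easy", False, 120.0), ("easy", True, 80.0),
    ("hard", False, 1060.0), ("hard", True, 1000.0),
]
raw_d = wrong_vs_right_d(trials)
within_d = wrong_vs_right_d(center_by_item(trials))
```

In this toy data the raw effect is small because the difference between items dominates the variance. Once each trial is centered on its item, the wrong-versus-right gap becomes large, which illustrates why the within-item view can diverge from the cross-item one.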
62. 【2606.26493】Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context
Link: https://arxiv.org/abs/2606.26493
Authors: Fitsum Reda, John Kamalu, Roger Waleffe, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Categories: Computation and Language (cs.CL)
Keywords: language models offer, offer a promising, promising alternative, potential for parallel, Diffusion language models
Comments: Code and model weights available at [this https URL](https://huggingface.co/collections/nvidia/nemotron-twotower)
Abstract:Diffusion language models offer a promising alternative to autoregressive models due to their potential for parallel and iterative generation. However, existing approaches use a single network for both context representation and iterative denoising, forcing one model to serve both roles and limiting its capacity for either role. We propose TwoTower, a block-wise autoregressive diffusion model that decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1T tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's quality while offering 2.42X higher wall-clock generation throughput. We release the code and model weights at this https URL.
63. 【2606.26489】Comparing BERT Sentence-Pair Classification and Few-Shot LLM Prompting for Detecting Threat and Solution Framing in German Climate News
Link: https://arxiv.org/abs/2606.26489
Authors: Raven Adam, David Maier, Marie Kogler
Categories: Computation and Language (cs.CL)
Keywords: shaping public perceptions, coverage emphasizes threats, policy support, media play, play a central
Comments: 15 pages
Abstract:News media play a central role in shaping public perceptions of climate change, and whether coverage emphasizes threats or solutions has measurable effects on audience engagement and policy support. Automated detection of these framing patterns at the sentence level would allow researchers to analyze large corpora that are infeasible to code manually. We present a systematic comparison of two approaches for classifying sentences from German-language climate news articles as threat-oriented, solution-oriented, both, or neither. The first approach uses few-shot prompting with an open-weights large language model (Llama 4 Maverick), employing chain-of-thought reasoning and structured output with confidence scoring. The second approach fine-tunes a German BERT model (deepset/gbert-large) for sentence-pair classification, where the preceding sentence provides contextual information for the target sentence. Both approaches implement two independent binary classifiers, one for threat framing and one for solution framing. We evaluate both methods on a corpus of 440 Austrian newspaper articles that were manually coded following a detailed coding scheme developed with domain experts. The fine-tuned BERT classifiers achieve an F1 score of 0.83 for both the threat and solution tasks, while the LLM-based classifiers reach an F1 of 0.78. An ablation study confirms that providing the preceding sentence as context improves BERT classification performance substantially compared to single-sentence input. These results contribute to the growing body of work comparing fine-tuned encoder models with prompted generative models for text classification in computational social science.
64. 【2606.26487】Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting
Link: https://arxiv.org/abs/2606.26487
Authors: Defu Cao, Zijie Lei, Muyan Weng, Jiao Sun, Yan Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: heterogeneous textual signals, Large language models, context-aware time series, integrate heterogeneous textual, Large language
Comments: Camera Ready version of IJCAI 2026
Abstract:Large language models (LLMs) are attractive for context-aware time series forecasting because they can integrate heterogeneous textual signals, yet their discrete, language-oriented tokenization and embedding interfaces are misaligned with continuous numerical values, often harming numerical ordering and forecasting reliability. We propose TempoWave, a plug-and-play temporal wavelet digit interface that maps each scalar observation into digit-wise embeddings constructed from multi-wavelet, multi-scale coefficients. By directly overriding standard token representations, TempoWave seamlessly exposes both fine-grained local fluctuations and macro global structures in a transformer-compatible form, ensuring that precise numerical formatting, distinct digit identity, and robustness to common normalization operations are maintained throughout the LLM pipeline. Experiments across five context-enriched forecasting benchmarks demonstrate that TempoWave consistently improves LLM-based forecasters over standard numeric tokenization and alternative embedding interfaces, achieving a new state-of-the-art. These results highlight the numeric interface as a key bottleneck and suggest that principled multi-resolution embeddings can better couple LLMs' contextual reasoning with precise forecasting. Our code is available at this https URL and our model can be accessed at this https URL.
65. 【2606.26485】Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs
Link: https://arxiv.org/abs/2606.26485
Authors: Xinyi Yan, Yingyi Zhang, Chengzhi Zhang
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: Microblogging platforms generate, making automatic keyphrase, dispersed user content, automatic keyphrase extraction, platforms generate massive
Comments:
Abstract:Microblogging platforms generate massive amounts of short, noisy, and dispersed user content, making automatic keyphrase extraction (AKE) an important but challenging task. Prior studies have used eye-tracking signals to improve microblog-based AKE because such signals reflect readers' attention to salient words. However, eye tracking alone is limited by physiological, acquisition, and feature-decoding constraints. To address this issue, we investigate whether electroencephalogram (EEG) signals can complement eye-tracking signals for AKE. Using the ZuCo cognitive language processing corpus, we select 8 EEG features and 17 eye-tracking features and incorporate them into microblog-based AKE models. To reduce possible distortion of cognitive signals by model structures, we inject these features into the input of the soft-attention layer and the query vectors of the self-attention layer. We then evaluate different combinations of cognitive signals across AKE models. The results show that cognitive signals produced during reading consistently improve AKE performance, regardless of feature combinations and model architectures. EEG features bring the largest gains, while combining EEG and eye-tracking features yields performance between the two individual signal types, suggesting partial complementarity but also possible redundancy or noise. These findings indicate that EEG signals provide useful cognitive evidence for microblog-based AKE and that multimodal cognitive signals deserve further investigation.
66. 【2606.26481】Extracting Problem and Method Sentence from Scientific Papers: A Context-enhanced Transformer Using Formulaic Expression Desensitization
Link: https://arxiv.org/abs/2606.26481
Authors: Yingyi Zhang, Chengzhi Zhang
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Keywords: identify essential parts, massive text, scientific papers, identify essential, essential parts
Comments:
Abstract:Billions of scientific papers lead to the need to identify essential parts from the massive text. Scientific research is an activity from putting forward problems to using methods. To learn the main idea from scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data and reduce models' reliance on FEs. For the third idea, we propose a context-enhanced transformer that utilizes context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments using large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F1 score compared to the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM based ICL methods are found to be not suitable for the task of problem and method extraction.
67. 【2606.26479】Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents
Link: https://arxiv.org/abs/2606.26479
Authors: Praneeth Narisetty, Shiva Nagendra Babu Kore, Uday Kumar Reddy Kattamanchi, Jayaram Kumarapu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: defending tool-using LLM, refuse malicious instructions, tool-using LLM agents, Recent work, tool-using LLM
Comments: 12 pages, 5 figures, 4 tables
Abstract:Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model with a deterministic policy that mediates the agent's actions. Systems such as CaMeL, FIDES, Progent, RTBAS, and FORGE realize this with capabilities, information-flow labels, and reference monitors, and several report near-elimination of attacks on the AgentDojo benchmark. We make two contributions. First, we organize these out-of-band defenses as instances of classical integrity protection (Biba), reference monitoring, and least privilege, yielding a structured comparison of what they do and do not cover. Second, we warn that every one of them is validated only on static benchmarks (a fixed set of injection attempts), the same methodology that made in-band defenses look strong until adaptive, defense-aware attacks broke twelve of them at over 90% success; we specify the threat model and protocol an adaptive evaluation requires. We then run that protocol as an independent reproduction and extension of Progent's own adaptive-attack analysis, on AgentDojo, with an open-weight agent (Qwen2.5-7B) self-hosted on a single H200, a setting its authors did not test. Averaged over three runs, the defense held: Progent cut mean attack success roughly sixfold (25.8% to 4.2%), and a hand-crafted adaptive attack did not raise it (2.6%). This is one small-scale data point on a weak model with a single black-box attack template; a stronger optimized (white-box GCG) attack remains open. The result is consistent with, but does not establish, the hypothesis that deterministic out-of-band enforcement is a harder target for an adaptive attacker than in-band detection.
68. 【2606.26472】Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Link: https://arxiv.org/abs/2606.26472
Authors: Steven Kolawole, Virginia Smith
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: models emit chains, reasoning models emit, deployment bottleneck, long reasoning traces, emit chains
Comments: Preprint; in review
Abstract:As reasoning models emit chains of thought tens of thousands of tokens long, KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance proxy in long reasoning traces, and prohibits the use of fused kernels in production inference by forcing the model to materialize the attention matrix. In this work, we instead score tokens with a metric we term the epiphany score: the change in the model's internal representation, read directly from the forward pass with no attention matrix and negligible extra state. Our resulting cache eviction method, EpiKV, requires no training, classifier, or custom kernel, and can be used directly in FlashAttention inference stacks unchanged -- scaling to a 16x longer feasible context than attention-based scoring. [...] upper-mid layers negatively) and remove a positional trend with a causal rolling z-score. At a 4096-token cache EpiKV reaches 72% on MATH-500, matching the strongest attention-based baseline (ThinKV 71%, H2O 67%); a lag-normalized KV variant reaches 37% on AIME-2024 at 8192 tokens against the best of them (33%), at up to 2.8x the speed.
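Editor's note: the abstract mentions removing a positional trend with a causal rolling z-score. A minimal sketch of that normalization follows. It is not EpiKV's code: the window length, the choice to include the current value in the window, and the zero fallback for flat windows are assumptions.

```python
from statistics import mean, pstdev


def causal_rolling_zscore(scores, window, eps=1e-8):
    """Normalize each score by the mean and std of a trailing window.

    Only positions <= t are used for position t, so the result can be
    computed online during decoding without looking at future tokens.
    """
    out = []
    for t in range(len(scores)):
        trailing = scores[max(0, t - window + 1): t + 1]
        mu, sigma = mean(trailing), pstdev(trailing)
        out.append((scores[t] - mu) / sigma if sigma > eps else 0.0)
    return out


# A pure upward ramp stands in for a positional trend (toy data).
ramp = [float(t) for t in range(6)]
z = causal_rolling_zscore(ramp, window=3)
flat = causal_rolling_zscore([2.0, 2.0, 2.0, 2.0], window=3)
```

On the ramp, every position after the warm-up receives the same normalized value, so a steady drift with position no longer favors late tokens. A token would need to stand out from its recent neighbors to score highly.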
69. 【2606.26466】Soft Token Alignment for Cross-Lingual Reasoning
Link: https://arxiv.org/abs/2606.26466
Authors: Jiayi He, Jungsoo Park, Wei Xu, Alan Ritter
Subjects: Computation and Language (cs.CL)
Keywords: produce inconsistent reasoning, produce inconsistent, semantically equivalent prompts, large language models, equivalent prompts
Comments:
Click to view abstract
Abstract:Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively language-agnostic, but generation becomes increasingly language-specific as models commit to discrete output tokens. This is problematic because language-specific lexical choices can cause semantically equivalent reasoning paths to diverge across languages. These divergences motivate searching for a cross-lingual alignment signal that is less tied to any single vocabulary item or script. We propose SOLAR, an auxiliary objective for supervised fine-tuning that aligns soft-token representations across languages, using English as a pivot. Soft tokens are probability-weighted mixtures over the vocabulary embeddings, yielding continuous representations that can aggregate information from semantically related tokens across languages. We then align each non-English soft-token summary to its English counterpart in the shared embedding space. Across four multilingual reasoning benchmarks, SOLAR improves accuracy by up to +17.7 points over the base model and +3.8 over standard supervised fine-tuning, with the largest gains on low-resource languages. SOLAR also strengthens final-layer cross-lingual similarity and substantially reduces language-cluster separability, suggesting that aligning soft-token representations helps preserve shared semantic structure during multilingual reasoning.
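The soft-token idea in this abstract (a probability-weighted mixture over vocabulary embeddings, aligned to an English counterpart) can be illustrated with a small sketch. This is not the SOLAR implementation: the toy vocabulary `EMB`, the logits, and the use of `1 - cosine similarity` as the alignment penalty are all assumptions made for illustration, since the abstract does not state the exact loss.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_token(logits, embeddings):
    """Probability-weighted mixture of vocabulary embeddings."""
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
EMB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical next-token logits for an English prompt and its translation.
english = soft_token([2.0, 0.0, 0.0], EMB)
other = soft_token([0.0, 0.0, 2.0], EMB)

# Assumed alignment penalty: pull the non-English soft token toward English.
alignment_loss = 1.0 - cosine(english, other)
```

Because the mixture is continuous, two languages that spread probability over different but related tokens can still land close together in embedding space, which is the property the abstract relies on.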
70. 【2606.26452】AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification
Link: https://arxiv.org/abs/2606.26452
Authors: Sourav Ghosh, Yash Bhatia, Keshav Goyal, Sahil Singh Bagri, Mohamed Akram Ulla Shariff, Saravana Balaji Shanmugam
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Keywords: minimize privacy concerns, devices like smartphones, minimize privacy, privacy concerns, concerns and inference
Comments: Accepted at Interspeech 2026
Click to view abstract
Abstract:To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but deploying multiple specialized models creates a memory footprint challenge. We investigate: Can a single lightweight architecture solve multiple Speech-Adjacent (SA) classification tasks through reduction to a nuanced text similarity formulation? We propose AnySimLite, a lightweight similarity encoder that combines word-level and character-level channels. Together with a dataset transformation strategy, we evaluate AnySimLite across multiple SA classification tasks and show that it consistently achieves state-of-the-art (SOTA) or SOTA-competitive performance in few-shot settings while maintaining a low memory footprint. Even in the worst case, the performance drop remains below 7% while using $\frac{1}{250}^{\mathrm{th}}$ of the model size of the SOTA qLLaMA_LoRA-7B baseline.
71. 【2606.26449】ProvenAI: Provenance-Native Traces of Evidence in Generated Answers
Link: https://arxiv.org/abs/2606.26449
Authors: Mohammad Faizan, Dalal Alharthi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Keywords: Retrieval-augmented systems routinely, routinely present citations, source meaningfully shaped, systems routinely present, present citations alongside
Comments:
Click to view abstract
Abstract:Retrieval-augmented systems routinely present citations alongside generated answers, yet a citation does not confirm that the corresponding source meaningfully shaped the output. This paper introduces ProvenAI, a framework that decomposes transparency in multi-hop question answering into three independently measurable layers: answer correctness, citation fidelity against benchmark supporting evidence, and per-document influence under leave-one-resource-out intervention. Targeting the HotpotQA distractor benchmark through a seven-stage pipeline covering data normalisation, retrieval indexing, citation-aware answer generation, attribution auditing, ablation-based influence estimation, batch evaluation, and interactive inspection, ProvenAI evaluates 7,405 validation examples drawn from a canonical corpus of 509,300 passages. The system achieves 53.53% answer accuracy alongside a mean citation-fidelity score of 71.55%, and a worked example surfaces what we call the citation-influence gap: a clean citation audit co-occurring with a profile in which one cited source registers only weak influence while seven uncited sources demonstrably shift the output. We formalise the relationship between the implemented surface proxy and a token-level KL-divergence target through a stated faithfulness condition, ground the framework in causal-mediation analysis and database-provenance theory, and discuss how the three measurement layers compose with cryptographic provenance architectures emerging in autonomous scientific discovery. ProvenAI establishes that meaningful transparency in retrieval-grounded QA requires traceable links across retrieved, cited, and behaviourally influential evidence as three distinct, independently measured layers.
72. 【2606.26437】ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
Link: https://arxiv.org/abs/2606.26437
Authors: Siyi Liu, Aaron Halfaker, Dan Roth, Patrick Xia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: contradicting evidence coexist, grounding documents, Existing metrics, factuality and faithfulness, answer is supported
Comments:
Click to view abstract
Abstract:Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model's response acknowledges conflicting evidence in its grounding documents. Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric. Experiments show that ConflictScore effectively detects overconfident claims across domains and can serve as a corrective feedback mechanism that improves truthfulness on TruthfulQA.
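The ConflictScore-Count measure described above (the proportion of claims exhibiting conflicts) can be sketched as follows. This is an illustrative reading, not the authors' code: the label names and the rule that a claim "exhibits a conflict" when at least one document supports it and at least one contradicts it are assumptions. The abstract does not give the formula for ConflictScore-Ratio, so it is not sketched here.

```python
# Hypothetical per-document labels for a claim.
SUPPORT, CONTRADICT, NEUTRAL = "support", "contradict", "neutral"

def has_conflict(labels):
    """Assumed rule: a claim is conflicted when supporting and
    contradicting evidence coexist across its grounding documents."""
    return SUPPORT in labels and CONTRADICT in labels

def conflict_score_count(claim_labels):
    """Fraction of atomic claims that exhibit a conflict."""
    if not claim_labels:
        return 0.0
    conflicted = sum(1 for labels in claim_labels if has_conflict(labels))
    return conflicted / len(claim_labels)

# Four hypothetical atomic claims, each labeled against three documents.
claim_labels = [
    [SUPPORT, SUPPORT, NEUTRAL],        # uniformly supported
    [SUPPORT, CONTRADICT, NEUTRAL],     # conflicted
    [NEUTRAL, CONTRADICT, CONTRADICT],  # uniformly contradicted
    [CONTRADICT, SUPPORT, SUPPORT],     # conflicted
]
cs_c = conflict_score_count(claim_labels)
```

Here two of the four claims have mixed evidence, so `cs_c` evaluates to 0.5.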
73. 【2606.26429】DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
Link: https://arxiv.org/abs/2606.26429
Authors: Aaron J. Li, Hao Huang, Youngmin Park, Yitong Ma, Wei-Lin Chiang, Li Chen, Cho-Jui Hsieh, Bin Yu, Ion Stoica
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Current LLM evaluation, objective correctness labels, Current LLM, open-ended user interactions, LLM evaluation relies
Comments:
Click to view abstract
Abstract:Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framework that represents models and evaluation items in a shared space, jointly estimating model ability together with item difficulty and sharpness. We apply DualEval across four domains: coding, math, miscellaneous domain-knowledge tasks, and generic everyday user queries. Our evaluation uses 18 frontier LLMs, static benchmark labels, and reward-model scores validated against held-out human preferences for open-ended model responses. Empirically, our framework produces reliable and balanced model rankings, and its learned item-level profiles support downstream applications such as benchmark compression for sample-efficient evaluation and anomaly detection for contamination or outlier analysis. Overall, DualEval unifies static and arena-style evaluation through joint model-item calibration, producing model rankings and item-level diagnostics that support more sample-efficient, interpretable, and auditable evaluation pipelines.
74. 【2606.26403】ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent
Link: https://arxiv.org/abs/2606.26403
Authors: Sriram Selvam, Anneswa Ghosh
Subjects: Computation and Language (cs.CL)
Keywords: Foundation-model research increasingly, personal histories, longitudinal updates, Real user data, research increasingly
Comments:
Click to view abstract
Abstract:Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or redistribute responsibly, while independently generated fake fields rarely preserve the cross-field and temporal consistency needed for controlled evaluation. We present PROFILEFOUNDRY, a deterministic generator and fixed reference release of 100,000 adult synthetic Person Objects across eight locales. Each object combines a typed current snapshot, household, family, and employer links, snapshot-aligned events, normalized relational views, and generation provenance. The release contains 709,228 events, 40,338 households, 52,491 employers, and 518,564 directed relationship edges. We report evidence in separate categories: selected population-marginal comparisons, per-object invariant checks, release-wide referential and temporal closure, and coincidence/provenance screens. PROFILEFOUNDRY is not a population-fidelity model, a rendered-text corpus, or a formal privacy mechanism. Instead, it is a responsible synthetic source layer for constructing downstream foundation-model evaluations involving memory, privacy, document understanding, record linkage, and agent state while keeping the synthetic person behind each artifact inspectable.
75. 【2606.26387】Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs
Link: https://arxiv.org/abs/2606.26387
Authors: Xi Xiao, Chen Liu, Chih-Ting Liao, Yunbei Zhang, Qizhen Lan, Yuxiang Wei, Lin Zhao, Janet Wang, Jianyang Gu, Muchao Ye, Tianyang Wang, Hao Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Multimodal large language, extend large language, Multimodal large, large language models, enabling joint reasoning
Comments: ECCV 2026
Click to view abstract
Abstract:Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual inputs. Mechanistic studies indicate that this weakness stems from visual laziness: MLLMs encode the correct visual evidence internally, but overly rely on strong language priors during response. Existing alignment methods, such as direct preference optimization, primarily optimize outcome-level rewards based on text. This introduces an optimization bias toward linguistic shortcuts, leading to responses that often contradict the visual evidence. To address this, we propose Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the focus from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. We achieve this by penalizing "blind confidence" instances where the model remains improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Extensive experiments show that VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks without compromising text-only capabilities. Our approach matches the full-data performance of state-of-the-art methods using only 25% of the preference data and even demonstrates emergent spatial grounding capabilities without explicit bounding box supervision.
76. 【2606.26382】Charting the Growth of Social-Physical HRI (spHRI): A Systematic Review Pipeline Augmented by Small Language Models
Link: https://arxiv.org/abs/2606.26382
Authors: Mayumi Mohan, Ju-Hung Chen, Alexis E. Block
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Keywords: Social-physical human-robot interaction, human-robot interaction, Social-physical human-robot, human-computer interaction, interaction
Comments: 5 pages, 3 figures, 2 tables, Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
Click to view abstract
Abstract:Social-physical human-robot interaction (spHRI) has grown rapidly across robotics, human-computer interaction, human-robot interaction, and haptics. Yet, fragmented terminology and inconsistent methodologies make systematic synthesis difficult. To support scalable review practices, we evaluated the extent to which small language models (SLMs; 1.5B parameters) can assist with title and abstract screening for a large spHRI systematic review. While no SLMs matched human reviewers' performance, the models operated locally and screened papers orders of magnitude faster. The combined SLM ensemble identified 39 papers reviewers missed, representing 10.29% of the final relevant dataset. These results demonstrate that SLMs can augment, rather than replace, expert reviewers and make large-scale literature reviews accessible and sustainable.
77. 【2606.26366】Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models
Link: https://arxiv.org/abs/2606.26366
Authors: Patrick Cooper, Alvaro Velasquez
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: moral dilemmas exhibits, failure modes, moral dilemmas, dilemmas exhibits, exhibits two failure
Comments: 24 pages, 8 figures, 16 tables. To appear at ACL 2026 (submitted via ARR)
Click to view abstract
Abstract:Standard chain-of-thought on moral dilemmas exhibits two failure modes: stakeholder collapse (the trace names at most one party with a stake in the outcome) and uncertainty suppression (no explicit unknowns or hedges before committing to an action). We introduce narration-of-thought (NoT), a system prompt that structures chain-of-thought into five sections: protagonist, stakeholders, two-step consequences, uncertainty, then commitment. NoT adds no training, parameters, or fine-tuning. On 100 DailyDilemmas scenarios across four generators from three vendors, NoT cuts stakeholder collapse from up to 31% to under 1% and uncertainty suppression from up to 72% to 1-24% on every model. A matched-budget verbose-CoT control rules out token spend as the active ingredient; NoT retains Cliff's delta advantages of +0.79 to +0.90 on stakeholder count and +0.65 to +0.93 on uncertainty score for three of four generators, and a section ablation attributes each shift to its specific sub-instruction. Textual-gradient descent initialised at NoT improves the scaffold further; a cross-family training judge (different vendor from the generator) dominates an in-family one on every measured axis. Extended to a five-round multi-stakeholder debate protocol, the scaffold converts a 6% standoff into 95% full consensus on a calibration set and 100% combined convergence on a DailyDilemmas replication. The resulting traces externalise the stakeholders, consequences, and uncertainty grounding each commitment, providing an auditable substrate for dependable agentic deployment.
78. 【2606.26360】Phonetic and semantic analyses of spoken corpora of Beijing and Taiwan Mandarin indicate that the neutral tone is a lexical tone
Link: https://arxiv.org/abs/2606.26360
Authors: Yuxin Lu, Zhexuan Li, R. Harald Baayen
Subjects: Computation and Language (cs.CL)
Keywords: Mandarin Chinese, neutral tone, tone, Mandarin, Beijing Mandarin
Comments:
Click to view abstract
Abstract:The neutral, or floating, tone of Mandarin Chinese is a tone with an enigmatic set of properties. It has been described as a reduced tone, or as a tone that sometimes is lexically fixed but that can also be toneless. In two-syllable words, it is found only on the second syllable, but single-syllable words can also have the neutral tone. We present a corpus-based study of the phonetic realization of the neutral tone in spontaneous conversational speech corpora of Beijing Mandarin and Taiwan Mandarin. We show that the neutral tone has its own tonal target, just as the four lexical tones of Mandarin. We also show that disyllabic words with a neutral tone have pitch contours that have a pitch component that depends on the tone on the first syllable, just as has been observed for two-syllable words with a lexical tone on the second syllable (Chuang et al., 2026). Furthermore, words with a floating tone have word-specific pitch signatures, which have also been documented for single-syllable words (Jin et al., 2026) as well as two-syllable words (Lu et al., 2026b). These word-specific pitch signatures are shown to be predictable to some extent from words' contextualized embeddings, as previously reported for lexical tones (Chuang et al., 2026; Lu et al., 2026b). As there is also considerable variability in the realization of lexical tones, we propose that the neutral tone is, in fact, a lexical tone in both Taiwan Mandarin and Beijing Mandarin. We document both similarities and differences in the realization of the floating tone in these two varieties and provide evidence, using contextualized embeddings, that some of the observed differences may arise from differences in the meanings of the words as used in the two corpora.
79. 【2606.26344】Axon: A Synthesizing Superoptimizer for Tensor Programs
Link: https://arxiv.org/abs/2606.26344
Authors: Akash Kothari, Shaowei Zhu, Daniel Kroening, Chungha Sung
Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
Keywords: Writing high performance, requires deep expertise, accelerators requires deep, high performance kernels, Writing high
Comments:
Click to view abstract
Abstract:Writing high-performance kernels for AI accelerators requires deep expertise in tiling, instruction selection, data layout, and operator fusion, placing a significant burden on programmers. In this paper, we focus on tile-based AI accelerator programs and present Axon, a synthesizing superoptimizer for tensor programs: it uses program synthesis to automatically generate target instructions from semantics specifications, and explores semantically equivalent program variants to select the best-performing kernel empirically. Axon discovers algebraic transformations by propagating operators through computation graphs and uses SMT over unbounded tensors to guarantee that all transformations preserve semantics without requiring hand-crafted rewrite rules. It then lowers tensor operations to target ISA instructions, explores tiling configurations constrained by hardware descriptions, and fuses operators and instructions to minimize memory traffic.
80. 【2606.26300】The Verification Horizon: No Silver Bullet for Coding Agent Rewards
Link: https://arxiv.org/abs/2606.26300
Authors: Binghai Wang, Chenlong Zhang, Dayiheng Liu, Jiajun Zhang, Jiawei Chen, Mouxiang Chen, Rongyao Fang, Siyuan Zhang, Xuwu Wang, Yuheng Jing, Zeyao Ma, Zeyu Cui
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: classical intuition holds, easier than producing, solution is easier, classical intuition, intuition holds
Comments: Authors are listed alphabetically by their first names
Click to view abstract
Abstract:A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering harnesses grow more sophisticated, generating complex candidate solutions is no longer difficult -- reliably verifying them has become the harder problem. Every verifier we can build is only a proxy for human intent, never the intent itself. This makes verification subject to a twofold difficulty: first, intent is underspecified by nature, making it inherently hard to faithfully check whether it has been fulfilled; second, during model training, optimization widens the gap between proxy and intent -- manifesting as reward hacking or signal saturation. To address this, we characterize the quality of verification signals along three dimensions -- scalability, faithfulness, and robustness -- and argue that achieving all three simultaneously is the central challenge. We further study four reward constructions: a test verifier for general coding tasks, a rubric verifier for frontend tasks, the user as verifier for real-world agent tasks, and an automated agent verifier for long-horizon tasks. Across different task types and policy capability levels, we conduct in-depth analysis and experiments on the core challenges of reward design and how to more effectively leverage reward signals. Experiments show that targeted verification design can effectively suppress reward hacking, improve task completion quality, and achieve significant gains across multiple internal and public benchmarks. These experiences collectively point to a core observation: no fixed reward function can remain effective as policy capability continues to grow; and verification must co-evolve with the generator.
81. 【2606.26277】From Clicks to Intent: Cross-Platform Session Embeddings with LLM-Distilled Taxonomy for Financial Services Recommendations
Link: https://arxiv.org/abs/2606.26277
Authors: Dianjing Fan, Yao Li, Kyaw Hpone Myint, Dwipam Katariya, Alexandre G.R. Day, Pranab Mohanty, Giri Iyengar
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Sequential user behavior, experiences differ drastically, in-app experiences differ, significant gaps remain, user behavior modeling
Comments: Dianjing Fan and Yao Li equally contributed to this work. 7 pages, 1 figure
Click to view abstract
Abstract:Sequential user behavior modeling is widely adopted in industrial recommender systems; however, significant gaps remain in financial services, where pre-login web interactions and authenticated in-app experiences differ drastically. Specifically, pre-login web users typically explore new products, whereas logged-in app users focus on account servicing. Due to the challenge of cross-channel entity resolution (e.g., matching anonymous web sessions to authenticated mobile accounts), web-based intent signals remain underutilized for post-authentication personalization. Existing methods for capturing web-based intent are often ad-hoc and narrow, lacking the flexibility to support both quantitative downstream recommendations and qualitative understanding at scale. In this work, we propose a scalable and dual-purpose intent prediction framework for web-based interactions and demonstrate its applicability for personalization. Our approach transforms raw web clickstreams into two outputs: a self-supervised Transformer encodes multi-modal clickstreams into a compact session embedding, while an LLM-based taxonomy generation and distillation pipeline produces interpretable intent labels. Our system demonstrates that self-supervised clickstream representations combined with LLM-distilled taxonomies can jointly serve quantitative tasks and qualitative understanding in production: on the mobile homepage tile ranking task, the session embedding improves macro Recall@1 by 1.88% and reduces Log Loss by 13.38% over production baselines. On the user conversion prediction task, the embedding outperforms the LLM labels by 4.3% on micro F1, while the distillation layer delivers interpretable labels at ultra-low latency with only a 7% performance drop.
82. 【2606.26196】From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
Link: https://arxiv.org/abs/2606.26196
Authors: Haoxiang Sun, Tao Wang, Li Yuan, Jian Zhao, Jiancheng Lv
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: Large Language Models, Multimodal Large Language, recently made remarkable, made remarkable progress, DeepSeek R-series
Comments:
Click to view abstract
Abstract:Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective -- one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we (1) formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, (2) introduce a five-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and (3) identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence (AGI).
83. 【2606.26144】Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech
Link: https://arxiv.org/abs/2606.26144
Authors: Samip Neupane, Sandesh Pokhrel, Sandesh Pyakurel, Basanta Joshi
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: multilingual information retrieval, accessibility tools, task of determining, meeting transcription, information retrieval
Comments: 12 pages, 7 tables
Click to view abstract
Abstract:Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While end-to-end neural diarization systems have achieved strong performance for English and other high-resource languages, their effectiveness degrades substantially for underrepresented languages where annotated speech data is scarce. This paper investigates speaker diarization for low-resource Nepali-Hindi speech through a multilingual training approach, comparing two modern architectures: EEND with encoder-decoder attractors (EEND-EDA) and EEND with Perceiver-based attractors (DiaPer). Both models are trained on a multilingual corpus combining English speech from LibriSpeech, diverse speaker recordings from VoxCeleb, and separately collected Nepali and Hindi audio, a setup designed to reduce language bias and encourage cross-lingual generalization. We evaluate both models across 2-speaker, 3-speaker, 4-speaker, and mixed-speaker scenarios on LibriSpeech, VoxCeleb, and Nepali-Hindi (NeHi) test sets. DiaPer achieves stronger overall performance than EEND-EDA, particularly in more challenging multi-speaker conditions, obtaining DERs of 3.28%, 2.02%, 4.05%, and 4.76% on NeHi 2-speaker, 3-speaker, 4-speaker, and mixed-speaker settings, respectively, compared to 1.50%, 9.68%, 16.17%, and 11.19% for EEND-EDA. These results demonstrate the viability of Perceiver-based end-to-end neural diarization for low-resource multilingual speech processing.
84. 【2606.26130】Thinking Like a Scientist? A Structural Study of LLM-Generated Research Methods
Link: https://arxiv.org/abs/2606.26130
Authors: Francesca Carlon, Brecht Verbeken, Vincent Ginis, Andres Algaba
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Keywords: Large Language Models, Large Language, prompting remain unclear, minimal prompting remain, guide research methodology
Comments: 46 pages, 13 figures, 18 tables
Click to view abstract
Abstract:Large Language Models (LLMs) are increasingly used to guide research methodology, yet their default methodological tendencies under minimal prompting remain unclear. Here, we prompt GPT-5.1, Gemini 3 Pro, and DeepSeek-V3.2 with an LLM-extracted research question from each of 1,000 recent arXiv computer-science papers and compare the resulting methodology suggestions against a paper-derived experimental inventory. Since we provide only the research question, the differences we measure reflect initial suggestions and not how optimal those suggestions are. We extract structured method features from both sources, map them into a shared taxonomy, and quantify divergence across multiple taxonomy dimensions including model provider, dataset task type, and evaluation metric type. The strongest imbalance appears in provider choice, with Jensen-Shannon divergence about 3-5x larger than any other taxonomy dimension. Other/Academic single-occurrence models are underrepresented by 23-24 percentage points, while reused academic/community models are slightly overrepresented (4-6pp). LLMs also suggest a much narrower range of methods overall: the effective number of model entities contracts from 1,232 to 59-96, and inter-LLM rank correlations (0.55-0.68) generally exceed LLM-to-paper correlations (0.33-0.56), so the distortions are largely shared across models. Popularity baselines, BM25 retrieval calibration, and paper-level similarity tests confirm that the outputs are query-specific responses, but filtered through a narrower set of options. Researchers who rely on LLM suggestions without cross-checking therefore risk narrowing their methodological search space toward a more concentrated default.
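The Jensen-Shannon divergence this abstract uses to compare LLM-suggested and paper-derived distributions can be computed as in the minimal sketch below. The two example distributions are invented for illustration and do not come from the paper; using base-2 logarithms (which bounds the divergence between 0 and 1) is also an assumption, since the abstract does not state the base.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.
    Terms with p_i == 0 contribute nothing by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical share of suggestions per provider category.
paper_dist = [0.40, 0.35, 0.25]
llm_dist = [0.70, 0.25, 0.05]

divergence = js_divergence(paper_dist, llm_dist)
```

Unlike plain KL divergence, this quantity is symmetric and stays finite even when one distribution assigns zero probability to a category, which makes it convenient for comparing sparse taxonomy counts.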
85. 【2606.26120】Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
Link: https://arxiv.org/abs/2606.26120
Authors: Tianyi Wu, Xiaoxi Sun, Yanhua Jiao, Yulin Li, Yixin Chen, YunHao Cao, YiQi Hu, Zhuotao Tian
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Diffusion Large Language, Large Language Models, Diffusion Large, bidirectional attention mechanisms, Large Language
Comments:
Click to view abstract
Abstract:Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity scales on the order of L cubed with the sequence length L. This poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose Dynamic-dLLM, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models like LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM significantly improves inference speed. It attains an average speedup exceeding 3 times while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment without compromising performance. The code is available at this https URL.
86. 【2606.26112】From Lexicon to AI: A Structured-Data Pipeline for Specialized Conversational Systems in Low-Resource Languages
链接:https://arxiv.org/abs/2606.26112
作者:Siddhant Hitesh Mantri,Dhara Gorasiya,Malhar Kulkarni,Pushpak Bhattacharya
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:massive training corpora, creating specialized conversational, training corpora, access to massive, massive training
备注: 12 pages, 3 figures
点击查看摘要
Abstract:Low-resource languages face a critical challenge in AI development: creating specialized conversational systems without access to massive training corpora. We present a systematic methodology for transforming structured linguistic resources into specialized AI systems, demonstrating that expert-curated lexical databases can serve as effective foundations for conversational AI development. Our approach converts Hindi WordNet into 1.25 million diverse instruction-response pairs and fine-tunes a 12B-parameter language model using resource-efficient LoRA with 4-bit quantization. Evaluation through a Hindi language learning chatbot demonstrates that structured-knowledge-based systems achieve superior pedagogical effectiveness (91.0 vs. 79.4-83.6 for general-purpose models) while maintaining competitive semantic performance and exceptional consistency. The complete pipeline demonstrates a proof-of-concept methodology using Hindi for developing specialized AI systems for any language with WordNet resources. This work addresses the critical gap in AI accessibility for low-resource languages, offering a practical alternative to corpus-intensive approaches and potentially enabling specialized AI development for the hundreds of languages with existing WordNet resources.
87. 【2606.26108】Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning
链接:https://arxiv.org/abs/2606.26108
作者:Guan-Yi Lin,Hen-Hsen Huang
类目:Computation and Language (cs.CL)
关键词:gap remain underexplored, language models consistently, consistently outperform smaller, remain underexplored, models consistently outperform
备注: 10 pages, 3 figures
点击查看摘要
Abstract:Larger language models consistently outperform smaller ones on reasoning benchmarks, yet the reasoning differences underlying this gap remain underexplored. Across benchmarks in mathematics, physics, chemistry, and programming, we observe stable performance gaps: averaged over datasets, Qwen3-32B outperforms Qwen3-8B by 6.43%, while GPT-OSS-120B exceeds GPT-OSS-20B by 7.38%. To study the reasoning differences behind these gains, we develop AdvCluster, an automated framework that identifies questions where the larger model shows a stable advantage, extracts fine-grained advantage descriptions from paired reasoning traces produced by larger and smaller models, and organizes them through semantic clustering with quantitative evaluation and selection guided by a reviewer model. Our analysis yields a systematic taxonomy of larger model reasoning advantages, spanning both common advantages that recur across domains and specialized advantages associated with particular domains. Across these patterns, a recurring theme is Constraint-Guided Reasoning: larger models are better at identifying explicit and implicit constraints, organizing them into structured reasoning, and using them to rule out infeasible paths and verify intermediate steps.
88. 【2606.26107】Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars
链接:https://arxiv.org/abs/2606.26107
作者:Jatin Bhusal,Salma Tamang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:expression remain underexplored, Sign language, Sign language communication, Nepali Sign Language, remain underexplored
备注: 15 pages, 5 figures, 9 tables
点击查看摘要
Abstract:Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a proof-of-concept multimodal framework that demonstrates the feasibility of generating emotion-conditioned Nepali Sign Language avatars from spoken input. As a preliminary investigation, we focus on four common Nepali words ("thank you", "hello", "house", "me") across three emotional states (happy, neutral, sad) to validate our core technical approach. Our lightweight architecture employs a shared acoustic encoder for simultaneous Automatic Speech Recognition and emotion classification, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 labeled audio samples from 50 speakers. The system demonstrates 37% parameter efficiency compared to separate model architectures while maintaining a lightweight footprint with only 22.1M parameters suitable for edge deployment. This pilot work establishes the technical foundation for emotion-aware sign language translation in low-resource settings and provides a scalable framework for future expansion to larger vocabularies and more diverse emotional expressions. Our preliminary results indicate the viability of real-time, emotionally expressive sign language communication systems for the hearing-impaired community, with clear pathways for enhancement in subsequent development phases.
89. 【2606.26106】Reducing Conversational Escalation in Large Language Model Dialogue with Nonviolent Communication Constraints
链接:https://arxiv.org/abs/2606.26106
作者:Zhixing Sun,Shenghe Xu,Tao Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, emotionally charged situations, charged situations involving, situations involving interpersonal, involving interpersonal conflict
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used in emotionally charged situations involving interpersonal conflict, frustration, and distress. While prior safety research has focused on preventing explicit harms such as toxic or policy-violating content, less attention has been paid to conversational behaviors that may unintentionally escalate conflict. In this paper, we investigate whether LLMs can be guided toward more de-escalating dialogue behavior through lightweight prompt-level constraints derived from Nonviolent Communication (NVC). We reformulate NVC principles as process-oriented guidelines that discourage blame attribution, emphasize attention to users' emotional experiences, and encourage clarification before advice. Using a dual-agent simulation framework across multiple instruction-tuned models and user resistance levels, we show that NVC-constrained prompting consistently reduces conversational escalation and stabilizes interactions with highly resistant users. These results suggest that simple communication constraints can meaningfully improve the trustworthiness of LLM dialogue in conflict-prone settings.
90. 【2606.26105】Context Recycling for Long-Horizon LLM Inference
链接:https://arxiv.org/abs/2606.26105
作者:Derek Thomas
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, Large language, exhibit strong capabilities, inefficient token usage, long conversational horizons
备注:
点击查看摘要
Abstract:Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce ContextForge, a system for context recycling that maintains task-relevant information across turns by combining structured query generation, external memory retrieval, and controlled synthesis. The system enables efficient reuse of prior computation without relying on full context replay, reducing token overhead while preserving answer quality. We evaluate ContextForge using a 15-turn conversational benchmark that tests multi-turn reasoning, back-references, and domain shifts across structured healthcare queries. Compared to a baseline agent using identical underlying models, ContextForge demonstrates improved consistency and reduced token consumption, while maintaining comparable response accuracy. These results suggest that context recycling provides a practical approach for extending LLM capabilities in long-horizon tasks without requiring larger context windows or model retraining. Code and evaluation artifacts are available at this https URL.
91. 【2606.26104】Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare
链接:https://arxiv.org/abs/2606.26104
作者:Jasmine Brazilek,Harper Dunn
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Animal-welfare advocates produce, animal welfare, millions of people, ten features produce, advocates produce
备注:
点击查看摘要
Abstract:Animal-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare. Using vocabulary-matched stance-contrast probes on a held-out animal-welfare benchmark, we measure how each of ten linguistic features changes Llama-3.2-1B's preference for pro-animal-welfare reasoning when used as fine-tuning data. Eight of the ten features produce statistically significant shifts. Seven move the model toward stronger pro-animal-welfare reasoning: assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing. Two move it the other way: hedged language and concrete sensory description both dilute the pro-animal-welfare stance. First-person perspective has no statistically significant effect. The practical recommendation for anyone writing animal-welfare text that may end up in LLM training corpora: assert a position rather than describe a scene neutrally. The features that shift the model are the ones that make the writer's position explicit; the features that dilute it hold animal-welfare content but withhold stance.
92. 【2606.26103】Investigating LLM's Problem Solving Capability -- a Study on Statics Questions
链接:https://arxiv.org/abs/2606.26103
作者:Tanner Culleton,Hung-Fu Chang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, aspects of society, range of subjects
备注: 9 pages, Engineering and Technology Symposium 2026
点击查看摘要
Abstract:Large Language Models (LLMs) have rapidly influenced many aspects of society, particularly education, due to their demonstrated ability to complete assignments and examinations across a wide range of subjects. Although prior studies have examined the educational impact of LLMs, much of the existing work relies on public or open problem datasets and lacks topic-specific analysis. In engineering education, especially within mechanical engineering, systematic investigations of LLM performance on specific problem types remain limited. Instead of using traditional methods that directly ask textbook questions to an LLM tool, our study adopts a model distillation process to evaluate LLM capabilities in solving statics problems. By distilling ChatGPT, we extracted 25 text-only statics questions and further constructed two additional datasets by adding diagrams and modifying their numerical values. Experimental results show that while LLMs perform well on text-only statics problems, their accuracy decreases when diagrams are introduced and the problems require multi-step reasoning. Further analysis suggests that this performance drop is not primarily caused by limitations in image recognition, but rather by difficulties in multi-step reasoning and in consistently applying extracted visual information across successive solution stages.
93. 【2606.26102】Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training
链接:https://arxiv.org/abs/2606.26102
作者:Jasmine Brazilek,Juliana Seawell
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:Standard post-training pipelines, apply supervised fine-tuning, pipelines apply supervised, Animal Harm Benchmark, post-training pipelines apply
备注:
点击查看摘要
Abstract:Standard post-training pipelines apply supervised fine-tuning (SFT) and reinforcement learning (RL) to make language models helpful, but these processes may inadvertently degrade values instilled during pre-training. We investigate whether the domain of post-training data differentially affects the retention of animal compassion values in a Llama 3.1 8B model mid-trained on compassion-oriented synthetic data, using both SFT (helpfulness via Dolly-15k vs. coding via Magicoder-110K) and GRPO (helpfulness via RLHFlow vs. coding via Magicoder), evaluated on the Animal Harm Benchmark (AHB 2.2) and MORU benchmark (Moral Reasoning Under Uncertainty). Helpfulness training significantly degrades animal compassion relative to coding training on AHB (SFT: 35.7% vs. 65.2%; GRPO: 18.7% vs. 32.0%), replicating across two independent helpfulness datasets and two training paradigms. On English MORU items, helpfulness training degrades general moral reasoning by 25.5 percentage points (46.4% vs. 71.9%), a striking gap that rivals the compassion effect in magnitude. However, this effect does not transfer cross-lingually: on the multilingual MORU benchmark, the domain effect disappears (SFT: 52.3% vs. 51.2%). In contrast, the animal compassion effect transfers consistently across languages, with Magicoder's AHB percentage-point gain over the base model 4.5 times larger on non-English items than English items. This divergence suggests that values instilled through mid-training are encoded more deeply and cross-lingually than reasoning improvements from domain-specific post-training. These results suggest that, for labs building on value-laden mid-training, coding-domain post-training may better preserve mid-trained values than helpfulness post-training without harming general reasoning capabilities.
94. 【2606.26101】Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
链接:https://arxiv.org/abs/2606.26101
作者:Renwei Meng,Bowen Zhang,Jian Wang,Xican Wang,Haoyi Wu,Xuanyan Qiu,Shengan Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Reliable evaluation, separate supported answering, large language models, evaluation of large, large language
备注: 16 pages, 3 figures
点击查看摘要
Abstract:Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser plus a normalized robustness parser. We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answering to abstaining. Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions. The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM this http URL. The dataset is publicly available at this https URL.
95. 【2606.26100】HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification
链接:https://arxiv.org/abs/2606.26100
作者:Kaining Li,Ruichen Yan,Yuxin Dong
类目:Computation and Language (cs.CL)
关键词:annotators naturally exploit, human annotators naturally, balanced information dissemination, ignoring inter-sentence contextual, inter-sentence contextual signals
备注:
点击查看摘要
Abstract:Media bias detection is a critical task for ensuring fair and balanced information dissemination, yet existing sentence-level approaches classify each sentence independently, ignoring inter-sentence contextual signals that human annotators naturally exploit. We present HierBias, a hierarchical context-conditioned media bias detector that formally models document context in bias prediction. We introduce the context-conditioned bias probability and prove theoretically that leveraging document context strictly reduces the Bayes error of sentence-level classification when inter-sentence mutual information is non-zero. A multi-task generalization bound further establishes that jointly training binary bias detection and fine-grained bias type classification improves sample efficiency on small annotated corpora. Architecturally, HierBias pairs a sentence-level RoBERTa encoder with a cross-sentence Transformer aggregator and dual output heads for binary detection and four-class type classification. Evaluated on BABE and BASIL, HierBias achieves 0.853 F1 and 0.723 MCC, surpassing the state-of-the-art bias detector by +2.6% F1 and +4.3% MCC (McNemar's test, p < 0.05). Ablation experiments confirm that each theoretical component contributes independently and consistently.
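For readers less familiar with the two reported metrics, the sketch below computes binary F1 and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions. The counts in the usage lines are invented for illustration and are unrelated to the paper's results.

```python
import math


def f1_score(tp, fp, fn):
    """Binary F1: harmonic mean of precision and recall, from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; 0 means chance-level."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for a biased-vs-unbiased sentence classifier.
print("F1 :", f1_score(tp=80, fp=20, fn=20))
print("MCC:", mcc(tp=80, tn=80, fp=20, fn=20))
```

MCC uses all four cells of the confusion matrix, so it is less flattering than F1 when the negative class is handled poorly.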
96. 【2605.00410】Agent Capsules: Quality-Gated Granularity Control for Multi-Agent LLM Pipelines
链接:https://arxiv.org/abs/2605.00410
作者:Aninda Ray
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
关键词:agents typically issues, issues N LLM, LLM calls, Agent Capsules, typically issues
备注: 17 pages, 7 figures. Code: [this https URL](https://github.com/aray-17/agent-capsules)
点击查看摘要
Abstract:A multi-agent pipeline with N agents typically issues N LLM calls per run. Merging agents into fewer calls (compound execution) promises token savings, but naively merged calls silently degrade quality through tool loss and prompt compression. We present Agent Capsules, an adaptive execution runtime that treats multi-agent pipeline execution as an optimization problem with empirical quality constraints. The runtime instruments coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling-mean output quality. A controlled negative result confirms that injecting more context into a merged call worsens compression rather than relieving it, so the framework's escalation ladder (standard, then two-phase, then sequential) recovers quality by moving toward per-agent dispatch rather than by rewriting merged prompts. On LLM-judged quality, the controller matches a hand-tuned oracle on every measured (model, group, mode) cell: routing compound whenever the oracle would, and reverting to fine whenever quality would fail the floor, without per-model configuration. Against a hand-crafted LangGraph implementation of a 14-agent competitive intelligence pipeline, Agent Capsules uses 51% fewer fine-mode input tokens and 42% fewer compound-mode input tokens, at +0.020 and +0.017 quality respectively. Against a DSPy implementation of a 5-agent due diligence pipeline, the framework uses 19% fewer tokens than uncompiled DSPy at quality parity, and 68% fewer tokens than MIPROv2 at +0.052 quality. Even before compound mode fires, the runtime delivers efficiency through automatic policy resolution, cache-aligned prompts, and topology-aware context injection, matching both hand-tuned and compile-time baselines without training data or per-pipeline engineering.
97. 【2511.10657】Patent Representation Learning via Self-supervision
链接:https://arxiv.org/abs/2511.10657
作者:You Zuo(ALMAnaCH),Kim Gerdes(LISN),Eric Villemonte de La Clergerie(ALMAnaCH),Benoît Sagot(ALMAnaCH)
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:contrastive objectives, patent, dropout, study self-supervised patent, Abstract
备注:
点击查看摘要
Abstract:We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful calibration. We show that dropout-only training can be substantially strengthened by tuning temperature and dropout rate, yet its best configuration is evaluation-dependent and does not transfer uniformly from title-abstract retrieval to claim-to-disclosure retrieval. We propose mixed dropout-section positives, a patent-specific view construction strategy in which the anchor is the title-abstract view and the positive is sampled either from a dropout re-encoding of the same view or from another section of the same patent, such as claims, summary, background, drawings, or description. This uses patent-internal structure as a training-time signal without IPC labels, citations, or relevance annotations. We evaluate on graded EPO search-report retrieval, DAPFAM, a recently proposed family-level patent retrieval benchmark, and IPC subclass classification. Section-based positives improve over calibrated dropout-only and generic title-abstract augmentation baselines, are competitive with citation-informed patent encoders and a general-purpose embedding model, and perform strongly on the out-of-domain split of DAPFAM. Additional cross-section alignment diagnostics show that section-pair training improves compatibility among abstracts, claims, and descriptions of the same invention. These results indicate that patent sections provide effective self-supervised positive views for learning dense patent representations.
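The view-construction idea described in this abstract can be sketched roughly as below: the anchor is always the title-abstract text, and the positive is either the same text (to be re-encoded under a different dropout mask) or another section of the same patent. This is an illustrative sketch only. The dictionary layout, the function name `sample_positive`, and the mixing probability `p_section` are assumptions, and the encoder and contrastive loss are omitted.

```python
import random

# Section names taken from the abstract; the dict keys are an assumed layout.
SECTIONS = ("claims", "summary", "background", "drawings", "description")


def sample_positive(patent, p_section=0.5, rng=random):
    """Return (anchor_text, positive_text, kind) for one training pair.

    p_section is a hypothetical mixing probability between section positives
    and dropout positives; the paper's actual schedule is not given here.
    """
    anchor = patent["title_abstract"]
    available = [s for s in SECTIONS if patent.get(s)]
    if available and rng.random() < p_section:
        section = rng.choice(available)
        return anchor, patent[section], section
    # Dropout positive: identical text; the two views differ only through
    # the encoder's independent dropout masks at training time.
    return anchor, anchor, "dropout"


# Toy patent record (invented text).
patent = {
    "title_abstract": "A hinge assembly for foldable displays.",
    "claims": "1. A hinge assembly comprising a first plate ...",
    "description": "The present disclosure relates to hinges ...",
}
rng = random.Random(0)
for _ in range(3):
    print(sample_positive(patent, p_section=0.5, rng=rng)[2])
```

Passing a seeded `random.Random` instance keeps the sampling reproducible across runs.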
信息检索
1. 【2606.27243】NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems
链接:https://arxiv.org/abs/2606.27243
作者:Shaohua Liu,Liang Fang,Yilong Sun,Shudong Huang,Qingsong Luo,Xiaoyang Chen,Dongqiang Liu,Chuangang Ma,Zhenzhen Chai,Henghuan Wang,Shijie Quan,Changyuan Cui,Zhangbin Zhu,Peng Chen,Wei Xu,Lei Xiao,Haijie Gu,Jie Jiang
类目:Information Retrieval (cs.IR); Software Engineering (cs.SE)
关键词:models are continuously, continuously improved, advertising recommender models, recommender models, architecture evolution
备注: 12 pages, 3 figures
点击查看摘要
Abstract:Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer show that better structures remain a key source of quality and business gains. Yet developing such upgrades in production is expert-intensive and difficult to scale. Existing automation is insufficient: AutoML mainly tunes hyper-parameters, while effective gains often require cross-module changes under strict constraints; generic LLM coding agents optimize for runnable code, but runnable code does not imply a valid recommender architecture. Candidates may pass local tests while causing silent failures that degrade performance. We present NOVA, a level-aware agent harness for verification-aware architecture evolution. NOVA uses an architecture gradient, an SGD-inspired, non-differentiable update signal that aggregates prior modifications, verification diagnostics, metric feedback, and trajectory memory to guide the next modification. A verification cascade checks structure semantics, local executability, offline effectiveness, and online impact; invalid candidates are blocked early, with failure patterns recorded as forbidden directions. L1--L4 task-level control matches automation to task complexity and risk, routing high-risk tasks to Copilot for human oversight. Deployed in an industrial advertising system, NOVA achieves the highest effective pass rate on L2 ScaleUp and L3 Literature-to-Production tasks (54.5% and 60.0%), reduces silent failures compared with coding-agent baselines, and shortens one literature-to-production cycle by over 13x in human-attended time. In online A/B testing, the selected L3 candidate improves GMV on three pCVR objectives by +1.25%, +1.70%, and +2.02%, while reducing pCVR bias by 58.8%, 66.7%, and 37.3%.
2. 【2606.27214】TRUST: Item-Calibrated Interval Evidence for Temporal Session-Based Recommendation
链接:https://arxiv.org/abs/2606.27214
作者:Linjiang Guo,Nitin Bisht,Shiqing Wu,Yifan Yin,Guandong Xu
类目:Information Retrieval (cs.IR)
关键词:infer user interest, recommendation to infer, infer user, session-based recommendation, Temporal
备注:
点击查看摘要
Abstract:Temporal signals have been widely used in session-based recommendation to infer user interest. Existing temporal session-based recommenders primarily rely on absolute interval values, implicitly assuming that the same interval carries similar interest signals across items. However, we empirically find that this assumption does not hold: each item has its own interval distribution, so an interval should be interpreted relative to the item it belongs to. Based on this observation, we propose TRUST, a framework that evaluates each observed interval relative to the empirical interval distribution of the corresponding item. Specifically, we propose a score function to guide global neighbor sampling, session graph encoding, and final interest aggregation. Experiments on public datasets show that TRUST consistently improves over representative temporal and non-temporal baselines, and plug-in experiments further show that the proposed scoring function can improve existing temporal session recommenders as a model-agnostic method. Component-wise ablations further show that calibrating the temporal signals within each module, rather than removing the module itself, consistently improves neighbor sampling, session graph encoding, and interest aggregation.
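The core observation, that an interval should be read relative to the item's own interval distribution rather than as an absolute value, can be illustrated with an empirical-CDF score. This is a minimal sketch under stated assumptions: the function names and the neutral fallback value are hypothetical, and the paper's actual score function is not given in the abstract.

```python
from bisect import bisect_right
from collections import defaultdict


def build_item_intervals(events):
    """events: iterable of (item_id, interval_seconds).
    Returns a dict mapping each item to its sorted observed intervals."""
    per_item = defaultdict(list)
    for item, interval in events:
        per_item[item].append(interval)
    return {item: sorted(values) for item, values in per_item.items()}


def calibrated_score(item, interval, per_item):
    """Fraction of the item's historical intervals that are <= interval.
    Near 1.0 means unusually long for this item; near 0.0 unusually short."""
    history = per_item.get(item)
    if not history:
        return 0.5  # assumed neutral value when the item has no history
    return bisect_right(history, interval) / len(history)


# Toy data: a long video and a short news item (invented intervals, seconds).
events = [("video", 100), ("video", 200), ("video", 300), ("video", 400),
          ("news", 5), ("news", 10), ("news", 15), ("news", 20)]
per_item = build_item_intervals(events)

# The same 30-second interval is short for the video but long for the news item.
print(calibrated_score("video", 30, per_item))
print(calibrated_score("news", 30, per_item))
```

In a full model, a score of this kind would weight neighbor sampling, graph encoding, and interest aggregation, as the abstract describes.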
3. 【2606.27058】UniFormer: Efficient and Unified Model-Centric Scaling for Industrial Recommendation
链接:https://arxiv.org/abs/2606.27058
作者:Bo Chen,Jinlong Jiao,Tijian Hu,Ruihao Zhang,Yanzhi Liu,Chenghou Jin,Qinglin Jia,Baixuan He,Hechang Pan,Yiwu Liu,Jian Liang,Chaoyi Ma,Ruiming Tang,Han Li,Kun Gai
类目:Information Retrieval (cs.IR)
关键词:component-centric model scaling, improve model capacity, model capacity, unified model-centric scaling, model-centric scaling framework
备注:
点击查看摘要
Abstract:Recently, substantial progress has been made in industrial recommendation through component-centric model scaling, where individual components such as behavior modeling, feature interaction, or task modeling are independently scaled to improve model capacity. Although recent methods such as HyFormer and OneTrans further explore cross-module co-scaling by jointly modeling behavior and interaction, their designs are still confined to the feature space and lack a unified model-centric scaling framework over the overall modeling space. In this paper, we propose UniFormer, an efficient and unified model-centric scaling framework for industrial recommender systems. To improve efficiency, UniFormer decomposes the overall modeling space into feature and task spaces, which are modeled by stacked Feature-space Interaction Modules and Task-space Interaction Modules, respectively. Moreover, UniFormer introduces a semantic-based tokenization scheme to enable user-item decoupling, thereby achieving request-level inference acceleration. To prevent preference collapse, UniFormer employs multi-sequence cross-attention to separately capture heterogeneous behavior patterns, followed by self-attention to enhance interaction modeling. In addition, dedicated multi-view FFNs are introduced to support flexible and scalable parameter scaling across different modeling components. Extensive online A/B testing in two production scenarios, Kuaishou and Kuaishou Lite, shows that UniFormer consistently improves user engagement and interaction metrics, achieving gains of +0.101%/+0.260% in App Stay Time and +0.729%/+1.113% in Watch Time, respectively.
4. 【2606.27010】TriPAH: Imbalance-Aware Tri-Prompt Affinity Hashing for Cross-Modal Medical Retrieval
链接:https://arxiv.org/abs/2606.27010
作者:Jiaming Bian,Songming Li,Yurui Song,Yunfei Chen,Yichao Cao,Jun Long
类目:Information Retrieval (cs.IR); Multimedia (cs.MM)
关键词:large-scale case management, big medical data, case management, efficient cross-modal retrieval, era of big
备注: 10 pages, 3 figures, 4 tables
点击查看摘要
Abstract:In the era of big medical data, efficient cross-modal retrieval is pivotal for evidence-based diagnosis and large-scale case management. Cross-modal medical hashing retrieval aims to enable efficient image-text search and support downstream tasks such as case-based reasoning and decision support by learning compact, semantically aligned binary codes. However, current methods suffer from semantic fragmentation due to noisy clinical language, long-tailed labels, and brittle quantization that weakens alignment. We propose TriPAH, a Tri-Prompt Affinity Hashing framework. TriPAH synthesizes ontology-grounded, patient-level prompts conditioned on normalized clinical cues to yield low-noise textual representations for initial alignment. A lightweight prompt-token mixer performs hierarchical, multi-granularity alignment and produces quantization-ready features under an asymmetric multi-task objective coupling multi-positive contrastive alignment, imbalance-aware classification, and progressive quantization regularization. A patient-level consistency module further stabilizes codes across complementary views. Extensive experiments on three public datasets demonstrate that TriPAH significantly outperforms state-of-the-art methods.
5. 【2606.26859】AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems
链接:https://arxiv.org/abs/2606.26859
作者:Changxin Lao,Fei Pan,Guozhuang Ma,Han Li,Huihuang Lin,Jijun Shi,Kangzhi Zhao,Kun Gai,Mo Zhou,Qinqin Zhou,Quan Chen,Ruochen Yang,Shifu Bie,Shuang Yang,Shuo Yang,Wenhao Li,Wentao Xie,Xiao Lv,Xuming Wang,Yijun Wang,Yiming Chen,Yusheng Huang,Zhongyuan Wang,Zibo Zhao,Zijie Zhuang,Baoning Xia,Chao Liu,Chaoyi Ma,Chubo He,Dawei Cong,Feng Jiang,Gang Wang,Guilin Xia,Hanwen Xu,Jiahong Xie,Jiahui Qiao,Jian Liang,Jiangfan Yue,Jing Wang,Jinghan Yang,Jinghui Jia,Kan Qin,Lei Wang,Ming Li,Peilin Song,Pengbo Xu,Qiang Luo,Ruiming Tang,Shiyang Liu,Shuxian Jin,Tao Wang,Tao Zhang,Xiang Gao,Xianghan Li,Yingsong Luo,Yiwen Ning,Yongcheng Liu,Yuan Guo,Zhaojie Liu,Zhenkai Cui
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:transition remains blocked, Recommendation algorithm iteration, attribute online results, structural execution bottleneck, engineer-bound process
备注: Authors are listed alphabetically by their first name
点击查看摘要
Abstract:Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge. We present AgentX, a production-deployed multi-agent system that fundamentally restructures this production function. AgentX operates as a self-evolving development engine: it autonomously generates, implements, evaluates, and learns from recommendation experiments at a scale and pace that no manual workflow can sustain. The system orchestrates four tightly coupled stages in a closed loop. A Brainstorm Agent synthesizes evidence from historical experiments, system architecture, data analysis, and external research into ranked, executable proposals. A Developing Agent translates each proposal into production-ready code through repository-grounded generation and multi-dimensional reliability verification. An Evaluation Agent conducts safe online rollout with guardrail-vetoed A/B judgment, converting both successes and failures into structured knowledge assets. A Harness Evolution layer (SGPO) then distills execution trajectories into semantic-gradient updates that continuously sharpen the agents themselves -- making the system not merely automated, but self-improving.
6. 【2606.26845】A Shared IPTC Topic Space for Cross-Source Topic Modelling
Link: https://arxiv.org/abs/2606.26845
Authors: Din Iskakov,Sebastian Gonçalves,Marco Idiat,Mendeli Vainstein,Aline Villavicencio,Ronaldo Menezes,Rodrigo Wilkens
Subjects: Information Retrieval (cs.IR)
Keywords: Comparing topic attention, fundamental modelling problem, produce corpus-specific topic, corpus-specific topic spaces, models fitted separately
Comments:
Click to view abstract
Abstract:Comparing topic attention across different media is hindered by a fundamental modelling problem: topic models fitted separately to each corpus produce corpus-specific topic spaces that cannot be aligned directly. This paper presents a reproducible framework that places corpora in a single shared topic space defined by a taxonomy. Discovered topics are obtained with guided BERTopic, scored against the ninety-four IPTC Media Topics' taxonomy topics (level-1) through weighted keyword and target centroids, and then collapsed upward to seventeen IPTC parent topics by a maximum-similarity rule. The framework was developed and selected on a controlled New York Times 2011 corpus through a narrowing sequence: a broad model screen, a focused mapping refinement, a strict finalist comparison, a target-construction ablation, and a threshold calibration. In this corpus, the guided family retained substantially stronger mapped coverage than a zero-shot benchmark under stricter assignment thresholds, a parent-enriched target construction improved both coverage and parent consistency, and coverage declined gradually rather than collapsing as the assignment threshold was tightened. The contribution is an externally anchored method for constructing a shared topic space that enables reproducible cross-source topic comparison.
7. 【2606.26803】From Vajrayana Tara to Bengali Baul: A Computational Study of Lexical Transmission Across Buddhist, Shakta, and Vaishnava Traditions in Bengal
Link: https://arxiv.org/abs/2606.26803
Authors: Joy Bose
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: encompassing Buddhist Vajrayana, Shakta Tantra tradition, Shakta Tantra, computational corpus study, Shakta Kali texts
Comments: 9 pages, 2 figures, 4 tables. Code and corpus: [this https URL](https://github.com/joyboseroy/bengal-dharma-corpus) Dataset: [this https URL](https://huggingface.co/datasets/joyboseroy/bengal-dharma-corpus)
Click to view abstract
Abstract:We present a computational corpus study of vocabulary relationships across eight tradition layers of Bengali and Sanskrit devotional literature spanning the 8th to 19th centuries, encompassing Buddhist Vajrayana, Shakta Tantra, Vaishnava, and Baul traditions. Using a corpus of 75 texts and TF-IDF character n-gram vectorization with cosine similarity analysis, we address the historically argued but previously unquantified claim that Buddhist Vajrayana vocabulary survived the collapse of the Pala monasteries and was absorbed into the Shakta Tantra tradition of Bengal. The central finding is a specificity result: the Gitagovinda (Vaishnava Sanskrit, 12th century) has zero cosine similarity to Shakta Kali texts, while Bridge Tara texts (Buddhist-Shakta transitional, same century, same language) have cosine similarity 0.54 to Shakta Kali. This 8.5-fold contrast between two Sanskrit traditions from the same century demonstrates that the Buddhist-Shakta vocabulary overlap is not a generic property of Sanskrit devotional literature but is specific to the Buddhist-Shakta transmission chain. Three Brihannilatantra Tara texts show Shakta-to-Buddhist vocabulary ratios of 2.0 to 4.0, constituting measurable evidence of lexical transition within that chain. Ramprasad Sen's 18th-century Bengali Kali songs preserve Buddhist vocabulary residue including 56 occurrences of Tara alongside 103 occurrences of Kali. The Vaishnava Bengali tradition contributes a parallel chain to modern Baul vocabulary (similarity 0.29), slightly weaker than the Buddhist Sahajiya chain via Charyapada (0.31). These results provide the first quantitative multi-tradition corroboration of historically argued Buddhist-Shakta syncretism in Bengal.
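The abstract above names TF-IDF character n-gram vectorization with cosine similarity as its measurement tool. The following is a minimal sketch of that general technique on toy strings, not the paper's pipeline: the trigram size, the smoothed IDF formula, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build sparse TF-IDF vectors (dicts keyed by n-gram) for each document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF is an assumption; the abstract does not state the weighting.
        vectors.append({g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
                        for g, tf in c.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "texts"; they only illustrate shared versus disjoint vocabulary.
toy_docs = ["tara tara kali", "kali kali tara", "govinda radha"]
toy_vectors = tfidf_vectors(toy_docs)
sim_related = cosine(toy_vectors[0], toy_vectors[1])
sim_unrelated = cosine(toy_vectors[0], toy_vectors[2])
```

Texts that share no character trigrams score exactly zero, which is how a zero cosine similarity between two traditions can arise under this kind of representation.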
8. 【2606.26753】ConvMemory v3: A Validity Context Layer for Conversational Memory via Target-Conditioned Relation Verification
Link: https://arxiv.org/abs/2606.26753
Authors: Taiheng Pan
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Conversational memory retrieval, memory retrieval optimizes, retrieval optimizes relevance, Conversational memory, retrieved memory
Comments: 22 pages, 3 figures
Click to view abstract
Abstract:Conversational memory retrieval optimizes relevance, yet a retrieved memory can be relevant and simultaneously outdated: a later turn updates, corrects, or supersedes it. ConvMemory v3 adds a validity context layer that detects and surfaces this update evidence through target-conditioned relation verification, sitting after the v1/v2 retrieval path. The core mechanism is a dual-evidence gate that conditions a relation judgment on the specific target proposition, scoring a (target, source) pair through the product of a MiniLM slot head and a DeBERTa-v3 slot head and gating it by conservative event/operation evidence. On a synthetic multi-hop validity benchmark the gate reaches 90.12% +/- 1.73 accuracy; through a real-data feedback loop that mines failure patterns but trains on synthetic pairs only, the verifier transfers to Memora role binding with zero target-side labels, reaching 98.8% +/- 0.9 group-all-correct. The deployed layer preserves retrieval by default: a context mode attaches structured validity metadata while keeping the candidate set and rank order fixed, and a query-conditioned demote mode is an explicit opt-in for dense current-state workloads, where it raises current-active H@1 from a never-demote baseline of 45.1% to 95.7% +/- 1.2 while protecting non-superseded memories at 99.4% recall. Six machine-verifiable safety contracts pin the layer's behavior. Multi-hop graph propagation is validated as a mechanism; fully automatic construction of strict prerequisite edges is characterized as a boundary, since strict necessity requires counterfactual world knowledge. This report extends ConvMemory v1 (arXiv:2605.28062) and v2 (arXiv:2606.10842).
9. 【2606.26690】Attributed, But Not Incremental: Cannibalization-Corrected Attribution for Large-Scale Advertising
Link: https://arxiv.org/abs/2606.26690
Authors: Donghui Li,Bowen Yuan,Zili Yang,Qinxin Chen,Lijing Song
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: large-scale paid acquisition, outputs are widely, growth advertising systems, channel diagnosis, daily budget allocation
Comments: 6 pages, 3 figures. Accepted at ADKDD 2026
Click to view abstract
Abstract:In large-scale paid acquisition and growth advertising systems, production attribution outputs are widely used for daily budget allocation and channel diagnosis. However, paid-attributed conversions such as daily new users (DNU) may systematically overstate true incremental growth when paid channels overlap with organic demand, brand-driven traffic, or other acquisition channels. This attribution-cannibalization mismatch can distort incremental ROI measurement and budget decisions at scale. We propose an experiment-calibrated attribution correction framework that uses incrementality experiments as causal anchors to convert sparse lift measurements into daily correction estimates. To make the corrected signal actionable at production granularity, we further allocate calibrated cannibalization volume across business hierarchies under structural consistency constraints. Offline forward-in-time validation against channel-level incrementality experiment readouts shows that the proposed framework substantially reduces calibration error relative to raw attribution and fine-grained ML baselines. Deployed across multiple global TikTok markets, the system supported budget and traffic strategy adjustments that were followed by an approximately 15-percentage-point reduction in the measured cannibalization rate.
10. 【2606.26654】SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context
Link: https://arxiv.org/abs/2606.26654
Authors: Qinkai Zhang,Yanyan Zhao,Xin Lu,Yulin Hu,Pengtao Han,Bing Qin
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Personalized language-model assistants, Personalized language-model, memory lens, explicitly stated, model recall preferences
Comments:
Click to view abstract
Abstract:Personalized language-model assistants are often evaluated through a memory lens: can a model recall preferences users have explicitly stated in dialogue? More comprehensive personalization demands a harder capability -- inferring what users care about from the multimodal traces they naturally leave behind. We introduce SocialPersona, a benchmark for evaluating whether multimodal large language models (MLLMs) can recover revealed preferences from longitudinal social-media timelines and use them in dialogue. Built from longitudinal timelines of 171 everyday, non-promotional social-media users, SocialPersona contains text, images, timestamps, and 2,597 human-verified preference tags across seven interest domains, separating stable interests from recent interests. It supports two tasks: constructing structured user profiles from multimodal context and generating responses aligned with inferred profiles. Experiments with proprietary and open-weight MLLMs show that models can identify broad interest domains, yet their performance drops on fine-grained and recent interests and degrades further when inferred profiles must be used to personalize dialogue. Together with evidence that text and images provide complementary preference signals, these results indicate that robust cross-modal, long-horizon user modeling remains a key challenge, and that SocialPersona can help measure and advance progress toward assistants that infer and act on revealed preferences.
11. 【2606.26485】Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs
Link: https://arxiv.org/abs/2606.26485
Authors: Xinyi Yan,Yingyi Zhang,Chengzhi Zhang
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: Microblogging platforms generate, making automatic keyphrase, dispersed user content, automatic keyphrase extraction, platforms generate massive
Comments:
Click to view abstract
Abstract:Microblogging platforms generate massive amounts of short, noisy, and dispersed user content, making automatic keyphrase extraction (AKE) an important but challenging task. Prior studies have used eye-tracking signals to improve microblog-based AKE because such signals reflect readers' attention to salient words. However, eye tracking alone is limited by physiological, acquisition, and feature-decoding constraints. To address this issue, we investigate whether electroencephalogram (EEG) signals can complement eye-tracking signals for AKE. Using the ZuCo cognitive language processing corpus, we select 8 EEG features and 17 eye-tracking features and incorporate them into microblog-based AKE models. To reduce possible distortion of cognitive signals by model structures, we inject these features into the input of the soft-attention layer and the query vectors of the self-attention layer. We then evaluate different combinations of cognitive signals across AKE models. The results show that cognitive signals produced during reading consistently improve AKE performance, regardless of feature combinations and model architectures. EEG features bring the largest gains, while combining EEG and eye-tracking features yields performance between the two individual signal types, suggesting partial complementarity but also possible redundancy or noise. These findings indicate that EEG signals provide useful cognitive evidence for microblog-based AKE and that multimodal cognitive signals deserve further investigation.
12. 【2606.26481】Extracting Problem and Method Sentence from Scientific Papers: A Context-enhanced Transformer Using Formulaic Expression Desensitization
Link: https://arxiv.org/abs/2606.26481
Authors: Yingyi Zhang,Chengzhi Zhang
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Keywords: identify essential parts, massive text, scientific papers, identify essential, essential parts
Comments:
Click to view abstract
Abstract:Billions of scientific papers lead to the need to identify essential parts from the massive text. Scientific research is an activity from putting forward problems to using methods. To learn the main idea from scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data and reduce models' reliance on FEs. For the third idea, we propose a context-enhanced transformer that utilizes context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments using large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F1 score compared to the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM based ICL methods are found to be not suitable for the task of problem and method extraction.
13. 【2606.26465】3D Spatial Pattern Matching
Link: https://arxiv.org/abs/2606.26465
Authors: Nicole R. Schneider,Avik Das,Lukas Arzoumanidis,Abhijeet Ghodgaonkar,Hanan Samet,Youness Dehbi
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Spatial pattern matching, Spatial pattern, pattern matching, constraints with database, matching
Comments:
Click to view abstract
Abstract:Spatial pattern matching is the process of matching query entities and constraints with database entities and relations. It has many applications, including similar region search, housing market search, landmark search, and road network matching. To our knowledge, all existing spatial pattern matching approaches frame the problem in a 2 dimensional space, where entities lie in a cartesian plane and relationships defined between them are contained in 2 dimensions. However, this problem framing has significant limitations when searching for real world entities that have height in addition to position. To address this limitation, we extend spatial pattern matching to 3 dimensions and provide a generalized definition of the problem. We describe a subgraph matching algorithm capable of resolving 3D spatial patterns over distance relations and release two 3D spatial pattern matching datasets, one synthetic and one containing real 3D building data from the city of Hamburg, Germany. We test our subgraph matching algorithm on both datasets and present results as a baseline for future methods to build upon.
14. 【2606.26449】ProvenAI: Provenance-Native Traces of Evidence in Generated Answers
Link: https://arxiv.org/abs/2606.26449
Authors: Mohammad Faizan,Dalal Alharthi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Keywords: Retrieval-augmented systems routinely, routinely present citations, source meaningfully shaped, systems routinely present, present citations alongside
Comments:
Click to view abstract
Abstract:Retrieval-augmented systems routinely present citations alongside generated answers, yet a citation does not confirm that the corresponding source meaningfully shaped the output. This paper introduces ProvenAI, a framework that decomposes transparency in multi-hop question answering into three independently measurable layers: answer correctness, citation fidelity against benchmark supporting evidence, and per-document influence under leave-one-resource-out intervention. Targeting the HotpotQA distractor benchmark through a seven-stage pipeline covering data normalisation, retrieval indexing, citation-aware answer generation, attribution auditing, ablation-based influence estimation, batch evaluation, and interactive inspection, ProvenAI evaluates 7,405 validation examples drawn from a canonical corpus of 509,300 passages. The system achieves 53.53% answer accuracy alongside a mean citation-fidelity score of 71.55%, and a worked example surfaces what we call the citation-influence gap: a clean citation audit co-occurring with a profile in which one cited source registers only weak influence while seven uncited sources demonstrably shift the output. We formalise the relationship between the implemented surface proxy and a token-level KL-divergence target through a stated faithfulness condition, ground the framework in causal-mediation analysis and database-provenance theory, and discuss how the three measurement layers compose with cryptographic provenance architectures emerging in autonomous scientific discovery. ProvenAI establishes that meaningful transparency in retrieval-grounded QA requires traceable links across retrieved, cited, and behaviourally influential evidence as three distinct, independently measured layers.
15. 【2606.26441】GPUSparse: GPU-Accelerated Learned Sparse Retrieval with Parallel Inverted Indices
Link: https://arxiv.org/abs/2606.26441
Authors: Ashutosh Sharma
Subjects: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC)
Keywords: retrieval quality competitive, Learned sparse retrieval, quality competitive, preserving the interpretability, interpretability and exact-match
Comments:
Click to view abstract
Abstract:Learned sparse retrieval models such as SPLADE achieve retrieval quality competitive with dense models while preserving the interpretability and exact-match advantages of sparse representations. However, inference-time scoring still relies on CPU-bound inverted index traversal algorithms (WAND, Block-Max WAND), creating a fundamental bottleneck for real-time serving at scale. We present GPUSparse, a system for GPU-accelerated exact learned sparse retrieval that introduces: (1) a GPU-parallel inverted index with block-aligned, warp-coalesced posting lists; (2) a batched scatter-add scoring algorithm that processes hundreds of queries simultaneously; and (3) fused Triton kernels with an analysis of the tradeoff between work-efficiency and hardware utilization. On MS MARCO passage ranking (8.8M passages) with real SPLADE embeddings, GPUSparse matches CPU exact scoring to three decimals (MRR@10=0.383, equal to Pyserini SPLADE at this precision; Recall@1000=0.999 vs. dense matmul, the residual from floating-point tie-breaking) while providing a 235x speedup over Pyserini CPU at 8.8M documents (1.27ms vs. 298ms per query). Compared to Seismic (the fastest CPU sparse retrieval system), which trades 25% recall for speed (R@1000=0.738 vs. 0.983 exact), GPUSparse achieves exact scoring at 787 QPS throughput (batch 500) on the full 8.8M collection, with 1.3ms per query. Our document-parallel kernel reaches 62.6% of H100 peak HBM bandwidth, revealing a fundamental work-efficiency vs. bandwidth-efficiency tradeoff in GPU sparse retrieval. The reformulation of sparse scoring as scatter-add over an inverted index is shared with SPARe's iterative mode; our contribution is its fused-kernel realization, which we measure to be 23-270x faster than a faithful SPARe iterative reimplementation.
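The abstract above describes sparse scoring reformulated as scatter-add over an inverted index. A minimal CPU sketch of that formulation in plain Python follows, assuming documents and queries are given as term-to-weight dicts. The function names and toy weights are hypothetical, and nothing here reproduces the paper's block-aligned layout, batching, or Triton kernels.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term, weight in doc.items():
            index[term].append((doc_id, weight))
    return index


def score_scatter_add(index, query, num_docs):
    """Exact sparse dot products: walk each query term's posting list and
    add (scatter) its contribution into the per-document accumulator."""
    scores = [0.0] * num_docs
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return scores


def top_k(scores, k):
    """Return the k highest-scoring doc ids, breaking ties by lower doc id."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


# Hypothetical toy collection with learned-sparse-style term weights.
toy_docs = [
    {"gpu": 1.0, "index": 0.5},
    {"cpu": 2.0, "index": 1.0},
    {"gpu": 0.5, "cpu": 0.5},
]
toy_query = {"gpu": 2.0, "index": 1.0}
toy_index = build_inverted_index(toy_docs)
toy_scores = score_scatter_add(toy_index, toy_query, len(toy_docs))
toy_ranking = top_k(toy_scores, 2)
```

Because every posting is visited and accumulated, the result equals the full sparse dot product, unlike pruning strategies that skip postings. On a GPU the inner accumulation would presumably need atomic adds or a segmented reduction, since several postings can target the same document.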
16. 【2606.26439】MaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization
Link: https://arxiv.org/abs/2606.26439
Authors: Ashutosh Sharma
Subjects: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Keywords: Multi-vector retrieval models, hardware performance unused, existing GPU implementations, GPU implementations leave, fine-grained token-level MaxSim
Comments:
Click to view abstract
Abstract:Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs and identify a severe bandwidth gap: naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded. We present TileMaxSim, a family of IO-aware Triton kernels that close this gap via (1) multi-query SRAM tiling that streams document embeddings through shared memory while accumulating per-query-token maxima in registers, reading each embedding from HBM exactly once; (2) dimension tiling that partitions the embedding dimension into 128-wide chunks, enabling scoring for d > 128 embeddings that overflow shared memory; and (3) fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring, 6.5x over fused PyTorch, 6.6-8.5x over this http URL, and 469x the scoring throughput of WARP's CPU engine on the same node. TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim. As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency). We further show constant throughput from 100K to 500K documents, data-parallel multi-GPU sharding, robustness across dimensions 64-768, and FP16/BF16/FP32 support. Concurrent work independently develops an IO-aware fused MaxSim kernel; we differ in dimension tiling for d > 128 and fused product-quantization scoring.
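MaxSim, as used by ColBERT-style models, sums over query tokens the maximum dot product against any document token. The sketch below contrasts a reference version with a streaming version that keeps one running maximum per query token and never stores the full similarity matrix, which is the idea the abstract attributes to its tiling scheme. This is plain Python for illustration only: the function names and toy embeddings are assumptions, and no GPU tiling or product quantization is modeled.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_reference(query_tokens, doc_tokens):
    """Sum over query tokens of the best-matching document token score."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def maxsim_streaming(query_tokens, doc_tokens):
    """Same score, but each document embedding is visited once and only a
    running maximum per query token is kept (no Nq x Nd matrix)."""
    running_max = [float("-inf")] * len(query_tokens)
    for d in doc_tokens:
        for i, q in enumerate(query_tokens):
            s = dot(q, d)
            if s > running_max[i]:
                running_max[i] = s
    return sum(running_max)


# Hypothetical 2-dimensional token embeddings.
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]
score_ref = maxsim_reference(toy_query, toy_doc)
score_stream = maxsim_streaming(toy_query, toy_doc)
```

Both functions assume a non-empty document. The streaming form trades the intermediate matrix for a small accumulator, which is why it suits a memory-bandwidth-bound setting.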
17. 【2606.26373】Hybrid privacy-aware semantic search: SVD-truncated document geometry and CKKS-encrypted query reranking under a restricted threat model
Link: https://arxiv.org/abs/2606.26373
Authors: Sergey Kurilenko
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Dense embeddings power, embeddings power semantic, reconstruct source text, vector database leaks, power semantic search
Comments:
Click to view abstract
Abstract:Dense embeddings power semantic search and retrieval-augmented generation, but embedding-inversion attacks can reconstruct source text from a vector: when a vector database leaks, the documents behind it leak too. The textbook defences are extremes - encrypting the whole search homomorphically is sound but too slow at million-document scale, while privacy noise degrades ranking long before it protects. We study a middle path exploiting the asymmetry between the static collection and the dynamic query. The collection is protected geometrically: each vector is truncated onto a lower-dimensional SVD subspace and rotated by a secret orthogonal transform known only to the owner. The query is protected cryptographically: it is reranked under CKKS homomorphic encryption, so an honest-but-curious server never sees the query or the scores. CKKS parameters come from a small offline benchmark. We prove a tight lower bound on the reconstruction error of any attacker confined to the protected subspace. On one million documents and five encoders the scheme preserves ranking quality (slightly improving it on strong encoders, as a linear denoiser) at sub-second latency, and an off-the-shelf inversion attack on the protected space collapses to the noise floor. We then test stronger adversaries: a known-plaintext attacker recovers the rotation by orthogonal Procrustes from about as many leaked pairs as the retained dimension; the public product-quantization codes preserve most nearest-neighbour structure; and random-projection, calibrated-noise and BEIR baselines show the truncation is an encoder-dependent accuracy cost, not a free denoiser. We state the limits: query confidentiality is cryptographic, but document protection is an empirical obfuscation layer (SVD truncation plus a secret rotation), not a cryptographic primitive, and we delimit the threat model for each claim.
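One reason a secret orthogonal transform can hide stored vectors without changing rankings is that orthogonal maps preserve inner products. The sketch below checks this property with a Householder reflection built from a random seed. It is an illustration under stated assumptions, not the paper's scheme: the SVD truncation and CKKS-encrypted reranking are omitted, and the choice of a Householder matrix, the 4-dimensional vectors, and all names are hypothetical.

```python
import random


def householder(v):
    """Return the orthogonal matrix H = I - 2 v v^T / (v^T v)."""
    n = len(v)
    vv = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# The "secret" is the random direction that defines the reflection.
rng = random.Random(0)
secret = householder([rng.gauss(0.0, 1.0) for _ in range(4)])

toy_doc = [0.2, -0.1, 0.7, 0.4]
toy_query = [0.5, 0.3, -0.2, 0.1]
rotated_doc = matvec(secret, toy_doc)
rotated_query = matvec(secret, toy_query)

score_plain = dot(toy_doc, toy_query)
score_rotated = dot(rotated_doc, rotated_query)
```

The same linearity is what the abstract's known-plaintext result exploits: given enough pairs of original and transformed vectors, an orthogonal map can be estimated, so the transform alone is obfuscation rather than encryption.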
18. 【2606.26369】Scoring Is Not Enough: Addressing Gaps in Utility-fairness Trade-offs for Ranking
Link: https://arxiv.org/abs/2606.26369
Authors: Shubham Singh,Ian A. Kash,Mesrob I. Ohannessian
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Keywords: represent the relevance, relevance of individual, individual documents, Scoring, modern information retrieval
Comments:
Click to view abstract
Abstract:Scoring functions are used to represent the relevance of individual documents. In modern information retrieval or recommendation systems, they are often learned from data and play a pivotal role in ranking sets of documents or items in a way that maximizes utility to a query or user. With the recent interest in algorithmic fairness, the success of scoring has naturally led to methods that learn scores that simultaneously trade off fairness and utility. In this work, we show that in stark contrast with utility-centric objectives, scoring is sub-optimal in achieving all utility-fairness trade-offs. We establish this with a series of counter-examples with a generic fairness formulation. We show that the issue persists whether we have a deterministic scoring function or a randomized one, or whether we measure fairness at the scope of a single query or across multiple queries. On the positive side, we empirically demonstrate that semi-greedy post-processing has the potential to achieve much better trade-offs, often approaching the ideal of exhaustive post-processing in a tractable way.
19. 【2606.26356】Instruction Bleed: Cross-Module Interference in Prompt-Composed Agentic Systems
Link: https://arxiv.org/abs/2606.26356
Authors: Ching-Yu Lin,Yifan Liu
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Keywords: recurring failure mode, agentic systems report, prompt module silently, module silently shifts, prompt-composed agentic systems
Comments: 8 pages, 2 tables. Accepted to the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN), Seoul, South Korea
Click to view abstract
Abstract:Practitioners of prompt-composed agentic systems report a recurring failure mode: editing one prompt module silently shifts the behavior of others despite no shared variable or executable dependency. We formalize this as compositional behavioral leakage (CBL): interference between modules sharing a context window. CBL is enabled by architectural non-isolation: transformer self-attention provides no formal boundary between concatenated modules. We probe CBL on a deployed job-evaluation agent (Claude Sonnet 4.6, 144 trials) through a reusable three-channel protocol that perturbs non-focal modules along volume, content, and form. Only the content channel produces a detectable paired effect (Cohen's d = 0.63, bootstrap 95% CI excluding zero); no recommendation flipped -- a sub-threshold regime invisible to standard QA but compounding across the thousands of decisions a deployed agent makes. CBL is orthogonal to known agent-failure axes (adversarial injection, cognitive degradation, multi-agent fault propagation, privacy leakage). We contribute an operational definition, a reusable protocol, a falsifiable prediction set, and a system-class characterization, establishing cross-module interference measurement as a requirement for prompt-composed agent evaluation.
20. 【2606.26277】From Clicks to Intent: Cross-Platform Session Embeddings with LLM-Distilled Taxonomy for Financial Services Recommendations
Link: https://arxiv.org/abs/2606.26277
Authors: Dianjing Fan,Yao Li,Kyaw Hpone Myint,Dwipam Katariya,Alexandre G.R. Day,Pranab Mohanty,Giri Iyengar
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Sequential user behavior, experiences differ drastically, in-app experiences differ, significant gaps remain, user behavior modeling
Comments: Dianjing Fan and Yao Li equally contributed to this work. 7 pages, 1 figure
Click to view abstract
Abstract:Sequential user behavior modeling is widely adopted in industrial recommender systems; however, significant gaps remain in financial services, where pre-login web interactions and authenticated in-app experiences differ drastically. Specifically, pre-login web users typically explore new products, whereas logged-in app users focus on account servicing. Due to the challenge of cross-channel entity resolution (e.g., matching anonymous web sessions to authenticated mobile accounts), web-based intent signals remain underutilized for post-authentication personalization. Existing methods for capturing web-based intent are often ad-hoc and narrow, lacking the flexibility to support both quantitative downstream recommendations and qualitative understanding at scale. In this work, we propose a scalable and dual-purpose intent prediction framework for web-based interactions and demonstrate its applicability for personalization. Our approach transforms raw web clickstreams into two outputs: a self-supervised Transformer encodes multi-modal clickstreams into a compact session embedding, while an LLM-based taxonomy generation and distillation pipeline produces interpretable intent labels. Our system demonstrates that self-supervised clickstream representations combined with LLM-distilled taxonomies can jointly serve quantitative tasks and qualitative understanding in production: on the mobile homepage tile ranking task, the session embedding improves macro Recall@1 by 1.88% and reduces Log Loss by 13.38% over production baselines. On the user conversion prediction task, the embedding outperforms the LLM labels by 4.3% on micro F1, while the distillation layer delivers interpretable labels at ultra-low latency with only a 7% performance drop.
21. 【2606.26246】Lacuna: A Research Map for Machine Learning
Link: https://arxiv.org/abs/2606.26246
Authors: Martin Weiss,Miles Q. Li,Alejandro H. Artiles,Yacine Mkhinini,Chris Pal,Hugo Larochelle,Nasim Rahaman
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Lacuna Deep Research, concept elements, machine learning, LLMs to turn, scholarly metadata
Comments: 14 pages, 3 figures. Preprint
Click to view abstract
Abstract:Lacuna is a research map for machine learning that uses LLMs to turn papers and scholarly metadata into markdown summaries, concept elements, research directions, and research proposals. Each item keeps links to the primary source records and papers that support it. We release the map with web, markdown, and MCP interfaces. Across LitSearch, Multi-XScience-CS/ML, and ScholarQA-CS-ML, Lacuna outperforms OpenScholar with the strongest gains on LitSearch retrieval (Recall@10 0.538 vs. 0.424 for OpenScholar v3). We also evaluate Lacuna Deep Research, a multi-stage report agent over the map, on 25 ReportBench-ML survey tasks: Lacuna Deep Research reaches 0.052 citation F1, 0.339 citation precision, 99 expert-reference hits, and 7.82/10 RACE report quality, while GPT-Researcher reaches 0.039 F1, 0.290 precision, 72 hits, and 5.24/10 RACE.
22. 【2606.26157】Reducing Redundancy in Whole-Slide Image Patching for Scalable Indexing and Retrieval
Link: https://arxiv.org/abs/2606.26157
Authors: Jialiang Geng, Ghazal Alabtah, Saghir Alfasly, Wataru Uegami, H.R. Tizhoosh
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: slide images, rapid growth, growth of digital, created an urgent, WSI indexing
Comments:
Abstract:The rapid growth of digital pathology has created an urgent need for efficient indexing and retrieval of whole slide images (WSIs). This need is intensified by emerging generative AI workflows, particularly retrieval-augmented generation (RAG), which require dependable similarity search to support high-stakes clinical decision-making. Yet the substantial cost of high-performance storage limits the scalability and accessibility of WSI indexing for many healthcare institutions. Consequently, methods that can reduce storage demands while preserving retrieval accuracy have become a critical research priority. We propose ARReST (Antithetical Redundancy Reduction Strategy), a principled oppositional framework that leverages redundancy across dissimilar tissue classes to markedly decrease the number of patches that must be indexed from each WSI. Instead of eliminating only within-class duplicates, ARReST identifies antithetical patches (those whose representations contribute minimally to cross-class discrimination) and prunes them from the searchable archive. This targeted reduction substantially compresses the index without sacrificing morphological diversity or retrieval fidelity. By minimizing superfluous patch representations, ARReST reduces storage footprint, lowers computational overhead, and accelerates similarity search across large pathology repositories. Extensive experiments on the TCGA repository (The Cancer Genome Atlas, with 21 organs) demonstrate that ARReST achieves significant index compression while maintaining competitive retrieval performance. The observed storage savings of 3% to 60% (14%$\pm$13%) can be reliably achieved without compromising retrieval performance for many organs. The proposed strategy enables scalable, cost-efficient WSI indexing and is well-suited for next-generation retrieval-driven clinical AI systems.
Computer Vision
1. 【2606.27377】DanceOPD: On-Policy Generative Field Distillation
Link: https://arxiv.org/abs/2606.27377
Authors: Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Modern image generation, Modern image, unifies diverse capabilities, demands a single, unifies diverse
Comments: Technical Report; 39 pages, 13 figures, 9 tables; Project Page at [this https URL](https://danceopd.github.io/)
Abstract:Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
2. 【2606.27376】Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards
Link: https://arxiv.org/abs/2606.27376
Authors: Ritesh Thawkar, Shravan Venkatraman, Omkar Thawakar, Abdelrahman Shaker, Fahad Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Abstract: Not available.
3. 【2606.27374】World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays
Link: https://arxiv.org/abs/2606.27374
Authors: Manish Kumar Govind, Dominick Reilly, Smit Patel, Hieu Le, Srijan Das
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: World Action Models, World Action, Action Models, predicting robot actions, generate future visual
Comments:
Abstract:Going beyond predicting robot actions, World Action Models (WAMs) can also generate future visual observations. We build on this generative capability to propose Recurrent Generative Replay (REGEN), a continual imitation learning framework that synthesizes pseudo-replay trajectories, enabling a robot policy to rehearse previously learned tasks without storing their original human demonstrations. During continual adaptation, REGEN recursively queries the WAM to synthesize pseudo-replay trajectories conditioned only on prior task instructions and current-task observations. Experiments in both simulation and real-world manipulation settings show that REGEN reduces catastrophic forgetting by up to $50\%$ relative to sequential fine-tuning, while approaching the performance of privileged experience replay methods that require access to real replay data. Finally, we analyze the factors limiting generated replay, identifying long-horizon visual degradation and action-observation inconsistency as the primary bottlenecks. Our results establish WAMs as a promising foundation for continual robot learning without stored demonstrations.
4. 【2606.27373】Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models
Link: https://arxiv.org/abs/2606.27373
Authors: Shravan Venkatraman, Ritesh Thawkar, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Salman Khan, Fahad Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improving visual reasoning, self-evolving large multimodal, large multimodal models, purely unsupervised setting, large multimodal
Comments: ECCV 2026
Abstract:Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision-language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of $+16.85$ CIDEr on COCO and $+19.66$ CIDEr on TextCaps, reduces object hallucination by $5.0$ Chair-I points, and generalizes across four model families and scales. Our code and models are available at this https URL
5. 【2606.27372】DnA: Denoising Attention for Visual Tasks
Link: https://arxiv.org/abs/2606.27372
Authors: Ron Campos, Subhajit Maity, Xin Li, Srijan Das, Aritra Dutta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: activation in multihead, attention-based models, MHA, visual perception tasks, multihead attention
Comments:
Abstract:The softmax activation in multihead attention (MHA) is the de facto standard for attention-based models in visual perception tasks. However, standard softmax can produce noisy attention patterns that dilute relevant features and degrade its performance. In this paper, we propose Denoising Attention or DnA, in which, first, a positive query identifies which image features belong to the correct class, and a negative query identifies closely associated but irrelevant image features. DnA then projects these interactions into two distinct subspaces with larger principal angles, promoting subspace separation and improved discriminability. Using a ViT-B backbone, our proposed DnA achieves an absolute gain of 0.8% on ImageNet-1K compared to the baseline. We further show improvements across multiple visual understanding tasks, including video understanding with video transformers (1.8%) and video LLMs (0.5%). Our extensive empirical analyses justify the design choices involving two interacting subspaces and the denoising effect of DnA.
6. 【2606.27371】Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance
Link: https://arxiv.org/abs/2606.27371
Authors: Pradhaan S Bhat, Rishubh Parihar, Abhijnya Bhat, R. Venkatesh Babu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: models generate stunning, generate stunning images, generate stunning, flow models generate, models
Comments: Accepted by ECCV 2026. Project page: [this https URL](https://dont-settle-at-the-mode.github.io/)
Abstract:State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning. Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead. In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions. Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.
7. 【2606.27364】PhysiFormer: Learning to Simulate Mechanics in World Space
Link: https://arxiv.org/abs/2606.27364
Authors: Yiming Chen, Yushi Lan, Andrea Vedaldi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Project page: [this https URL](https://yimingc9.github.io/physiformer)
Abstract: Not available.
8. 【2606.27354】Error-Conditioned Neural Solvers
Link: https://arxiv.org/abs/2606.27354
Authors: Haina Jiang, Liam Wang, Peng-Chen Chen, Min Seop Kwak, Seungryong Kim, Brian Bell, Jeong Joon Park
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Keywords: purely statistical task, surrogate models offer, models offer fast, offer fast approximate, fast approximate mappings
Comments:
Abstract:Neural surrogate models offer fast approximate mappings from PDE parameters to solutions, but they typically treat solving as a purely statistical task: once trained, they struggle to correct their own constraint violations and extrapolate beyond the training distribution. Recent hybrid methods promote physical correctness by targeting the PDE residual via gradient descent or Gauss-Newton steps, but inherit the compute cost and instability of the underlying classical optimizers. We show, theoretically and empirically, that numerically minimizing the PDE residual can be an unreliable proxy for reconstruction accuracy in ill-conditioned systems, explaining why these methods often do not make accurate predictions despite achieving low residuals. We propose Error-Conditioned Neural Solvers (ENS), built on a different principle: rather than an optimization target, the PDE residual field is passed as a direct input to the network at each iteration, enabling it to read the spatial structure of its own errors and learn an update policy to iteratively correct its predictions. Across four PDE families, ENS attains the highest prediction accuracy in the large majority of settings, with gains reaching $10\times$ on turbulent Kolmogorov flow, while avoiding the expensive compute cost of hybrid methods. ENS's learned correction policy generalizes under distribution shift, including zero-shot parameter changes and cross-equation transfer, where its relative advantage is largest in the ill-conditioned regimes where residual minimization is least reliable. Project website: this https URL.
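The abstract's point that a low residual can coexist with a large error in ill-conditioned systems can be illustrated with a toy linear system. This is only a minimal sketch, not the paper's PDE setting: the matrix `A`, the candidate solution `x_hat`, and the helper names `matvec` and `norm` are all assumptions chosen for illustration.

```python
import math


def matvec(A, x):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


# Hypothetical ill-conditioned system: the condition number of A is 1e6.
A = [[1.0, 0.0],
     [0.0, 1e-6]]
x_true = [1.0, 1.0]
b = matvec(A, x_true)

# A candidate that is badly wrong in the weakly observed direction.
x_hat = [1.0, 0.0]

# Residual ||b - A x_hat|| is tiny, yet the error ||x_true - x_hat|| is large.
residual = norm([bi - ri for bi, ri in zip(b, matvec(A, x_hat))])
error = norm([t - h for t, h in zip(x_true, x_hat)])

print(f"residual = {residual:.1e}, error = {error:.1e}")
```

In this toy case a solver that only drives the residual down has little signal about the second component, which is the kind of situation the abstract describes for residual-minimizing hybrid methods.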
9. 【2606.27345】RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation
Link: https://arxiv.org/abs/2606.27345
Authors: Minghao Yin, Jiahao Lu, Wenbo Hu, Wang Zhao, Shan Ying, Kai Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diffusion transformers position, Modern video diffusion, camera sampling grid, video diffusion transformers, Plucker reciprocal product
Comments:
Abstract:Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes -- a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plücker reciprocal product, which is bilinear in the two rays -- the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plücker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. Because the injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content-geometry cross-terms, all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. The full module adds less than 0.1% parameters to a pretrained video DiT, is zero-initialized to start from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.
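The analogy between the reciprocal product and an attention dot product can be checked numerically. The sketch below uses the standard Plücker representation of a ray (direction `d` and moment `m = o x d`); the function names `plucker`, `reciprocal`, and `flip` are illustrative assumptions, and none of this reproduces the paper's learned gating or normalization.

```python
def cross(a, b):
    """3D cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two equal-length tuples."""
    return sum(x * y for x, y in zip(a, b))


def plucker(origin, direction):
    """6D Pluecker coordinates (d, m) of a ray, with moment m = origin x direction."""
    return tuple(direction) + cross(origin, direction)


def reciprocal(l1, l2):
    """Reciprocal product d1.m2 + d2.m1; it is zero when the two lines are coplanar."""
    return dot(l1[:3], l2[3:]) + dot(l2[:3], l1[3:])


def flip(line):
    """Swap the direction and moment halves, (d, m) -> (m, d)."""
    return line[3:] + line[:3]


# Two rays leaving the same camera centre intersect, so the product vanishes.
ray_a = plucker((1, 2, 3), (1, 0, 0))
ray_b = plucker((1, 2, 3), (0, 1, 0))

# Two skew rays give a non-zero product.
ray_c = plucker((0, 0, 0), (1, 0, 0))
ray_d = plucker((0, 0, 1), (0, 1, 0))

print(reciprocal(ray_a, ray_b), reciprocal(ray_c, ray_d))
# A plain 6D dot product against the flipped key equals the reciprocal product.
print(dot(ray_c, flip(ray_d)) == reciprocal(ray_c, ray_d))
```

Treating one ray's coordinates as a query and the flipped coordinates of the other as a key is one way to read the "query/key flip arrangement" mentioned in the abstract.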
10. 【2606.27339】SAM2Matting: Generalized Image and Video Matting
Link: https://arxiv.org/abs/2606.27339
Authors: Ruiqi Shen, Guangquan Jie, Chang Liu, Henghui Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: requires frame-wise understanding, remains challenging due, matting remains challenging, extremely fine-grained details, video matting remains
Comments: ECCV 2026. Extended version. Project Page: [this https URL](https://henghuiding.com/SAM2Matting/)
Abstract:Despite impressive advances in image matting, video matting remains challenging due to the inherent gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods attempt this with expensive and narrowly-scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink the paradigm with SAM2Matting, a tracker-to-matting framework that advances VOS trackers to high-fidelity video matting. Specifically, it decouples the task by enhancing a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, enabling the uncompromised tracker to handle temporal consistency while the matting components resolve fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art performance on video matting, supports diverse prompt types, maintains strong temporal consistency, and demonstrates robust generalization across both human-centric and in-the-wild scenarios.
11. 【2606.27332】RoPEMover: Depth-Aware Object Relocation via Positional Embeddings
Link: https://arxiv.org/abs/2606.27332
Authors: Ipek Oztas, Duygu Ceylan, Aybars Bugra Aksoy, Aysegul Dundar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: including handling occlusions, revealing previously unseen, previously unseen regions, geometry-consistent spatial rearrangement, maintaining coherent shadows
Comments:
Abstract:Moving an object in a single image requires geometry-consistent spatial rearrangement, including handling occlusions, revealing previously unseen regions, and maintaining coherent shadows and reflections. Existing approaches are not well suited to this setting and often fail to preserve such scene-level consistency. We address this problem by introducing a geometry-aware object motion method that operates directly on the positional representations of diffusion transformers. Our key insight is that rotary positional embeddings (RoPE) define a structured spatial field that can be explicitly manipulated to induce controlled motion. We extend 2D RoPE into a depth-aware formulation that encodes 3D spatial structure, enabling consistent object displacement and scene-aware updates. Our model is trained using synthetic data combined with a small set of real images via parameter-efficient fine-tuning. Despite minimal real supervision, it preserves object identity under large spatial displacements, generates plausible content in newly revealed regions, and consistently updates scene-dependent effects such as shadows and illumination. Experimental results on standard object motion benchmarks demonstrate state-of-the-art performance across all evaluation metrics.
12. 【2606.27330】Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
Link: https://arxiv.org/abs/2606.27330
Authors: Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Multimodal web agents, operating repetitive GUI, repetitive GUI tasks, Multimodal web, repetitive GUI
Comments: Accepted to ACL 2026 Main
Abstract:Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and privacy preserving compared with commercial large models, they suffer from weak planning and limited cross website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle and high levels. Our analysis reveals that mastering low level atomic skills does not guarantee high level planning competence, while high level task training yields stronger OOD generalization. Experiments on real world benchmarks demonstrate PEEU's superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high level tasks and leveraging experiences are crucial for the OOD planning abilities of small MLLMs.
13. 【2606.27326】Hallucination in World Models is Predictable and Preventable
Link: https://arxiv.org/abs/2606.27326
Authors: Nicklas Hansen, Xiaolong Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Modern generative world, realistic action-controllable futures, rollouts remain visually, render increasingly realistic, increasingly realistic action-controllable
Comments: Interactive paper, live demo, code, dataset, and models: [this https URL](https://www.nicklashansen.com/mmbench2)
Abstract:Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while drifting from the ground-truth dynamics. We hypothesize that hallucination concentrates in low-coverage regions of the state-action space, where lightweight data-centric signals can both detect it and guide mitigation. To test this, we introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, and train a 350M-parameter world model on it. We identify three distinct hallucination modes: perceptual, action-marginalized, and scene-diverging -- each anchored to a different stage of the pipeline, and develop three signals that accurately predict where the model will fail. To close coverage gaps at training time, we develop a coverage-aware sampling technique; to close them online, our hallucination predictors serve as curiosity rewards for targeted data collection, yielding a data-efficient finetuning recipe that adapts the pretrained world model to entirely unseen environments with as few as 50 real environment trajectories. Overall, our findings reveal that hallucination in world models is inherently a data coverage issue, and that the same signals used to detect it can also be used for mitigation. An interactive web version of our paper is available at this https URL
14. 【2606.27325】Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model
Link: https://arxiv.org/abs/2606.27325
Authors: Zizhao Yuan, Zhengtu Liang, Taowen Wang, Qiwei Liang, Yichi Wang, Yunheng Wang, Yuetong Fang, Lusong Li, Zecui Zeng, Renjing Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: forecasting future states, Recent advances, modeling complex interactions, show promising progress, promising progress
Comments:
Abstract:Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When uniformly aggregated, optimization exhibits an imbalance across action components, which hinders the modeling of fine-grained effects and affects action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.
15. 【2606.27317】OctoSense: Self-Supervised Learning for Multimodal Robot Perception
Link: https://arxiv.org/abs/2606.27317
Authors: Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Randall Balestriero, Pratik Chaudhari
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: RTK-corrected global positioning, global positioning system, open-source sensor platform, inertial measurement unit, stereo RGB
Comments:
Abstract:We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a "late-fusion" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: this https URL.
16. 【2606.27313】ViQ: Text-Aligned Visual Quantized Representations at Any Resolution
Link: https://arxiv.org/abs/2606.27313
Authors: Xumin Yu, Zuyan Liu, Zhenyu Yang, Yuhao Dong, Shengsheng Qian, Jiwen Lu, Han Hu, Yongming Rao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: simpler multimodal modeling, natural pursuit, Visual Quantized Representations, representations, Visual
Comments: Accepted to ECCV 2026
Abstract:A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer from severe loss of detail. We present ViQ, a Visual Quantized Representations framework, which is designed to balance semantics and details in discrete representations while supporting inputs at native resolutions, thereby enabling it to serve as a unified and general discrete representation for arbitrary visual inputs. Our approach structures quantization learning into two stages: text-aligned pre-training and feature discretization. With text-aligned pre-training, we enhance the visual encoder with semantic-rich supervision from the pretrained language model and enable it to process native-resolution visual inputs. During discretization, we propose a proximal representation learning strategy to progressively compact the feature space, along with a position-aware head-wise quantization mechanism that enables flexible processing of arbitrary resolutions. Extensive experiments on multimodal tasks demonstrate that ViQ achieves competitive performance compared to state-of-the-art multimodal vision encoders with continuous and high-dimensional visual features, while maintaining high precision in low-level reconstruction. We also show that multimodal training with visual quantized representations largely improves efficiency, yielding up to 20%-70% acceleration with different base LLMs and training recipes.
17. 【2606.27307】See & Sniff: Learning Visuo-Olfactory Representations
Link: https://arxiv.org/abs/2606.27307
Authors: Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Son Chung, Arda Senocak
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: models integrate vision, olfaction remains largely, remains largely unexplored, largely unexplored due, paired visuo-olfactory data
Comments: ECCV 2026. Project Page: [this https URL](https://mm.kaist.ac.kr/projects/SeeandSniff/)
Abstract:While modern multimodal models integrate vision with language, audio, or touch, olfaction remains largely unexplored due to the lack of paired visuo-olfactory data. We introduce SmellNet-V, a scalable visuo-olfactory dataset built on the insight that odor identity is largely invariant to visual transformations within a semantic category. This allows us to synthetically pair smell-only samples with semantically aligned in-the-wild web images, converting a unimodal olfactory dataset into a cross-modal benchmark without costly co-collection. Building on this dataset, we propose See & Sniff, a self-supervised framework that learns joint visuo-olfactory representations via dense local alignment and naturally produces smell saliency maps for spatial grounding of odor sources. We further introduce a pixel-level smell localization task and a benchmark for evaluation. Our method surpasses smell-only baselines by 7% in smell classification from smell alone and generalizes to cross-modal retrieval and smell localization, establishing visuo-olfactory learning as a new direction in multimodal perception.
18. 【2606.27305】Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN
Link: https://arxiv.org/abs/2606.27305
Authors: Archer Moore, Mingming Gong, Liam Hodgkinson
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: explicit surface representations, existing pipelines optimise, pipelines optimise explicit, optimise explicit surface, converting radiance fields
Comments:
Abstract:Reinforcement learning from human feedback (RLHF) for 3D generation is now established across a number of works, but most existing pipelines optimise explicit surface representations, often by converting radiance fields into meshes and training heavily on surface-supervised data. We instead fine-tune a pretrained 3D-aware generative model directly from a learned reward over radiance-field density ($\sigma$) values, with no externally supplied mesh or shape prior. The reward model requires no pretraining, trains easily on a small set of preference samples, and yields robust improvement in 3D geometry. Working on an unconditional 3D-aware face GAN (EG3D), our reward reads the continuous 3D density field of the neural radiance field (NeRF) directly and supplies a geometry-only learning signal, requiring neither text conditioning, mesh extraction, nor multi-view rendering. A density-consistency constraint keeps the 2D appearance qualitatively similar while the geometry is reshaped, at a measurable but bounded distributional cost (FID-50k rises from 4.09 to 6.66): the fine-tuned generator, trained from the preferences of a single annotator as a proof of concept, produces face geometries preferred by users in 74.4% of pairwise comparisons.
19. 【2606.27280】Exact and Deterministic Patch Descriptor Retrieval via Hierarchical Normalization
Link: https://arxiv.org/abs/2606.27280
Authors: Koichi Sato
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: descriptor retrieval method, exact nearest neighbour, exhaustive full-vector search, patch descriptor retrieval, K-dim major component
Comments: 9 pages, 4 figures
Abstract:We present a patch descriptor retrieval method that returns the exact nearest neighbour -- provably identical to exhaustive full-vector search -- while evaluating only a small fraction of the database, and does so deterministically: the same (database, query) pair always produces the same result, independent of run order, thread count, or hardware. This contrasts with approximate nearest-neighbour (ANN) approaches such as HNSW and IVF-PQ, which trade exactness for speed and may return different results across runs. The enabling mechanism is Hierarchical Normalization (HN): a normalisation scheme that splits the pre-normalisation feature vector into a K-dim major component (norm sqrt(1-alpha)) and a (128-K)-dim minor component (norm sqrt(alpha)). Since the minor inner product is bounded by alpha (Cauchy-Schwarz on the prescribed norms), the major similarity plus alpha is an admissible upper bound on the full similarity: the search scans the K-dim major component for all entries, then applies full 128-dim evaluation only to entries that cannot be pruned -- a provably exact branch-and-bound scan. We train HN-modified HardNet on the notredame split of the UBC patch dataset and evaluate on trevi and halfdome. With a cache-optimised Structure-of-Arrays layout and K=8, alpha=1/32, the search achieves 13.7x (trevi) / 12.7x (halfdome) speed-up over brute-force 128-dim search, with only 0.4% of entries requiring full evaluation. At K=16, alpha=1/8, FPR@95 rises from 0.0062 to 0.0064 on trevi at 7.2x speed-up, with 98.8% of entries bypassing full evaluation.
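The pruning argument in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names `hn_normalize` and `exact_search` are hypothetical, the loop is plain NumPy rather than a cache-optimised Structure-of-Arrays layout, and the small `eps` slack is an added assumption to keep the bound admissible under floating-point rounding. It only shows why "major similarity plus alpha" is a safe upper bound and how that bound permits early termination without losing exactness.

```python
import numpy as np

def hn_normalize(x, k, alpha):
    """Scale the first k dims to norm sqrt(1 - alpha) and the rest to norm sqrt(alpha)."""
    major, minor = x[..., :k], x[..., k:]
    major = major / np.linalg.norm(major, axis=-1, keepdims=True) * np.sqrt(1.0 - alpha)
    minor = minor / np.linalg.norm(minor, axis=-1, keepdims=True) * np.sqrt(alpha)
    return np.concatenate([major, minor], axis=-1)

def exact_search(db, q, k, alpha, eps=1e-9):
    """Return (index, similarity, n_full_evals) for the exact nearest neighbour of q."""
    # Cauchy-Schwarz: the minor inner product is at most sqrt(alpha) * sqrt(alpha) = alpha,
    # so major similarity + alpha can never underestimate the full similarity.
    bound = db[:, :k] @ q[:k] + alpha + eps
    # Visit the most promising entries first; a stable sort keeps the order deterministic.
    order = np.argsort(-bound, kind="stable")
    best_i, best_s, n_full = -1, -np.inf, 0
    for i in order:
        if bound[i] < best_s:
            break  # every remaining entry has an even smaller bound, so none can win
        s = float(db[i] @ q)  # full-vector evaluation only for unpruned entries
        n_full += 1
        if s > best_s or (s == best_s and i < best_i):
            best_i, best_s = int(i), s  # ties resolve to the lowest index
    return best_i, best_s, n_full
```

Calling `exact_search(db, q, 8, 1/32)` on descriptors produced by `hn_normalize` returns the winning index together with the number of full evaluations actually performed, which can be compared against `len(db)` to see how much of the database was pruned.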
20. 【2606.27277】EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting
Link: https://arxiv.org/abs/2606.27277
Authors: Junwei Luo, Shuai Yuan, Zhenya Yang, Yansheng Li, Zhe Liu, Hengshuang Zhao
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Earth surface dynamics, predict future Earth, future Earth surface, Earth surface, Earth Observation
Comments: 28 pages, 5 figures, 11 tables
Abstract:Earth Observation (EO) forecasting aims to predict future Earth surface dynamics from satellite observations under changing meteorological conditions. In this paper, we view this task as a partially observed, weather-driven world modeling problem, in which weather acts as a conditioning signal, while forecasting remains uncertain due to sparse observations and unobserved land-surface states. However, existing methods do not fully capture this setting: deterministic models collapse uncertainty into a single future prediction, while diffusion-based methods typically treat weather variables as undifferentiated conditioning signals, and existing benchmarks focus mainly on reconstruction accuracy rather than whether forecasts respond correctly to changed weather. We introduce EO-WM, a video diffusion transformer for multispectral EO forecasting. EO-WM incorporates a physically informed conditioning framework that represents meteorological forcing through a climatological baseline, weather anomalies, and cumulative physical stress signals. Specifically, it separates baseline and anomaly through distinct conditioning pathways, and accumulates anomalous forcing over time to capture sustained heat and drought stress. To evaluate weather-response behavior beyond standard metrics, we introduce two diagnostic benchmarks: an Extreme Summer Benchmark for severity-aware prediction of vegetation degradation under extreme weather, and a Seasonal Matched-Pair Benchmark for testing response fidelity under changed weather forcing. Experiments show that EO-WM reduces the error in predicted Normalized Difference Vegetation Index (NDVI) decline amplitude by a relative 5.63% and improves directional hit rate by a relative 7.80%, while remaining competitive on standard pixel-level metrics. The benchmarks and model will be made open-source at this https URL.
21. 【2606.27264】CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs
Link: https://arxiv.org/abs/2606.27264
Authors: Hashmat Shadab Malik, Anees Ur Rehman Hashmi, Numan Saeed, Muzammal Naseer, Salman Khan, Christoph Lippert
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong promise, Reasoning, shown strong, strong promise, Clinically Organized Reasoning
Comments:
Abstract:Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain out of reach, as neither the structured supervision needed to train them nor the protocol needed to verify their reasoning yet exists. We introduce CORTEX (Clinically Organized Reasoning and sTructured EXplanation), a structured reasoning benchmark for 3D chest CT. For each question, CORTEX restores the missing reasoning as a four-stage diagnostic trace mirroring a radiologist's workflow: task understanding, visual observation, diagnostic reasoning, and answer synthesis. We generate these traces using frontier large language models with broad medical and general-domain knowledge, then filter and verify them with a stage-level evaluation protocol combining automated rubric scoring with expert radiologist review. Crucially, both the reasoning structure and evaluation rubrics are designed in close collaboration with clinicians. Built on CT-RATE, a large, publicly available chest CT dataset without reasoning annotations, CORTEX comprises 76,177 validated reasoning traces across open-ended VQA, closed-ended VQA, and report generation, providing both the structured supervision and the stage-level evaluation protocol needed to build and evaluate trustworthy reasoning models for 3D chest CT. Our dataset and evaluation code will be made publicly available upon acceptance.
22. 【2606.27234】From Celebrities to Anyone: Characterizing AI Nudification Content, Technology, and Community Dynamics on 4chan
Link: https://arxiv.org/abs/2606.27234
Authors: Chi Cui, Yixin Wu, Yang Zhang
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: sexually explicit imagery, create synthetic non-consensual, synthetic non-consensual sexually, non-consensual sexually explicit, explicit imagery
Comments: 22 pages, 13 figures, 2 tables
Abstract:AI nudification uses generative models to create synthetic non-consensual sexually explicit imagery (SNEACI) of real individuals. Prior work has examined dedicated nudification platforms and model repositories, finding that most targets are female celebrities. However, the anonymous content community, where SNEACI is actively requested, generated, and exchanged, remains unexplored. In this work, we present a large-scale study of AI nudification in the wild, identifying 24,105 SNEACI items. We find a significant shift in target demographics: non-celebrity individuals now account for 55.8% of targets, compared to only 4.7% in prior studies, indicating that AI nudification has expanded from targeting public figures to increasingly harming individuals within users' own social circles. Meanwhile, open-source models dominate production, with the Stable Diffusion family generating 42.7% of images and Wan generating 66.5% of videos, all driven by thousands of shared fine-tuned models and accessible tutorials. Yet the ecosystem runs on a small cohort of active producers, the most prolific of whom produced 780 items; this cohort drives community engagement, shapes target demographics, and disseminates technical knowledge that lowers barriers for new producers. Our work provides an empirical understanding of how AI nudification operates in the wild, revealing the mechanisms that sustain this ecosystem and highlighting the urgent need for interventions in platform governance, technical safeguards, and affected individual protection.
23. 【2606.27223】SatSplatDiff: Geometry-preserving generative refinement for high-fidelity satellite Gaussian Splatting
Link: https://arxiv.org/abs/2606.27223
Authors: Jiyong Kim, Shuang Song, Ronjgun Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: representing radiometrically diverse, diverse satellite scenes, radiometrically diverse satellite, Gaussian Splatting, flexibility and efficiency
Comments: 23 pages, 15 figures
Abstract:Gaussian Splatting has been recently explored for satellite 3D reconstruction, demonstrating flexibility and efficiency in representing radiometrically diverse satellite scenes. However, the limited top viewpoint of satellite imagery results in insufficient supervision on building facades, leaving surface holes and degraded visual fidelity. Generative refinement, which leverages pretrained generative priors to iteratively refine and update the rendered images used as supervision targets, has recently been investigated to improve the visual fidelity of Gaussian-rendered images. However, since these models refine each view independently, the resulting images can generate hallucinations and break photo-consistency, leading to geometric degradation. To address these limitations, we propose SatSplatDiff, which aims to minimize geometric degradation prevalent in generative refinement. Building on photogrammetric DSM initialization and 2DGS-based shadow casting established in our prior work SatSplat, we first introduce monocular depth supervision and multi-scale geometric refinement to establish a geometrically accurate and well-regularized surface representation. We then apply shadow-guided generative refinement, where geometrically calculated shadow maps guide the Gaussians to maintain consistency with the underlying geometry, improving visual fidelity while reducing geometric degradation. Extensive evaluations on the IARPA2016 and DFC2019 datasets demonstrate state-of-the-art performance, reducing geometric MAE by up to 18% and improving visual fidelity (FID-CLIP) by 28-45% over existing baselines. Our method delivers up to 5x resolution enhancement with minimal hallucination and sensor-consistent appearance, demonstrating seamless cross-tile consistency and strong scalability for large-scale reconstruction. Source code is available at this https URL
24. 【2606.27192】LISA: Likelihood Score Alignment for Visual-condition Controllable Generation
Link: https://arxiv.org/abs/2606.27192
Authors: Yanghao Wang, Hongxu Chen, Jiazhen Liu, Zhenqi He, Rui Liu, Zhen Wang, Long Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual-condition controllable generation, shown remarkable success, side network, frozen pretrained main, prevalent dual-branch paradigm
Comments:
Abstract:The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features to a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: 1) The main network preserves visual perceptual quality by providing a prior unconditional score. 2) The side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate feature of the side network with an approximated likelihood score. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space by a lightweight decoder. Then, we construct an approximated likelihood score target and calculate the distance between the decoder's output and this target as an additional regularization loss. Finally, we jointly optimize the side network and decoder with both standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrated that LISA can not only consistently accelerate the training convergence and improve final synthetic results, but also encourage the side network's features to be more disentangled for conditional modeling with negligible additional training cost and zero extra inference cost.
25. 【2606.27187】HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
Link: https://arxiv.org/abs/2606.27187
Authors: Jiajun Wu, Haoyu Kang, Yining Sun, Jiacheng Hou, Heng Zhang, Danyang Zhang, Zhenjun Zhao, Haochi Zhang, Leixin Sun, Eric Hanchen Jiang, Yushan Li, Ruiyu Li, Mengkai Huang, Yan Gao, Xu Zhang, Guancheng Wan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: automated content moderation, sparking growing interest, recently shown immense, shown immense potential, Large vision-language models
Comments:
Abstract:Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 2) Explanatory rationales are completely absent. Current frameworks measure exclusively whether a model flags a video correctly rather than explaining why, turning evaluation into a black box where models can succeed through superficial shortcuts. To address these problems, we present HarmVideoBench, a multi-layered diagnostic benchmark comprising 1,379 videos paired with 4,137 multiple-choice questions. HarmVideoBench benchmarks three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning, aiming to evaluate models' deep understanding beyond surface cues with carefully balanced and curated samples. We evaluate 19 leading models on HarmVideoBench to assess their multidimensional understanding of harmful videos. Moreover, we introduce BCR, a benchmark-aligned method that predicts reasoning boundaries and dynamically retrieves context only when needed. Experimental results show that BCR substantially improves the base model's performance in harmful video understanding, raising the macro average from 61.7 percent to a state-of-the-art 84.4 percent.
26. 【2606.27147】Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks
Link: https://arxiv.org/abs/2606.27147
Authors: Yunqi Xue, Zhijiang Li, Philip Torr, Jindong Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Unlike diffusion-based models, sequentially predicting discretized, Unlike diffusion-based, predicting discretized visual, discretized visual tokens
Comments: 10 pages including references, 8 figures, accepted for publication at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Unlike diffusion-based models that operate in continuous latent spaces, autoregressive unified multimodal models produce images by sequentially predicting discretized visual tokens. These tokens are derived from a codebook that maps embeddings to quantized visual patterns. The language-like architecture enables unified multimodal models to effectively capture text conditional information for generation, making them promising for text-to-image tasks. This also raises an interesting question: how safe are the images generated in such an autoregressive way? In this work, we propose iterative self-improving codebooks for safe autoregressive generation. We leverage the understanding and judgment capabilities of the unified multimodal model itself to identify unsafe generated images without human annotation. Subsequently, the inherent representations in the codebook are fixed to eliminate harmful mappings. Our method comprises two steps: first, we use the unified model to identify unsafe generations and construct corresponding harmful and safe image-text pairs. These pairs are used to construct the Harmful Space and guide updates to the codebook, thereby eliminating harmful outputs. Second, we perform adaptive fine-tuning on the codebook within the harmless space using safe image-text pairs to ensure the quality of generated images. These two steps are repeated until no further improvement is observed, producing a safety-enhanced model codebook. Without additional external feedback, the safety of models is improved iteratively.
27. 【2606.27128】FlameVQA: A Physically-Grounded UAV Wildfire VQA Benchmark with Radiometric Thermal Supervision
Link: https://arxiv.org/abs/2606.27128
Authors: Mobin Habibpour, John Spodnik, Niloufar Alipour Talemi, Fatemeh Afghah
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: complex aerial scenes, limit RGB-only interpretation, UAVs requires reliable, scale variation, aerial scenes
Comments:
Abstract:Wildfire monitoring from UAVs requires reliable reasoning over complex aerial scenes, where smoke, scale variation, and occlusions often limit RGB-only interpretation. We introduce FlameVQA, a multiple-choice visual question answering benchmark for UAV-based wildfire intelligence built on FLAME 3, leveraging paired RGB imagery and radiometric thermal TIFFs for temperature-grounded, safety-critical reasoning. FlameVQA includes 34 multiple-choice questions per image spanning six operational capability groups, covering tasks such as detection, localization, distribution/coverage estimation, cross-modal reasoning, and flight planning. To ensure label reliability, we combine MLLM-assisted annotation with deterministic thermal rules and cross-question consistency checks, followed by human auditing. We also evaluate representative MLLMs on FlameVQA to provide baselines for future work. Results show strong performance when explicit cross-modal cues are available, but notable failures on presence detection under heavy smoke and on coverage estimation. These findings suggest that current MLLMs require domain-specific adaptation to better support disaster and wildfire monitoring. The dataset and benchmark code are open-source at this http URL
28. 【2606.27123】Proposal-Conditioned Latent Diffusion for Closed-Loop Traffic Scenario Generation
Link: https://arxiv.org/abs/2606.27123
Authors: Shubham Vaijanath Phoolari, Aleyna Kara, Christoph Lauer, Steven Peters
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Closed-loop traffic simulation, traffic simulation remains, simulation remains challenging, Closed-loop traffic, generate interactive multi-agent
Comments: Accepted for publication at the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2026
Abstract:Closed-loop traffic simulation remains challenging because it must generate interactive multi-agent behaviors that are scene-consistent and controllable throughout rollout. Prior diffusion-based approaches achieve strong realism, but their computational cost can hinder deployment in time-constrained replanning loops for autonomous vehicle planning and simulation. We present a diffusion-based scenario generation framework conditioned on instance-centric scene context and multimodal proposal priors, with optional test-time guidance for shaping safety-critical behaviors. A compact action-latent representation and proposal-based initialization improve sampling efficiency and reduce per-step runtime without retraining. Experiments on the Waymo Open Motion Dataset demonstrate a favorable balance among realism, safety, and controllability across diverse interactive scenarios, while showing that test-time guidance enables systematic trade-offs among competing objectives.
29. 【2606.27089】TMP: Tree-structured Mixed-policy Pruning for Large-scale Image Generation and Editing
Link: https://arxiv.org/abs/2606.27089
Authors: Peizhen Zhang, Yang Li, Xunsong Li, Songtao Liu, Zewen Liu, Qiangqiang Hu, Guotong Guo, Jupeng Ding, Yifu Sun, coopersli, Jian Zhang, Zhao Zhong, Liefeng Bo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-fidelity image synthesis, meet high-fidelity image, Modern image generation, model rapidly grows, Modern image
Comments: 10 pages, 3 figures, 3 tables, tech report
Abstract:Modern image generation models are rapidly growing in size to achieve high-fidelity image synthesis. However, they are gradually becoming unaffordable: their enormous parameter counts and computation budgets lead to massive resource requirements and GPU memory footprints. In this paper, we propose TMP, the first Tree-structured Mixed-policy Pruning framework that generalizes across prevalent image tasks (T2I and TI2I) and architectures (Mixture-of-Experts (MoE) and Diffusion Transformer (DiT)). It can be applied to step-distilled models as a final stage. We perform experiments on the current open-source SOTA model HunyuanImage-3.0 instruct and the popular efficient model Z-Image turbo. The proposed pruning framework compresses HunyuanImage 3.0 from 80B to 20B parameters (a 75% reduction) while sacrificing limited generation quality. We also apply engineering optimizations that enable inference with the pruned 20B version of HunyuanImage 3.0 on a single 24GB 4090 GPU. The inference script and model weights have been integrated into the existing HunyuanImage3.0 open-source GitHub and Hugging Face repositories. In addition, we demonstrate the efficacy of TMP by compressing Z-Image turbo from 6B to 4B (33% reduction) with negligible degradation.
30. 【2606.27088】SubdivAR: Autoregressive Next-Scale Prediction for Neural Mesh Subdivision
Link: https://arxiv.org/abs/2606.27088
Authors: Huipeng Guo, Zikai Song, Hang Long, Jielei Zhang, Wenbing Li, Junkai Lin, Tianhao Zhao, Jinshen Zhang, Tianle Guo, Wei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: digital asset creation, converting coarse, asset creation, fundamental operation, operation for converting
Comments:
Abstract:Mesh subdivision is a fundamental operation for converting coarse, editable meshes into high-resolution surfaces, with broad applications in digital asset creation. Classical rule-based schemes rely on fixed local refinement rules and often produce over-smoothed surfaces. Recent neural subdivision methods improve detail synthesis, but remain constrained by local modeling and exhibit limited generalizability. We present SubdivAR, a neural mesh subdivision framework based on our proposed Mesh Autoregressive Representation (MAR). MAR arranges meshes at different subdivision levels into an ordered scale sequence, reformulating subdivision as autoregressive next-scale prediction. To support this formulation, we introduce a Hybrid Topology-Aware Transformer that combines global semantic attention with topology-constrained local feature aggregation. SubdivAR adopts a next-scale coordinate prediction paradigm, regressing vertex offsets at each refinement stage to preserve subdivision topology while recovering fine-grained geometric details. To enable reliable learning, we construct FII-40K, a curated dataset of nearly 40,000 high-quality meshes with multi-level subdivision supervision. Experiments show that SubdivAR outperforms state-of-the-art baselines, reducing Hausdorff Distance and Chamfer Distance by 18.8% and 14.2%, respectively, and demonstrates strong robustness on complex open-surface geometries.
31. 【2606.27084】Pseudo-Text-Conditioned 3D Grounding DINO for Organ Localization in Abdominal CT
Link: https://arxiv.org/abs/2606.27084
Authors: Siqi Chen, Han Gong, Keyi Hou, Jingxuan Yang, Sheethal Bhat, Andreas Maier
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: downstream trauma analysis, provide spatial priors, Reliable organ localization, Reliable organ, trauma analysis
Comments: 24 pages, 17 figures
Abstract:Reliable organ localization in abdominal CT can provide spatial priors for downstream trauma analysis. We propose CT-3GDINO, a lightweight 3D detector that adapts a Grounding-DINO-style query-based architecture to fixed organ localization using frozen pseudo-text class tokens instead of a real text encoder. The model combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict normalized 3D boxes for liver, spleen, left kidney, right kidney, and bowel. We train and evaluate on 193 matched RSNA/RATIC CT volumes with segmentation-derived boxes. The best multi-scale model, trained from scratch, achieves 0.5830 overall top-1 class-wise mAP over 3D IoU thresholds from 0.1 to 0.7, outperforming fixed- and trainable-backbone classification-pretrained variants with 0.5570 and 0.4657 mAP. Performance is strong for coarse localization, with 0.9649 AP at IoU 0.1, but remains limited for strict box alignment, with 0.1552 AP at IoU 0.7. These results establish CT-3GDINO as an open-source baseline for pseudo-text-conditioned 3D organ localization and motivate future work on localization-aware pretraining, richer multimodal conditioning, and injury-focused detection.
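The gap between AP at IoU 0.1 and AP at IoU 0.7 reported above is easier to read with the overlap measure written out. The following is a minimal sketch of 3D IoU for axis-aligned boxes; the `(x1, y1, z1, x2, y2, z2)` box layout and the function name `iou_3d` are assumptions for illustration, not details taken from the paper.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[:3], a[3:], b[:3], b[3:]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # the boxes are disjoint along this axis
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)

# A prediction that covers only half of the ground-truth extent along one axis.
ground_truth = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
prediction = (0.0, 0.0, 0.0, 1.0, 1.0, 0.5)
score = iou_3d(ground_truth, prediction)
```

In this example the prediction would count as a hit at a threshold of 0.1 but as a miss at 0.7, which is the kind of case that separates coarse localization from strict box alignment.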
32. 【2606.27071】PanoImager: Geometry-Guided Novel View Synthesis and Reconstruction from Sparse Panoramic Views
Link: https://arxiv.org/abs/2606.27071
Authors: Zhisong Xu, Takeshi Oishi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sensing offers wide, panoramas remains challenging, weak-parallax motion, Panoramic sensing offers, offers wide
Comments: IROS 2026
Abstract:Panoramic sensing offers wide field-of-view coverage, yet 3D reconstruction from sparse panoramas remains challenging under rotation-dominant, weak-parallax motion. In such regimes, SfM/SLAM initialization is often ill-conditioned and unreliable. We present PanoImager, an SfM-free framework that combines feed-forward pose/depth priors, geometry-conditioned diffusion view completion, and depth-guided 3DGS optimization. Given only a few panoramic images, PanoImager decomposes them into local perspective views, synthesizes auxiliary observations to enrich sparse evidence, and stabilizes Gaussian optimization for improved cross-view consistency. Experiments on multiple benchmarks show improved stability under extreme sparsity, suggesting PanoImager as an offline/background component for map refinement when SfM/SLAM fails to initialize.
33. 【2606.27023】Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
Link: https://arxiv.org/abs/2606.27023
Authors: Eren Senoglu, Federico Toschi, Nicolo Brunello, Andrea Sassella, Mark James Carman
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical Visual Question, Visual Question Answering, Multimodal large language, produce overconfident outputs, existing verbalized confidence
Comments:
Abstract:Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text-only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training-based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier-style calibration term, an anchor regularizer that prevents confidence collapse toward extreme values, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal is derived from a $2 \times 2$ factorial perturbation design that crosses image presence with text integrity, probing the reliance of the model on visual modality input versus language priors. Finally, a top-K KL divergence regularizer is used to protect the answering ability of the model during finetuning. Across three Medical VQA benchmarks and two architectures (MedGemma 4B IT and Qwen2 VL 7B Instruct), our method reduces calibration error by 60% or more, and improves discrimination by 26% or more, while preserving predictive accuracy. On average across benchmarks, the technique outperforms prompting-based, sampling-based, and training-based approaches, and ablation experiments confirm that each component of the loss function is indeed necessary for improving the calibration. All code for the experiments is publicly available.
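As background for the Brier-style calibration term mentioned in this abstract, here is a minimal sketch of the standard Brier score applied to verbalized confidences. It is the textbook definition only; the paper's actual composite loss (anchor regularizer, alignment term, KL stabilization) is not reproduced, and the function name `brier_score` and the sample numbers are illustrative assumptions.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence in [0, 1] and 0/1 correctness."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)

# Hypothetical model that is right half the time.
outcomes = [1, 0, 1, 0]
overconfident = brier_score([0.95, 0.95, 0.95, 0.95], outcomes)  # always claims 95%
calibrated = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)         # claims 50%
```

Lower is better: the overconfident answers are penalised heavily on the two wrong cases, so their score is worse than that of the answers whose stated confidence matches the actual hit rate.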
34. 【2606.27018】On-board Remote-Sensing Foundation Models for Unsupervised Change Detection of Disaster Events
Link: https://arxiv.org/abs/2606.27018
Authors: S. Ramírez-Gallego
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Remote Sensing Foundation, Sensing Foundation Models, Remote Sensing, Earth Observation, autonomously trigger high-resolution
Comments:
Abstract:Remote Sensing Foundation Models (RSFMs) have emerged as a powerful alternative to supervised models for Earth Observation, allowing satellites to autonomously trigger high-resolution captures or adjust tasking parameters upon detecting an anomaly, thereby maximizing the utility of the mission's limited power and computational resources. RSFMs are versatile, unified encoders that optimize onboard storage for multiple orbital applications while ensuring high-fidelity feature extraction. In particular, unsupervised change detection with RSFMs offers a well-informed and transformative path for disaster monitoring without expensive labels. In this paper, we present a novel unsupervised detection method based on ResNet (RSFM) + FPN which identifies a wide spectrum of anomalies by detecting subtle semantic shifts in the latent space between successive orbital passes. By relying on an untrained FPN architecture and its intrinsic priors, the system achieves efficient image-level generation and higher resolution mapping with minimal effort (training-free) compared to previous proposals (patch-based, trained). And by replacing tailored models with RSFMs, we can achieve comparable results through an approach that eliminates the need for bespoke training and extensive development effort and adds customization, while ensuring high-performance generalization across diverse terrains and sensors.
35. 【2606.26994】Event-Aware Instructed Assistant for Referring Video Segmentation
Link: https://arxiv.org/abs/2606.26994
Authors: Jinyu Liu, Henghui Ding, Shuting He, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Existing referring video, single event consisting, multiple images, video, multiple distinct events
Comments: IEEE Transactions on Image Processing
Abstract:Existing referring video segmentation methods often treat a video as a single event consisting of multiple images, overlooking the fact that a video typically contains multiple distinct events. Under such a mechanism, the model needs to directly understand all the complex content in the video and text, which can easily lead to confusion and hallucinations. To address this issue, we propose to decompose a video into a set of simple events using learnable Event Queries, and to understand complex video content in an event-by-event, easy-to-understand manner. This is based on the observation that natural language expressions often divide a video into distinct, text-related segments, each representing a separate event within a compound event. We introduce EVIS, an Event-Aware Video Instructed Segmentation Assistant, which utilizes text-guided Event Queries to partition a video into simple events, extracting event-aware visual-text features to achieve a hierarchical understanding of the video. Additionally, we propose Object-Pixel-Hybrid Learning, which enables MLLMs to track targets in long-term videos by integrating fine-grained pixel features with prior object queries. Extensive experimental results on 5 public benchmarks demonstrate EVIS's strong performance in addressing the referring video segmentation task.
36. 【2606.26984】Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation
Link: https://arxiv.org/abs/2606.26984
Authors: Jinyu Liu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable strides, remarkable strides, Unified multimodal models, multimodal models capable, achieved remarkable
Comments: ICML 2026
Abstract:Unified multimodal models capable of both understanding and generation have achieved remarkable strides. However, despite their unified designs, existing evaluations typically assess understanding and generation capabilities in isolation, overlooking the synergy between comprehension and generation. To bridge this gap, we introduce Unison, a comprehensive benchmark comprising 2,169 high-quality unified task samples, designed to evaluate joint understanding and generation in unified multimodal models. Unison offers three key strengths: 1) Comprehensive Dimensions: Unison encompasses internal consistency, understanding-guided generation, generation-guided understanding, and mutual enhancement to enable holistic evaluation. 2) Diagnostic Evaluation: it provides both unified and decoupled tracks for understanding and generation, allowing fine-grained attribution of failure modes and quantitative analysis of the gains from unified modeling. 3) Human Alignment: we also introduce Unison-Judge, an evaluation model well aligned with human judgments to ensure reliable assessment. Based on systematic evaluations of state-of-the-art models on Unison, we uncover critical limitations in current unified multimodal systems and highlight promising directions for future research. Codes, Unison and Unison-Judge are publicly available at this https URL.
37. 【2606.26973】Geometric Gradient Rectification for Safe Open-Set Semi-Supervised Learning
Link: https://arxiv.org/abs/2606.26973
Authors: Jiahe Chen, Qian Shao, Qiyuan Chen, Jiaying He, Jintai Chen, Jian Wu, Hongxia Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: leverage unlabeled data, semi-supervised learning aims, outliers while maintaining, in-distribution classes, aims to leverage
Comments: ECCV 2026
Abstract:Open-set semi-supervised learning aims to leverage unlabeled data that may contain out-of-distribution outliers while maintaining performance on in-distribution classes. Existing methods mainly follow two paradigms: filtering suspicious samples or incorporating unlabeled objectives with soft weighting. We argue that both face a common trade-off: aggressive filtering can discard informative but hard ID samples, whereas utilization can introduce auxiliary gradients that conflict with supervised learning when pseudo labels are wrong. We therefore shift the focus from sample selection to gradient-level control. We propose Geometric Gradient Rectification (GGR), a plug-in framework that uses the supervised gradient as an anchor and projects conflicting auxiliary gradients onto an admissible region in gradient space. This makes the applied auxiliary update first-order non-opposing within the rectified coordinate block while preserving orthogonal components that may still carry useful representation signals. We further extend GGR with subspace-aware rectification to stabilize the anchor under noisy mini-batch gradients. Experiments on CIFAR and ImageNet benchmarks show that GGR improves representative OSSL baselines in most settings and yields gains in both closed-set generalization and open-set robustness. Code will be available at this https URL.
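The idea of anchoring on the supervised gradient and removing only the opposing part of an auxiliary gradient can be sketched as a half-space projection, in the style of gradient surgery (PCGrad). This is a simplified stand-in, not the paper's GGR procedure: the abstract does not specify the admissible region or the subspace-aware variant, and the function names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def rectify(aux_grad, sup_grad):
    """Project aux_grad onto the half-space {g : <g, sup_grad> >= 0}.

    If the auxiliary gradient does not oppose the supervised anchor,
    it is returned unchanged. Otherwise the component along sup_grad
    is removed and the orthogonal component is kept.
    """
    inner = dot(aux_grad, sup_grad)
    sq_norm = dot(sup_grad, sup_grad)
    if inner >= 0.0 or sq_norm == 0.0:
        return list(aux_grad)
    scale = inner / sq_norm
    return [a - scale * s for a, s in zip(aux_grad, sup_grad)]
```

After rectification the inner product with the supervised gradient is zero or positive, which is one concrete reading of "first-order non-opposing".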
38. 【2606.26970】Computer Vision for MOBA Analytics: A Dataset and Baseline for Visibility Analysis in Dota 2
Link: https://arxiv.org/abs/2606.26970
Authors: Ricardo da Rocha Carvalho, Eloísa Oliveira, Luiz Bernardo Martins Kummer, Emerson Cabrera Paraiso, Rayson Laroca
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Online Battle Arena, Multiplayer Online Battle, Battle Arena, Multiplayer Online, Online Battle
Comments: Accepted for presentation at the 2026 Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames)
Abstract:Introduction: Most Multiplayer Online Battle Arena (MOBA) analytics studies rely on structured data, which does not directly capture what each team could actually see during a match. Objective: This work introduces Dota2-Vis, a video-based dataset, and a baseline pipeline for visibility analysis in professional Dota 2 matches. Methodology: The dataset comprises all 144 matches from The International 2025, recorded from both team perspectives, totaling 288 Full HD videos, together with 2,477 manually annotated minimap images. We evaluate multiple variants of a modern object detector for player-icon detection and use the best-performing model to estimate opponent-visible player presence over time. Results: YOLO11l (large) achieved the best overall performance, reliably identifying player icons even in dense and visually cluttered minimap scenes. The resulting visibility curves reveal player, hero, role, and team-level patterns that complement conventional MOBA analytics, highlighting behavioral differences that are difficult to obtain from structured data alone. The dataset and code are publicly available at this https URL.
39. 【2606.26969】Einstein World Models
Link: https://arxiv.org/abs/2606.26969
Authors: Munachiso Samuel Nwadike, Zangir Iklassov, Ali Mekky, Zayd M. Kawakibi Zuhri, Kentaro Inui
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: direct experience, Einstein World Models, intelligence require, require the ability, phenomena beyond direct
Comments: 12 pages (9 without references), 2 figures, 1 algorithm
Abstract:Does intelligence require the ability to reason about phenomena beyond direct experience? It is natural to suspect that some complex thought cannot be captured through language alone. Of particular concern to this work, however, is whether visualising counterfactual events can complement language as a mechanism for complex thought. We ask whether LLMs can be trained to utilise such visualisation mechanisms in a way that benefits their reasoning abilities. Motivated by this question, we propose Einstein World Models. EWMs are a blueprint for LLM-based reasoning systems that place visual-temporal rollouts inside the reasoning trace, allowing them to reason in ways that text alone may not support well. In an EWM, the LLM calls a world-module (not to be confused with a world model) to produce short rollouts of scenes under consideration. The returned rollout is treated not as the answer, but as an inspectable hypothesis that can support later reasoning. Einstein World Models extend the capability of LLMs for tool calling (such as web search or code execution) into the domain of visual thought experiments.
40. 【2606.26964】Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds
Link: https://arxiv.org/abs/2606.26964
Authors: Jiaming Bian, Bingliang Li, Yuehao Wu, Pichao Wang, Zhi Wang, Hailan Ma, Huadong Mo, Zhenhong Sun
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: models increasingly operate, world models increasingly, Story World Benchmark, models increasingly, increasingly operate
Comments: 25 pages, 17 figures
Abstract:As embodied AI and world models increasingly operate in dynamic 3D environments, visual perception must move beyond passively interpreting given observations toward actively deciding what to observe. We study this problem through camera planning in dynamic 3D story worlds, where the camera must not only generate smooth motion, but also decide what visual evidence should be acquired before it moves. We formulate this capability as Narrative-Grounded World Visual Attention, where the camera acts as an embodied observer that determines what to observe, how to compose the observation, and how to shift attention over time under narrative intent and physical 3D constraints. To realize this capability, we propose Look-Before-Move, a camera planning framework that separates observation specification from motion execution. It first builds a Semantic Observation Contract to convert directorial intent into executable visual constraints, then performs Monte Carlo Viewpoint Search to find narrative-compliant and geometrically feasible viewpoints, and finally applies Semantic Trajectory Grounding to connect selected viewpoints into continuous, collision-aware, and temporally coherent camera motion. We further construct a dynamic 3D Story World Benchmark based on StoryBlender, covering 50 stories, 457 scenes, and 1585 shots with animated characters, semantic scene configurations, and executable 3D environments. Experiments show that our framework improves subject perception, intent consistency, and trajectory quality over representative baselines, demonstrating the importance of organizing visual attention before generating camera motion.
41. 【2606.26947】Scaling Multi-Reference Image Generation with Dynamic Reward Optimization
Link: https://arxiv.org/abs/2606.26947
Authors: Wenwang Huang, Yusen Fu, Junjie Wang, Mengfei Huang, Yulin Li, Gan Liu, Jing Cai, Yancheng He, Zhuotao Tian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: complex MRIG tasks, multi-reference image generation, complex MRIG scenarios, complex MRIG, personalized image generation
Comments: Accepted by ECCV 2026
Abstract:While personalized image generation has achieved remarkable progress, multi-reference image generation (MRIG) remains a challenging task. Most existing benchmarks fail to adequately evaluate complex MRIG scenarios, hindering further progress in this area. To better assess model performance on complex MRIG tasks, we introduce OmniRef-Bench, a benchmark that covers complex combinations of reference image types and a large number of reference images. Evaluations on OmniRef-Bench show that mainstream open-source models struggle in complex MRIG scenarios, and their performance deteriorates significantly as the number of mixed-type reference images increases. To address this issue, we propose DyRef, a two-stage training framework. In the first stage, supervised fine-tuning equips the model with the basic capability to handle complex MRIG tasks. In the second stage, we introduce Difficulty-aware Advantage Reweighting (DAR) and Discriminative Reward Scaling (DRS). DAR dynamically adjusts the optimization objective to improve performance when handling a large number of mixed-type reference images. DRS enlarges intra-group reward differences for more effective policy optimization. Experiments demonstrate that DyRef significantly improves the performance of open-source models on OmniRef-Bench and single-image editing benchmarks, demonstrating the effectiveness and generalization capability of our approach.
42. 【2606.26942】TraMP-LLaMA: Generative Interpretability with Decoupled Instruction Tuning for Facial Expression Quality Assessment
Link: https://arxiv.org/abs/2606.26942
Authors: Shuchao Duan, Alan Whone, Hossein Rahmani, Jun Liu, Majid Mirmehdi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Existing facial expression, expression quality assessment, facial expression quality, methods typically produce, Existing facial
Comments:
Abstract:Existing facial expression quality assessment (FEQA) methods typically produce only a severity score, without explicitly communicating the observable facial motion evidence that supports the prediction. This limits interpretability and makes it difficult to inspect the basis of model outputs in Parkinson's disease assessment. To address this gap, we propose TraMP-LLaMA, a unified multimodal framework that jointly predicts severity scores and generates structured textual reports from facial motion cues. The framework integrates RGB appearance and landmark trajectory cues, and adopts a decoupled instruction-tuning strategy to reduce task interference between severity prediction and language generation. To support this task, we further extend the PFED5 dataset with expert-guided textual motion descriptions and construct PFED5-plus. Experiments on PFED5-plus show that TraMP-LLaMA outperforms competitive video-language baselines in report generation and achieves the best severity prediction performance among the compared methods under joint multi-expression training, improving Spearman's rank correlation by at least 4.39 percent over all competing methods. The text annotations and code are available at this https URL.
43. 【2606.26938】Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE
Link: https://arxiv.org/abs/2606.26938
Authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: architectures have emerged, powerful paradigm, paradigm for scaling, scaling diffusion models, salient tokens
Comments: ECCV 2026
Abstract:Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling diffusion models in visual generation. Recent advancements have focused on adaptively allocating computational resources across diverse tokens to improve efficiency and performance. However, we identify a routing assignment problem in existing diffusion MoE frameworks: the router fails to accurately allocate more computational resources to salient tokens. Our analysis attributes this failure to the router's reliance on noise-corrupted latent features throughout the denoising process. Such stochastic noise obscures the critical structural and textural information, thereby preventing the router from effectively distinguishing salient tokens. To address this, we propose SharpMoE, a post-training framework with a saliency-harnessing accurate routing mechanism, which utilizes clean latent features as a noise-free guidance signal for routing. By bypassing the noise-distorted inputs, SharpMoE provides the router with clear saliency guidance, enabling the identification of salient tokens even in high-noise stages. Furthermore, we introduce a trajectory routing loss to constrain the compute allocation throughout the multi-step denoising trajectory, ensuring precise resource allocation along the generation rollout. Extensive experiments demonstrate that SharpMoE serves as a versatile, plug-and-play solution that further enhances the pretrained, converged MoE models, achieving state-of-the-art performance in visual generation.
44. 【2606.26930】PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation
Link: https://arxiv.org/abs/2606.26930
Authors: Xiaomin Li, Qian Liang, Yinan Li, Ying Zhang, Chen Li, Jing Lyu, Huchuan Lu, Xu Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Relative Policy Optimization, Group Relative Policy, Reinforcement Learning, Relative Policy, Policy Optimization
Comments:
Abstract:Reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly advanced text-to-image post-training. However, current methods often favor superficial aesthetics, such as over-saturated colors, leaving critical flaws like AI artifacts and biological implausibilities unresolved. We attribute these limitations to two primary factors: (1) the absence of real images during post-training confines GRPO sampling to the original distribution, failing to break inherent generative boundaries; (2) the optimization process lacks specific rewards targeting fine-grained artifacts like overly oily skin and other AI artifacts. To address these limitations, we propose PortraitGen, a novel framework tailored for photorealistic portrait generation. First, we break inherent generative boundaries by directly introducing real images into the GRPO sampling groups, where image inversion is employed to obtain their transition probabilities and latents. Second, to explicitly steer the model toward photorealism, we introduce a complementary dual-reward mechanism: OmniReward for general quality and AI-Portrait for human-centric fidelity. Furthermore, we curate PortraitBench, a comprehensive portrait-centric benchmark. Extensive experiments demonstrate that PortraitGen significantly outperforms existing baselines, effectively suppressing AI artifacts and achieving unprecedented photorealism.
45. 【2606.26916】PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2606.26916
Authors: Kexu Cheng, Zicheng Liu, Mingju Gao, Chunhe Song, Hao Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Developing physically aware, significant challenge due, Developing physically, physically aware video, aware video generation
Comments: Accepted to ECCV 2026
Abstract:Developing physically aware video generation models remains a significant challenge due to the difficulty in capturing diverse physical phenomena, such as thermal dynamics, mechanics, and optics. In this work, we introduce PhysRAG, a novel pipeline that enhances physical awareness in video generation through Retrieval-Augmented Generation (RAG). To address the issue of limited high-quality data, we design a two-stage data filtering pipeline based on the WISA-80K dataset, resulting in a curated set of 7K high-quality videos for training. Furthermore, we construct a physical video database and develop a mechanism to inject physical knowledge into a video diffusion model using learnable queries. Our method achieves state-of-the-art performance in both visual quality and physical rule compliance, surpassing existing models in benchmarks such as PhyGenBench and VBench. We conduct extensive ablation studies to validate the effectiveness of our key components, including the data filtering pipeline, RAG mechanism, and method for physical information extraction. To facilitate future research, our code, data, and models are prepared for release at this https URL.
46. 【2606.26913】Neural Texture Compression using Hypernetworks
Link: https://arxiv.org/abs/2606.26913
Authors: Belcour Laurent
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: small Multi-Layer Perceptron, based shading model, physically based shading, Multi-Layer Perceptron decoder, per-material texture representations
Comments: 8 pages, 12 figures, conference
Abstract:Recent work on neural texture compression has demonstrated that it is possible to learn small, per-material texture representations (composed of latent textures and a small Multi-Layer Perceptron decoder) that can be decoded in real-time during shading to reproduce the input to a physically based shading model. However, existing methods require performing gradient-descent optimization per material for a given MLP and latent configuration. In this work, we train a single hypernetwork that outputs both the latent features and the MLP's weights and biases. Though the solution space is high-dimensional, this approach produces results comparable in quality to the current reference neural texture compressors. We further extend this approach to infer multiple decoders at once or even produce decoders that learn super-resolution.
47. 【2606.26907】Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation
Link: https://arxiv.org/abs/2606.26907
Authors: Zekai Zhang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiaoyue Chen, Xiao Xu, Yan Shu, Yanran Zhang, Yixian Xu, Yuxiang Chen, Zhendong Wang, Zihao Liu, Zikai Zhou, Huishuai Zhang, Dongyan Zhao, Chenfei Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable progress, remarkable progress, Context, achieved remarkable, struggle with real-world
Comments:
Abstract:While text-to-image (T2I) models have achieved remarkable progress, they struggle with real-world requests that are often underspecified, implicit, or dependent on up-to-date knowledge. We identify this challenge as the Context Gap: the mismatch between the user context and the sufficient generation context for T2I models. To bridge this gap, we propose Qwen-Image-Agent, a unified agentic framework that integrates plan, reason, search, memory and feedback in a context-centric manner. Qwen-Image-Agent treats user input as partial context and progressively constructs the generation context through Context-Aware Planning and Context Grounding. Specifically, Context-Aware Planning identifies missing context and plans how it should be acquired and used, while Context Grounding gathers this context from reason, search, memory, and feedback. To evaluate agentic image generation, we further introduce Image Agent Bench (IA-Bench), a benchmark covering four core image agent capabilities: Plan, Reason, Search, and Memory. Experiments on IA-Bench, Mindbench and WISE-Verified show that Qwen-Image-Agent outperforms strong baselines and achieves state-of-the-art performance.
48. 【2606.26904】Confidence-Aware Tool Orchestration for Robust Video Understanding
Link: https://arxiv.org/abs/2606.26904
Authors: Yangfan He, Yujin Choi, Jaehong Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Blind Trust Problem, language models implicitly, models implicitly assume, Video reasoning language, reasoning language models
Comments: Project page: [this https URL](https://rova-v2.github.io/)
Abstract:Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
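The shared evidence format described above (a concrete prediction, temporal grounding, and a calibrated reliability score, grouped into high/medium/low tiers) can be pictured with a small data-structure sketch. The class name, field names, tool names, and the two tier thresholds are all hypothetical; the abstract does not give the actual interface or cut-offs.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Evidence:
    tool: str                       # which perception tool produced this
    prediction: Any                 # e.g. a box, trajectory, text, or label
    time_span: Tuple[float, float]  # temporal grounding in seconds
    reliability: float              # calibrated score in [0, 1]


def tier(reliability: float, high: float = 0.8, low: float = 0.4) -> str:
    """Map a reliability score to a tier (thresholds are assumptions)."""
    if reliability >= high:
        return "high"
    if reliability >= low:
        return "medium"
    return "low"


def group_by_tier(items: List[Evidence]) -> Dict[str, List[Evidence]]:
    """Bucket evidence so a reasoner can weight each tier differently."""
    tiers: Dict[str, List[Evidence]] = {"high": [], "medium": [], "low": []}
    for ev in items:
        tiers[tier(ev.reliability)].append(ev)
    return tiers
```

Keeping every tool behind one record type means the synthesis step only needs to look at `reliability`, regardless of whether the prediction is a bounding box or recognized text.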
49. 【2606.26898】Tractography-Driven Synthetic Data Generation for Fiber Bundle Segmentation in Tracer Histology
Link: https://arxiv.org/abs/2606.26898
Authors: Kyriaki-Margarita Bintsi, Sparsh Makharia, Yaël Balbastre, Joselyn Romero Avila, Julia F. Lehman, Suzanne N. Haber, Anastasia Yendiki
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Diffusion MRI, enables non-invasive reconstruction, tractography enables non-invasive, white-matter pathways, limited by indirect
Comments: MICCAI 2026
Abstract:Diffusion MRI (dMRI) tractography enables non-invasive reconstruction of white-matter pathways, but its accuracy is fundamentally limited by indirect, low-resolution measurements of axonal organization. Tracer injection studies in non-human primates provide a gold standard for validating dMRI tractography. This, however, requires time-consuming manual annotation of fiber bundles in histology sections. We propose a synthetic-data augmented framework for automated fiber bundle segmentation in macaque tracer histology. Our approach uses ex vivo dMRI tractography as a generative prior to synthesize 2D image patches for training. This provides us with sufficiently realistic foreground texture, which we compose with backgrounds from blockface photos and diversify via domain randomization. A 2D U-Net is trained on mixed real and synthetic patches. Experiments on held-out brains demonstrate improved generalization across brains and fiber bundle densities compared to training with real data only. Training with synthetic data only leads to poor performance, underscoring the need for real supervision. Overall, our approach achieves performance comparable to the state-of-the-art while requiring 3x less manually annotated data.
50. 【2606.26894】Modeling Local, Global, and Cross-Modal Context in Multimodal 3D MRI
Link: https://arxiv.org/abs/2606.26894
Authors: Minh Duc Do, Tillmann Rheude, Noel Kronenberg, Roland Eils, Benjamin Wild
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limited sample sizes, sample sizes typical, data spanning multiple, spanning multiple co-registered, neuroimaging studies relative
Comments:
Abstract:Brain MRI poses a fundamental challenge for machine learning: models must learn from high-dimensional 3D data spanning multiple co-registered modalities, despite the limited sample sizes typical of neuroimaging studies relative to the diversity in anatomy, pathology, and acquisition conditions. While multimodal imaging provides complementary information critical for clinical interpretation, effectively integrating these signals remains difficult. We propose Multimodal Intra- and Cross-Context Vision Transformer (MICViT), a 3D vision transformer that explicitly models both modality-specific representations and cross-modal interactions across local and global contexts. Concretely, MICViT combines four attention mechanisms: modality-specific local and global attention for intra-modal feature learning, and cross-modal local and global attention to capture interactions between modalities. We evaluate MICViT on brain age prediction across three heterogeneous datasets (UK Biobank, n=41,404; SOOP, n=1,062; Cam-CAN, n=613) using multiple MRI modalities (e.g. T1, FLAIR, DWI, SWI). MICViT consistently outperforms state-of-the-art CNN and transformer baselines in 3D settings. Notably, it benefits more strongly from multimodal inputs, yielding larger performance gains as additional modalities are incorporated. These results demonstrate that explicitly modeling intra- and cross-modal interactions is key to unlocking the full potential of multimodal brain MRI, highlighting a promising direction for representation learning in neuroimaging.
51. 【2606.26891】Bridging Vision and Language Concepts through Optimal Transport Semantic Flow
Link: https://arxiv.org/abs/2606.26891
Authors: Chenyang Zhang, Anqi Dong, Guangming Zhu, Nuoye Xiong, Siyuan Wang, Lin Mei, Liang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: effectiveness fundamentally depends, promise transparent reasoning, Concept Bottleneck Models, Concept Bottleneck Model, Flow Concept Bottleneck
Comments:
Abstract:Concept Bottleneck Models (CBMs) promise transparent reasoning by predicting through human-interpretable concepts, yet their effectiveness fundamentally depends on how well visual and textual representations are aligned or matched. Existing vision-language CBMs often rely on pre-aligned encoders or global cosine similarity, which obscures fine-grained concept localization and fails to reflect true semantic geometry. In this work, we rethink concept alignment as a dynamic cross-modal transport process instead of static projection and propose the Optimal Transport Flow Concept Bottleneck Model (OTF-CBM). It first learns a data-driven semantic cost via Inverse Optimal Transport to measure cross-modal distances, and then performs unbalanced optimal-transport-based flow matching to model semantic transitions between visual patches and textual concepts. With velocity-based concept activation, OTF-CBM captures interpretable geometric relations without ODE integration. Experiments further show that OTF-CBM achieves superior classification accuracy and concept faithfulness, offering a new geometric and dynamical perspective for interpretable cross-modal reasoning.
52. 【2606.26885】RIS-Assisted Proactive Handover for Reliable mmWave Wireless Networks
Link: https://arxiv.org/abs/2606.26885
Authors: Alaa Adnan, Mohammad Al-Quraan, Ahmed Zoha, M. Majid Butt, Sami Muhaidat, Muhammad Ali Imran, Marco Di Renzo, Lina Mohjazi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: networks are highly, highly susceptible, RIS, RIS elements, Millimeter-wave
Comments:
Abstract:Millimeter-wave (mmWave) networks are highly susceptible to line-of-sight (LoS) blockages. Vision-aided wireless communications (VAWC) enable proactive handovers (PHO) to mitigate such blockages; however, PHO becomes challenging when no nearby base station (BS) is available. In such cases, reconfigurable intelligent surfaces (RIS) can be used to restore connectivity. To ensure timely PHO, the RIS configuration time must be taken into account, as the large number of RIS elements can limit responsiveness in time-sensitive scenarios. This work proposes a novel RIS-assisted PHO approach that optimizes the number of allocated RIS elements to balance signal processing complexity and link quality under handover timing constraints, making the RIS-assisted link more energy-efficient. An optimization problem based on particle swarm optimization (PSO) is formulated to determine the optimal end-to-end RIS link setup that runs offline to bypass latency constraints. Results show that reducing the number of RIS elements by 12% leads to a 10% decrease in dissipated energy without compromising the signal-to-noise ratio (SNR). Moreover, the RIS-assisted link achieves a 15-30 dB improvement in blocked regions while maintaining accurate PHO timing.
53. 【2606.26872】SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
Link: https://arxiv.org/abs/2606.26872
Authors: Yankai Yang, Yancheng Long, Wei Chen, Xingyu Lu, Hongyang Wei, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent online reinforcement, online reinforcement learning, Recent online, substantially improved image, online reinforcement
Comments:
Abstract:Recent online reinforcement learning has substantially improved image editing quality. However, existing Flow-GRPO-style methods usually rely on a single whole-image reward, which makes fine-grained editing optimization difficult. We observe that a key obstacle in image editing is this spatial uniformity assumption: a whole-image reward cannot distinguish how different spatial regions contribute to image quality. To address this issue, we propose SpatialFlow-GRPO, a training framework that introduces spatially fine-grained reward feedback. The framework converts region-aware rewards into semantic-region-level optimization signals and aligns region advantages with the corresponding latent positions during policy updates. We also train a region-aware reward model, SFReward, construct SFReward-14K with region-annotated editing samples, and introduce MultiEditBench to evaluate multi-region editing ability. On OmniGen2 and FLUX.2-klein-4B, SpatialFlow-GRPO outperforms Flow-GRPO on GEdit-Bench, ImgEdit-Bench, and MultiEditBench. The results show that SpatialFlow-GRPO converts local feedback into spatially aligned update signals and improves editing quality.
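One way to picture region-level advantages aligned with latent positions is the sketch below. It assumes every sample in a group is scored on the same set of regions and uses the usual GRPO-style group normalisation (reward minus group mean, divided by group standard deviation) separately per region. The function names, the shared-region assumption, and the integer region-id grid are illustrative choices, not the paper's implementation.

```python
import math


def region_advantages(group_rewards, eps=1e-8):
    """Normalise rewards per region across a sampling group.

    group_rewards[i][r] is the reward of sample i in region r.
    Returns advantages with the same shape.
    """
    n = len(group_rewards)
    n_regions = len(group_rewards[0])
    advantages = [[0.0] * n_regions for _ in range(n)]
    for r in range(n_regions):
        column = [sample[r] for sample in group_rewards]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        for i in range(n):
            advantages[i][r] = (column[i] - mean) / (std + eps)
    return advantages


def advantage_map(sample_advantages, region_ids):
    """Broadcast one sample's region advantages onto a latent grid.

    region_ids is an H x W grid giving the region index of each position.
    """
    return [[sample_advantages[rid] for rid in row] for row in region_ids]
```

With a whole-image reward, every latent position of a sample would share one advantage; here positions in a well-edited region and a poorly edited region of the same sample can receive opposite signs.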
54. 【2606.26863】Rolling Shutter Relative Pose Estimation Made Practical
Link: https://arxiv.org/abs/2606.26863
Authors: Daniel Barath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cameras equip virtually, making RANSAC-based robust, relative pose estimation, RANSAC-based robust estimation, robust estimation prohibitively
Comments:
Abstract:Rolling shutter (RS) cameras equip virtually all consumer devices, yet RS-aware relative pose estimation has remained impractical: the state-of-the-art solver requires a minimum of 20 point correspondences, making RANSAC-based robust estimation prohibitively expensive due to the exponential dependence of the iteration count on the sample size. We make RS relative pose estimation practical by introducing affine correspondences (ACs) into the RS two-view geometry. We derive novel RS-corrected affine constraints that account for the coupling between point perturbations and the row-dependent essential matrix, providing two equations per correspondence beyond the standard epipolar constraint. Building on these constraints, we develop a linearized algebraic solver that estimates pose and RS motion from only 7 ACs. The solver exploits the physical smallness of RS parameters to linearize the constraints, eliminates the 12 RS unknowns via null-space projection, and solves the remaining degree-20 system via action matrices in 1.2 ms. On the TUM RS benchmark, our method achieves the best pose and RS parameter accuracy among all tested methods and, uniquely among RS solvers, provides accurate translational velocity estimates, which are poorly conditioned from point correspondences alone due to a $\vec{v}$-$\vec{t}$ coupling. On the global-shutter EuRoC MAV dataset, the solver achieves comparable accuracy to the standard 5-point algorithm, demonstrating that it generalizes well to the GS setting. Code is at this https URL.
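The exponential dependence of the RANSAC iteration count on the sample size follows from the standard stopping criterion N = log(1 - p) / log(1 - w^s), where s is the minimal sample size, w the inlier ratio, and p the desired confidence. The sketch below evaluates it for sample sizes 5, 7, and 20, the three sizes mentioned in the abstract. The inlier ratio of 0.5 and confidence of 0.99 are assumed values for illustration, and in practice the inlier ratio of affine correspondences may differ from that of point correspondences.

```python
import math


def ransac_iterations(sample_size, inlier_ratio, confidence=0.99):
    """Iterations needed so that, with probability `confidence`,
    at least one minimal sample consists only of inliers."""
    if not 0.0 < inlier_ratio < 1.0:
        raise ValueError("inlier_ratio must lie strictly between 0 and 1")
    p_clean = inlier_ratio ** sample_size  # chance a sample is all inliers
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_clean))


if __name__ == "__main__":
    for s in (5, 7, 20):
        print(s, ransac_iterations(s, inlier_ratio=0.5))
```

Under these assumed values, going from 20 correspondences to 7 reduces the required iteration count from millions to a few hundred.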
55. 【2606.26850】Appearance-Preserving Refinement of Generated 3D Assets for Monochromatic Fabrication
Link: https://arxiv.org/abs/2606.26850
Authors: Chentao Shen, Chen Jia, Mingjie Huang, Zhuang Zhang, Haisen Zhao, Xiangru Huang
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, mesh generation, generation have enabled, enabled the creation, visually realistic assets
Comments: For preprint
Abstract:Recent advances in 3D mesh generation have enabled the creation of visually realistic assets. However, much of their visual fidelity is encoded in textures rather than geometry. When such assets are fabricated using monochromatic materials, texture information is largely lost, causing visually important details to disappear even when the original geometry is faithfully preserved. A key challenge is that the geometric perturbations required to recover texture-dependent appearance cues often introduce sharp local features and high-frequency surface structures, which may increase stress concentration and fabrication risk. In this paper, we present GenMF, an appearance-oriented geometry refinement framework for monochromatic fabrication. GenMF transforms texture-dependent visual cues into geometry-induced shading effects and formulates geometry refinement as a balance between appearance preservation and fabrication-oriented robustness. To discourage structurally fragile details and narrow the gap between simulation and physical manufacturing, we further introduce a differentiable stress-aware regularization based on a learned thermal-stress predictor. Experimental results demonstrate that GenMF significantly improves appearance preservation under monochromatic rendering while reducing stress concentration under a consistent thermo-mechanical simulation setting. Physical 3D printing examples further show that the refined geometries preserve more recognizable visual details while remaining suitable for fabrication. These results suggest that appearance-aware geometry refinement provides an effective bridge between generated 3D assets and fabrication-ready monochromatic objects.
56. 【2606.26849】Liquid Fusion of Heterogeneous Representations Towards General Salient Object Detection
Link: https://arxiv.org/abs/2606.26849
Authors: Ke Chen, Ling Zhou, Guangqi Jiang, Gengshen Wu, Yi Liu, Shoukun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cutting-edge State Space, segment visually interesting, visually interesting objects, State Space Models, Salient Object Detection
Comments: 20 pages, 5 figures
Abstract:General Salient Object Detection (SOD) aims to identify and segment visually interesting objects in uni-modal or multi-modal scenes, and has recently been advanced by State Space Models (SSMs). However, a critical limitation of current approaches is their neglect of the inherent spectral biases exhibited by different neural network paradigms. A dataset-level spectral analysis of Convolutional Neural Networks (CNNs) and SSMs shows that their semantic representations are inherently complementary because their frequency preferences are complementary. Inspired by this, we harmonize heterogeneous representations from SSMs and CNNs to bridge their spectral biases for general salient object detection. To this end, drawing on the dynamic information propagation of Liquid Neural Networks (LNNs), we introduce a liquid fusion that dynamically integrates features from two backbones, VMamba and ConvNeXt; we refer to the resulting model as the Liquid Fusion Network (LFNet). Concretely, by treating the continuous VMamba features and the ConvNeXt features as evolving states and exogenous stimuli, respectively, LFNet employs a dynamic gating mechanism for content-aware feature aggregation. Crucially, this state-stimulus paradigm scales to multi-modal cues, which gives LFNet flexibility in general SOD. In addition, a Saliency-Guided Upsampling (SGU) operator propagates the features to the shallow layers, using a spectral-spatial co-design to suppress upsampling artifacts while preserving semantics. Extensive experiments across five diverse tasks (RGB, RGB-D, RGB-T, VSOD, and VDT) demonstrate that LFNet achieves state-of-the-art performance, offering a superior trade-off between detection accuracy and model efficiency. Code has been released at this https URL.
57. 【2606.26839】Ordinal Neural Collapse as a Representation Prior for Visual Navigation
Link: https://arxiv.org/abs/2606.26839
Authors: E-In Son, Jung-Taak Kim, Seung-Woo Seo
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual observations remains, robust navigation policies, navigation policies directly, vision-based robotic navigation, Learning robust navigation
Comments: 27 pages, 14 figures. Supplementary material included
Abstract:Learning robust navigation policies directly from visual observations remains a fundamental challenge in vision-based robotic navigation. In end-to-end imitation learning approaches, the visual encoder and action decoder are jointly optimized using a single action loss, which provides only an indirect supervisory signal to the encoder. This indirect supervision frequently results in the encoder learning ambiguous, action-agnostic representations. The problem is further complicated by substantial variations in scene structure and appearance across diverse environments, as well as the prevalence of visual distractors inherent to real-world navigation settings. Such action-agnostic features cause the navigation policy to produce inconsistent actions at ambiguous decision points, leading to navigation failure. To overcome these limitations, we propose ORION (Ordinal Neural Collapse for Visual Navigation), a method that explicitly organizes the encoder's representation space according to the ordinal structure of navigation actions. In the context of goal-directed navigation, ego-centric control categories from Far Left to Far Right exhibit a natural ordinal relationship in which neighboring classes share similar visual contexts, while semantically opposing classes differ substantially in appearance. We encourage class representations to be arranged sequentially along a single discriminative axis, while suppressing off-axis variance within each class. The pretrained encoder is then integrated into a diffusion-based navigation framework, and the full pipeline is fine-tuned end-to-end. Extensive experiments in both simulation and real-world settings show that ORION consistently outperforms end-to-end and neural collapse baselines in navigation success rate and goal progress, with notable gains in visually challenging scenarios such as complex multi-way intersections.
58. 【2606.26829】Identifying the Unknown: Prompt-Free Open Vocabulary Anomaly Recognition for Robot-Object Interaction
Link: https://arxiv.org/abs/2606.26829
Authors: Philipp Allgeuer, Jan-Gerrit Habekost, Stefan Wermter
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recognize previously unseen, previously unseen objects, operating in real-world, recognize previously, previously unseen
Comments: International Conference on Artificial Neural Networks 2026
Abstract:Robots operating in real-world environments must in general be able to recognize previously unseen objects. As robotic systems move toward open-world autonomy, there is a growing, yet largely unmet, need for open vocabulary object detectors that are prompt-free and efficient enough for continuous deployment. We present AnomNOVIC, a two-stage known-workspace framework that combines a masked autoencoder (MAE) trained for anomaly detection, with NOVIC, a powerful real-time prompt-free open vocabulary image classifier. The MAE produces generic object-agnostic bounding boxes, allowing NOVIC to classify salient image regions without requiring a predefined candidate class list. We evaluate AnomNOVIC against strong open vocabulary baselines in a tabletop robot-object environment featuring the NICOL humanoid robot, reaching 47.1% AP / 57.5% AP50 for prompt-free recognition, and 59.0% AP / 72.5% AP50 if class candidates are provided. Across additional datasets, including an in-the-wild test set with 48 unique objects, AnomNOVIC reaches up to 82.6% prompt-free detection and classification accuracy. These results significantly surpass all tested open vocabulary baselines, including YOLO-World-v2, OWLv2, and YOLOE.
59. 【2606.26828】Learning Adversarial Augmentation Policies for Robust Garlic Seedling Detection
Link: https://arxiv.org/abs/2606.26828
Authors: Soeun Lee, Chanho Kim, Yeji Kang, YoungKi Hong, Byeongkeun Kang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: early growth stages, effective crop management, Accurate seedling detection, Accurate seedling, early growth
Comments: 16 pages
Abstract:Accurate seedling detection during early growth stages is essential for timely replanting and effective crop management in precision agriculture. However, existing studies are mostly evaluated under relatively stable imaging conditions, such as UAV imagery or greenhouse environments, leaving robust detection under severe and spatially heterogeneous illumination in ground-based outdoor monitoring insufficiently explored. In addition, many illumination-robust detection methods rely on additional enhancement or feature-extraction modules, which increase inference-time overhead and are not tailored to seedling detection and downstream missing seedling localization. To address these gaps, we construct a new garlic seedling dataset captured using a ground-based monitoring platform under real outdoor field conditions with highly variable illumination. We further propose an illumination-robust seedling detection framework based on adversarial augmentation policy learning. The proposed method jointly optimizes a stochastic augmentation policy agent and an object detector, enabling the detector to learn robust representations under challenging visual conditions. A structural penalty is introduced to prevent unrealistic distortions while encouraging challenging augmentations during training. Extensive experiments show that the proposed approach achieves an AP$_{50}$ of 91.6%, improving the baseline by 0.9 percentage points and outperforming the previous best-performing method by 0.2 percentage points. For downstream missing seedling localization, it achieves 75.0% precision and a 67.0% F1-score, improving the baseline by 4.8 and 2.0 percentage points, respectively. These results demonstrate the effectiveness of the proposed framework for practical ground-based agricultural monitoring under complex outdoor lighting conditions without additional inference-time computational overhead.
60. 【2606.26812】Multi-modality Image Fusion under Adverse Weather: Mask-Guided Feature Restoration and Interaction
Link: https://arxiv.org/abs/2606.26812
Authors: Xilai Li, Xiaosong Li, Haishu Tan, Tao Ye, Huafeng Li, Hongbin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: exploiting complementary cues, enhances scene representation, Multi-modality image fusion, Pseudo Ground Truth, Multi-modality image
Comments: Accepted at ECCV 2026
Abstract:Multi-modality image fusion (MMIF) enhances scene representation by exploiting complementary cues from different modalities. Adverse weather, however, causes significant image degradation, disrupting feature representation and requiring simultaneous feature restoration and cross-modal complementarity. Existing methods often struggle with effective representation learning under such conditions, limiting their practical performance. To address these challenges, we propose a mask-guided MMIF method that integrates feature restoration and interaction. We first introduce "Pseudo Ground Truth" to simplify training, promoting faster and more effective feature learning. Then, we design a mask generation mechanism based on the mapping relationship between the fused result and the source images, quantifying the relative contribution of each modality during the fusion process. By incorporating the proposed mask-guided cross-modal cross-attention mechanism, the network is encouraged to selectively attend to informative features during modality interaction, mitigating the risk of overfitting to the static distribution of the "Pseudo Ground Truth". Additionally, we propose a mask-guided learning strategy and a task-coupled degradation-aware learning strategy to balance feature restoration and interaction. Extensive experiments on synthetic and real-world datasets demonstrate that our method surpasses state-of-the-art approaches in visual quality, quantitative metrics, and downstream tasks. The source code is available at this https URL.
61. 【2606.26801】Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
Link: https://arxiv.org/abs/2606.26801
Authors: Yuan Xu, Yixiang Chen, Kai Wang, Jiabing Yang, Peiyan Li, Qisen Ma, Yan Huang, Liang Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong potential, generalizable robotic manipulation, models have shown, shown strong, strong potential
Comments:
Abstract:Vision-Language-Action (VLA) models have shown strong potential for generalizable robotic manipulation. During fine-tuning, however, action supervision applies equally across all timesteps, without structured supervision on which manipulation stage the robot is in or what the next gripper-event target should be. This causes failures to concentrate around challenging gripper-event transitions. To address this, we propose StaKe, a plug-in auxiliary supervision framework that automatically derives two complementary signals from demonstration gripper states without manual annotation: a stage classifier that identifies the current manipulation stage, and a keyframe predictor that estimates the target joint action at the next gripper transition. Both are modeled as lightweight auxiliary heads that enrich the learned representations during training, while leaving the base VLA policy architecture and inference loop unchanged. Experiments on bimanual simulation and single-arm Franka real-robot tasks show that StaKe consistently improves success rates (relative gains of 14% and 56%, respectively), with larger improvements on longer-horizon tasks that involve more gripper-event transitions. Ablation studies validate each design choice, and qualitative analysis confirms that the learned representations faithfully track manipulation stages. These results indicate that structured supervision is an effective and general strategy for enhancing VLA fine-tuning in long-horizon manipulation. Project website: this https URL
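The abstract says both auxiliary signals are derived automatically from demonstration gripper states, without describing the exact rule. The sketch below shows one plausible way such labels could be produced from a binary open/closed gripper trace. The function name `derive_stage_and_keyframe`, the "count transitions so far" stage definition, and the fallback to the final timestep are all assumptions made for illustration, not the StaKe implementation.

```python
from typing import List, Tuple


def derive_stage_and_keyframe(gripper_closed: List[bool]) -> Tuple[List[int], List[int]]:
    """Hypothetical labeling rule for one demonstration.

    Returns, for every timestep t:
      - a stage id: the number of gripper open/close transitions seen up to t
      - the index of the next gripper transition after t (the keyframe whose
        action an auxiliary head could be trained to predict); once no
        transition remains, the last timestep is used as a fallback
    """
    n = len(gripper_closed)
    # A transition happens at t when the gripper state differs from t - 1.
    transitions = [t for t in range(1, n) if gripper_closed[t] != gripper_closed[t - 1]]

    stages: List[int] = []
    next_keyframe: List[int] = []
    seen = 0  # transitions at or before the current timestep
    for t in range(n):
        while seen < len(transitions) and transitions[seen] <= t:
            seen += 1
        stages.append(seen)
        next_keyframe.append(transitions[seen] if seen < len(transitions) else n - 1)
    return stages, next_keyframe


# Toy trace: open, open, closed, closed, closed, open, open.
trace = [False, False, True, True, True, False, False]
stages, keyframes = derive_stage_and_keyframe(trace)
print(stages)     # [0, 0, 1, 1, 1, 2, 2]
print(keyframes)  # [2, 2, 5, 5, 5, 6, 6]
```

In this toy trace the stage id increases at each grasp or release, and every timestep points at the next gripper event, which matches the abstract's description of a stage classifier target and a keyframe predictor target at a high level.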
62. 【2606.26795】NaviCache: Test-Time Self-Calibration Caching for Video Generation
Link: https://arxiv.org/abs/2606.26795
Authors: Zheqi Lv, Zhibo Zhu, Jinke Wang, Qi Tian, Shengyu Zhang, Zhengyu Chen, Chengxi Zang, Zhou Zhao, Fei Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Keywords: Video Diffusion Models, immense computational costs, Video Diffusion, Diffusion Models, computational costs
Comments: Published at ICML 2026: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Abstract:Video Diffusion Models (VDMs) are constrained by immense computational costs. While offline calibration-based acceleration suffers from calibration data dependency, prohibitive calibration duration, and susceptibility to distribution shifts, offline calibration-free methods eliminate these hurdles. However, since they rely on instantaneous zero-order approximations where the mapping between input and output differences varies in real time, they are susceptible to observational noise and ignore the intrinsic momentum within the diffusion trajectory. In this paper, we propose NaviCache, a plug-and-play test-time self-calibration method re-conceptualizing feature evolution as an Inertial Navigation System (INS) problem. NaviCache bridges the fundamental domain gap and the non-stationary nature of diffusion by modeling the relative coupling between input and output variations. We introduce a dual-state estimation architecture that adaptively tracks the feature change ratio and its latent drift, initialized via a specialized Initial Alignment phase. By integrating a time-dependent noise schedule with an uncertainty-aware Measurement Update mechanism, NaviCache provides a theoretically grounded mechanism for error-bounded computation skipping. Extensive experiments on the HunyuanVideo, Wan, and Open-Sora series demonstrate that NaviCache exhibits more accurate error judgment for computation skipping and achieves outstanding comprehensive performance.
63. 【2606.26794】ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP
Link: https://arxiv.org/abs/2606.26794
Authors: Sicheng Zhang, Muzammal Naseer, Binzhu Xie, Naufal Suryanto, Shi Qiu, Jamal Bentahar, Naveed Akhtar, Mubarak Shah
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: widely adopted visual, adopted visual backbones, pretraining remains dominated, descriptive image-text alignment, variants are widely
Comments: Accepted to ECCV2026
Abstract:CLIP and its variants are widely adopted visual backbones in multimodal systems, but their pretraining remains dominated by descriptive image-text alignment. As downstream applications increasingly demand visually grounded commonsense inference and compositional reasoning, it remains unclear whether CLIP-style encoders can support such reasoning without architectural changes. To address this, we present ReasonCLIP-58M, a continual pretraining framework that integrates large-scale reasoning supervision into CLIP-style models through a two-stage strategy: the first stage progressively integrates reasoning signals while preserving descriptive alignment, and the second applies category-structured reasoning supervision. To support this framework, we construct two complementary datasets and a benchmark: ReasonLite-42M, with open-form, visually verifiable reasoning captions; ReasonPro-16M, with category-specific reasoning supervision; and RCLIP-Bench for diagnostic evaluation of visually grounded reasoning. We train a family of ReasonCLIP models that improve visually grounded commonsense and compositional reasoning while also enhancing zero-shot retrieval performance. As a drop-in visual encoder for multimodal large language models such as LLaVA-NeXT, ReasonCLIP delivers consistent gains without additional inference cost, demonstrating that structured reasoning supervision enhances the expressive capacity of CLIP-style visual representations. All datasets, models, and training code are available at this https URL.
64. 【2606.26780】Event-based Gaze Control System for Accurate Real-time Spin Estimation in Professional Ball Games
Link: https://arxiv.org/abs/2606.26780
Authors: Yunpu Hu, Fabian Schilling, Valentina Cavinato, Asude Aydin, Agis Politis, Ricardo Tapiador Morales, Kirk Y.W. Scheper, Peter Dürr, Naoya Takahashi
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: ball, plays a crucial, crucial role, Spin, ball sports due
Comments:
Abstract:Spin plays a crucial role in many ball sports due to its effect on the trajectory of the ball. Vision-based estimation of the ball's spin during a game with conventional cameras is challenging due to the ball's small size, high speed, and fast rotation. To address these challenges, we propose an event-based active vision system that can track unmodified balls and measure their spin in real-time. The system consists of an event camera for its high temporal resolution and minimal motion blur, high-speed pan/tilt galvanometer mirrors to keep the ball in the field of view, and a low-latency focus-tunable telephoto lens to increase the spatial resolution on the ball and keep it in focus. To track the ball, we use a hybrid approach that combines 2D event-based detection for centering and 3D positions from a ball localization system for re-initialization. For high-accuracy spin estimation, we propose an offline method that performs contrast maximization on the sphere (s-CMax). This method achieves state-of-the-art accuracy on static balls across multiple sports (table tennis, baseball, tennis, and golf), with mean magnitude and axis errors of 2.1% and 4.0 degrees, respectively. We then develop a low-latency online method for table tennis as a case study in real-time applications. This method uses an uncertainty-aware convolutional neural network trained on pseudo-ground-truth spin labels from the offline approach, combined with a GPU-accelerated batch implementation of contrast maximization for refinement. We demonstrate reliable tracking and spin estimation with a three-view setup during professional table tennis matches, with high accuracy (8.8% magnitude and 6.4 degrees axis mismatch), 3 ms latency, and 750 Hz throughput.
65. 【2606.26778】LearniBridge: Learnable Calibration of Feature Caching for Diffusion Models Acceleration
Link: https://arxiv.org/abs/2606.26778
Authors: Xuyue Huang, Zhe Chen, Wang Shen, Xiao-Ping Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Diffusion Transformers, prohibitive computational costs, driven substantial progress, computational costs, driven substantial
Comments: Accepted to ICML 2026
Abstract:Diffusion Transformers (DiTs) have driven substantial progress in image and video generation but suffer from prohibitive computational costs. Feature caching accelerates inference by reusing intermediate representations. Existing methods rely on historical features for implementation simplicity, yet suffer from severe error accumulation at high acceleration ratios. To address this limitation, we investigate the nature of the requisite feature correction. We demonstrate that the optimal calibration update is characterized by a shared low-rank subspace across diverse prompts. Guided by this structural insight, we propose LearniBridge, a learnable calibration mechanism for feature caching that bridges multiple timesteps through lightweight LoRA updates. This mechanism enables effective calibration requiring only 3-5 training samples. Extensive experiments on image and video generation show that LearniBridge achieves up to $5.87\times$, $5.75\times$, and $4.10\times$ acceleration on FLUX, HunyuanVideo, and WAN2.1, respectively. On WAN2.1, it improves VBench by 1.28% over the previous SOTA at $4.10\times$ acceleration. Our code is available at this https URL.
66. 【2606.26769】ResilPhase: Plug-and-Play Phase Mapping and Noise-Resilient Macro-Trajectory Extrapolation for Diffusion Acceleration
Link: https://arxiv.org/abs/2606.26769
Authors: Qicheng Zhao, Yu Li, Qi Sun, Zheyu Yan
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: significant inference latency, adoption of powerful, powerful diffusion models, powerful diffusion, inference latency
Comments: Accepted by ECCV 2026
Abstract:The adoption of powerful diffusion models is hindered by their significant inference latency. Recent "cache-then-forecast" schemes alleviate this issue by accelerating DiTs using derivative-based polynomials, but they suffer from severe quality degradation at high acceleration ratios. Our analysis reveals the root cause: the discrete extrapolation is performed on representations that are misaligned with the continuous diffusion trajectory and numerically unstable. As a result, accelerated DiTs suffer from accumulated spatial errors, noisy derivative amplification, and high-order instability. We therefore reformulate accelerated inference as stable macro-trajectory extrapolation in ordinary differential equation (ODE) space. Instead of predicting intermediate features, we align forecasting with the model's Global Drift (GD), i.e., the end-to-end state evolution, thereby eliminating feature inconsistency and memory overhead. However, even this smooth macro-trajectory remains vulnerable to the derivative fallacy: its higher-order temporal derivatives are intrinsically noisy. Thus, we introduce a derivative-free barycentric Lagrange extrapolator to effectively bypass derivative instability and approximation error. We further propose a bounded Phase Mapping that regularizes the extrapolation domain, suppressing oscillatory error growth. These elements collectively constitute ResilPhase, a noise-resilient acceleration framework. Experiments on FLUX.1-dev and HunyuanVideo demonstrate state-of-the-art fidelity under aggressive acceleration ratios.
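For readers unfamiliar with the term, barycentric Lagrange interpolation evaluates the polynomial through a set of past samples directly from their values, without estimating any finite-difference derivatives. The sketch below shows the generic textbook formula applied to scalar samples and evaluated beyond the last node. It is not the ResilPhase extrapolator: the scalar state, the node positions, and the function names are assumptions for illustration, and the paper's Phase Mapping is not reproduced.

```python
from typing import List, Sequence


def barycentric_weights(nodes: Sequence[float]) -> List[float]:
    """Weights w_j = 1 / prod_{k != j} (x_j - x_k) for distinct nodes."""
    weights = []
    for j, xj in enumerate(nodes):
        prod = 1.0
        for k, xk in enumerate(nodes):
            if k != j:
                prod *= xj - xk
        weights.append(1.0 / prod)
    return weights


def barycentric_eval(nodes: Sequence[float], values: Sequence[float],
                     weights: Sequence[float], x: float) -> float:
    """Evaluate the interpolating polynomial at x (inside or outside the
    node range) using the second barycentric formula."""
    num = 0.0
    den = 0.0
    for xj, fj, wj in zip(nodes, values, weights):
        if x == xj:
            return fj  # exactly on a node: return the stored sample
        c = wj / (x - xj)
        num += c * fj
        den += c
    return num / den


# Three cached samples of f(t) = t^2 at hypothetical steps 0, 1, 2;
# forecast the value at step 3 from the samples alone.
nodes = [0.0, 1.0, 2.0]
values = [t * t for t in nodes]
w = barycentric_weights(nodes)
print(barycentric_eval(nodes, values, w, 3.0))
```

Because three samples determine a quadratic exactly, the forecast at step 3 recovers 9 up to floating-point rounding. With noisy samples, extrapolation error grows quickly outside the node range, which is presumably why the abstract pairs the extrapolator with a bounded mapping of the extrapolation domain.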
67. 【2606.26764】Anatomy-Guided Residual Motion Diffusion for Controllable 4D Cardiac MRI Synthesis
Link: https://arxiv.org/abs/2606.26764
Authors: Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Lingxiao Zhao, Xin Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: inter-device domain shifts, Developing robust artificial, artificial intelligence models, robust artificial intelligence, inter-device domain
Comments:
Abstract:Developing robust artificial intelligence models for 4D (3D + time) medical imaging is constrained by limited annotated data, inter-device domain shifts, and privacy restrictions. To address this, we propose a 4D controllable generative framework for anatomically consistent data augmentation. A semi-supervised variational autoencoder learns a compact latent representation of anatomical volumes while jointly predicting aligned segmentation masks in a unified framework. Anatomical structure is then disentangled from temporal dynamics through a cascaded latent diffusion model (LDM). A static LDM generates subject-specific anatomy conditioned on clinical priors (diagnosis and volume measures), and a subsequent motion LDM estimates residual latent motions, ensuring strict temporal coherence across the 4D sequence. The proposed approach was evaluated on cine cardiac MRI as a representative 4D imaging application. Experiments across multiple datasets demonstrate high controllability of static anatomy (Pearson r 0.8) and strong temporal coherence (FVD = 288.08). In cross-vendor generalization experiments, augmenting training sets with synthetic 4D sequences significantly improves downstream segmentation performance. Using nnU-Net, the proposed augmentation strategy improves the average Dice score by 1.4% and reduces the Hausdorff Distance by 3.0 mm compared to training on real data alone; for the left ventricle, Dice improves by 2.8% with a 5.4 mm reduction in boundary error. Overall, this framework provides a scalable and controllable solution for 4D medical image synthesis, supporting the development of more robust models with limited annotations and cross-vendor variability. Code is available at this https URL.
68. 【2606.26763】Calibrated Harmonic Overlaid Implicit Neural Representations for Multi-Dimensional Data
Link: https://arxiv.org/abs/2606.26763
Authors: Honghang Chen, Xiujun Zhang, Xiaoli Sun, Mingqing Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Implicit neural representation, neural representation, Overlaid Implicit Neural, Implicit neural, Harmonic Overlaid Implicit
Comments: ECCV2026 Accept
Abstract:Implicit neural representation (INR) has emerged as a powerful prior for multi-dimensional data (e.g., multispectral images and videos). However, most INR methods employing periodic activation functions (e.g., Sine) predominantly rely on function composition. This mechanism introduces optimization instability as network depth increases, thereby limiting their performance. Meanwhile, these methods fail to incorporate proper physical priors to effectively alleviate spectrum bias. To address these issues, inspired by the commonalities between deep periodic networks and generalized Fourier series, we propose a novel Calibrated Harmonic Overlaid Implicit Neural Representation (CHOIR). Specifically, we utilize Coordinated Harmonic Superposition (CHS) to replace the conventional function composition used in most INRs, thereby ensuring optimization stability when scaling network depth. Furthermore, we introduce a Perceptual Spectrum Calibration (PSC) to mitigate spectrum bias. This calibration embeds the ubiquitous power-law spectrum prior of natural images and adjusts the globally fixed spectrum towards a physically plausible log-uniform distribution. Extensive experiments on various multidimensional data recovery problems demonstrate that our method achieves superior performance over state-of-the-art approaches. Code is available at this https URL.
69. 【2606.26762】ProtoKV: Streaming Video Understanding under Delayed Query with Summary-State Memory
Link: https://arxiv.org/abs/2606.26762
Authors: Le Tu Ngoc Minh (KAIST), Jinyeong Lim (KAIST), Dongsu Han (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Streaming video understanding, Streaming video, visual tokens stream, tokens stream continuously, video understanding
Comments: 20 pages, 4 figures, Accepted to ICML 2026
Abstract:Streaming video understanding (SVU) must answer queries that arrive asynchronously while visual tokens stream continuously under strict GPU-memory and query-time latency budgets. A key challenge is delayed query: decisive cues may appear briefly, yet many subsequent updates occur before the query arrives, increasing the risk that those cues are evicted or diluted under bounded memory. We propose ProtoKV, a constant-footprint SVU memory that represents far history as a fixed-capacity summary state rather than retaining token instances. ProtoKV keeps an exact near-window KV cache and aggregates older content into a semantic-spatial prototype bank with residual statistics. At query time, each prototype is exposed through a bounded pseudo-token interface that is drop-in compatible with standard attention. Under matched budgets and comparable query-time cost, ProtoKV improves accuracy by up to 12.5 points over token-retention baselines on SVU benchmarks in the long-delay regime, with gains that grow as query delay increases.
70. 【2606.26754】Capacity-Controlled Multi-View Stylization of 3D Gaussian Splatting
Link: https://arxiv.org/abs/2606.26754
Authors: Zhihao Wen, Yixin Yang, Bojian Wu, Yang Zhou, Dani Lischinski, Daniel Cohen-Or, Hui Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: viewpoints remains challenging, Gaussian Splatting, remains challenging, view synthesis, Splatting
Comments: Accepted to ECCV 2026. Project page: [this https URL](https://vcc2310.github.io/SceneStyler/)
Abstract:While 3D Gaussian Splatting (3DGS) provides an efficient and explicit representation for novel view synthesis, enforcing stylistic coherence across viewpoints remains challenging. Existing 3D stylization methods typically apply 2D feature-matching losses independently per rendered view, which leads to unstable style allocation, many-to-one feature reuse, and limited cross-view consistency. We propose a capacity-controlled framework for multi-view stylization of 3DGS, grounded in optimal transport. Specifically, we reformulate local style matching as a semi-balanced optimal transport problem. By introducing explicit column-capacity constraints with tunable strength, our formulation mitigates many-to-one matching and enables controllable allocation of style features. This transport-based objective provides a principled mechanism for balancing feature coverage and stylistic diversity while maintaining stable correspondences across viewpoints. To further enhance cross-view coherence, we incorporate a novel cross-view matching guidance to constrain correspondences between scene content and style patterns. In addition, we introduce several geometric regularizations to enhance the vanilla 3DGS, thereby enabling optimized Gaussian primitives to represent finer-grained textures during stylization. Extensive experiments demonstrate that our approach significantly improves multi-view stylistic consistency and produces stable, expressive 3D stylizations while preserving the core semantic structure of the scene.
71. 【2606.26743】Depth-Semantic Alignment and Affinity-Guided Fusion for Structured Radar Point Cloud Generation
Link: https://arxiv.org/abs/2606.26743
Authors: Amjad Hussain,Xin Qiu,Fuyuan Ai,Yuchen Tan,Zecheng Li,Chunyi Song,Wenjie Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Point clouds, radar point clouds, quality directly affects, point cloud generation, point cloud
Notes:
Abstract:Point clouds are an important carrier of three-dimensional spatial information, and their quality directly affects the performance of downstream perception tasks such as object detection and tracking. However, millimeter-wave radar point clouds are typically sparse, noisy, and structurally incomplete. To address these limitations, this paper proposes a multimodal point cloud generation method based on vision-radar fusion. The proposed method leverages image semantic information to impose structural constraints and achieve spatial alignment for radar point clouds, while incorporating a sparse completion strategy to enhance point density and recover missing structures. The generated point clouds are further evaluated in object detection and tracking tasks. Experimental results demonstrate that the proposed method effectively improves point cloud quality and enhances the detection accuracy and robustness of perception models in complex environments, providing a practical solution for multisensor point cloud generation and intelligent perception systems.
72. 【2606.26741】PressMimic: Pressure-Guided Motion Capture and Control for Humanoid Robot Imitation
Link: https://arxiv.org/abs/2606.26741
Authors: Yi Lu,Shenghao Ren,Tianyu Xiong,Zhaoxiang Li,Jiaqi Li,He Zhang,Tao Yu,Qiu Shen,Xun Cao
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Humanoid motion imitation, motion imitation requires, motion, faithful reproduction, Humanoid motion
Notes:
Abstract:Humanoid motion imitation requires not only accurate perception of human kinematics but also faithful reproduction of physical interactions with the environment. However, existing pipelines rely primarily on vision-based motion capture and kinematic imitation, largely ignoring contact dynamics, leading to artifacts such as foot sliding, floor penetration, and unstable behaviors. In this work, we revisit humanoid motion imitation from the perspective of physical grounding and leverage pressure as a unified modality across perception and control. We present PressMimic, a framework that integrates pressure into the full pipeline from motion capture to humanoid control. In the perception stage, we introduce FRAPPE++, a multimodal model that fuses RGB and pressure to jointly estimate 3D pose and global motion, where pressure provides explicit contact and support constraints to resolve ambiguity in vision-based estimation. In the control stage, we propose a pressure-supervised policy (PSP) that incorporates pressure-derived signals into reinforcement learning, enabling physically consistent contact patterns during execution. We further construct MotionPRO, a large-scale dataset with synchronized RGB, pressure, and motion capture data. Experiments show that pressure improves motion estimation accuracy, trajectory consistency, and execution stability. These results demonstrate that pressure serves as an effective physical grounding signal, bridging perception and control for physically consistent humanoid motion imitation.
73. 【2606.26740】LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing
Link: https://arxiv.org/abs/2606.26740
Authors: Xinyu Wang,Chongbo Zhao,Fangneng Zhan,Yue Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made rapid progress, low latency required, maintaining stable backgrounds, Streaming video editing, Streaming video
Notes: Accepted by ECCV 2026. Project page: [this https URL](https://live-edit.github.io)
Abstract:Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Meanwhile, recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to the strict preservation requirement and region-specific control. In this work, we present a novel streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. Our key design is a three-stage distillation pipeline that progressively transfers editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor, enabling stable long-horizon edits without sacrificing visual fidelity. To further support real-time deployment, we introduce an AR-oriented mask cache that reuses region-related computation across frames, substantially reducing redundant processing and accelerating inference. Finally, we establish a dedicated benchmark for streaming video editing. Extensive evaluations demonstrate that our method achieves state-of-the-art visual quality among streaming baselines while drastically boosting inference speed to 12.66 FPS, making it suitable for interactive and augmented reality applications.
74. 【2606.26738】Do Image Editing Models Understand Lighting?
Link: https://arxiv.org/abs/2606.26738
Authors: Tim Küchler,Johann-Friedrich Feiden,Matthias Nießner,Carsten Rother
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: stunning visual fidelity, achieved stunning visual, visual fidelity, recent advancements, advancements in generative
Notes:
Abstract:While recent advancements in generative image editing models have achieved stunning visual fidelity, it remains an open question whether these systems possess an intrinsic knowledge of real-world lighting. Existing benchmarks typically evaluate high-level plausibility of perceptual light transport on curated internet imagery, using VLMs or human judgement, or they rely on synthetically generated datasets. In this work, we introduce the 3D-anchored Light Probe (3DLP) benchmark, for which we have captured a new high-fidelity HDR dataset of real-world lighting changes. The dataset consists of 1K image pairs of diverse indoor scenery in which light probes are physically turned on and off. To allow for a granular performance analysis, we annotated specific image regions such as cast shadows or metallic surfaces. With this data, we evaluate a range of state-of-the-art image editing models by measuring how well their light probe edits align with reality. The evaluation uses two new scores to compensate for AI-generated photographic effects, such as adjusted white balance. Our results show that the overall performance of models differs considerably, with differences slightly less pronounced for specular highlights. The best image editing models are remarkably consistent with real-world physics; however, they still leave room for improvement. We observe that image regions that receive less light from the light probe are more prone to errors for all models. Furthermore, building on their success in evaluating macroscopic lighting plausibility, we test VLMs on our task but find that they are unsuitable for pixel-level light transport analysis. We will make the benchmark, together with the real-world dataset, publicly available to encourage future research on this topic.
75. 【2606.26734】Robust Onion: Peeling Open Vocab Object Detectors Under Noise
Link: https://arxiv.org/abs/2606.26734
Authors: Priyank Pathak,Mukilan Karuppasamy,Aaditya Baranwal,Shruti Vyas,Yogesh S Rawat
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Vocabulary Object Detectors, Open Vocabulary Object, Open Vocabulary, remains poorly understood, noise on Open
Notes: Accepted at The 19th European Conference on Computer Vision (ECCV)
Abstract:The impact of real-world noise on Open Vocabulary Object Detectors (OV-ODs) remains poorly understood due to their architectural complexity. We present Robust Onion, a comprehensive empirical study that uses controlled synthetic visual degradations to peel OV-ODs layer-by-layer, revealing how, why, and where robustness degrades through a systematic analysis of feature collapse. Our findings reveal that models with similar vision backbones exhibit comparable robustness, driven by similar feature collapse at similar layers, while factors such as pretraining strategy, architectural nuances, and caption supervision contribute little. Robustness is primarily governed by the image domain rather than annotations, explaining the similar robustness impact on COCO and LVIS, and why datasets like ODinW-13 can give an impression of inflated robustness due to large, isolated objects. Finally, we validate our insights by improving robustness on real-world BDD100K, WiderFace, and VisDRONE via our lightweight plug-and-play NN TK0 approach, using 96x fewer trainable parameters than end-to-end training. We also explain the robustness observations reported in prior work.
76. 【2606.26719】Full spectrum Unlearnable Examples via Spectral Equalization
Link: https://arxiv.org/abs/2606.26719
Authors: Jiale Cai,Gezheng Xu,Zhihao Li,Ruiyi Fang,Ruizhi Pu,Di Wu,Qicheng Lao,Charles Ling,Boyu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: extract exploitable representations, protect training data, injecting imperceptible perturbations, exploitable representations, data by injecting
Notes: To be published in ICML
Abstract:Unlearnable examples (UEs) protect training data by injecting imperceptible perturbations so that models fail to extract exploitable representations. In this paper, we reveal that existing UEs exhibit a critical failure once low-pass filtering is applied, indicating that the effective perturbation signals for unlearnability concentrate predominantly in high frequencies. Hence, we argue that reliable UEs should remain effective across the full spectrum. To this end, we propose Full-spectrum Unlearnable examples via Spectral Equalization (FUSE), which aims to generate spectrum-agnostic perturbations by equalizing the contributions from different bands and enforcing cross-band consistency. Specifically, FUSE adopts a Random Spectral Masking (RSM) strategy during generator training, which randomly removes a contiguous frequency band, forcing the remaining bands to maintain unlearnability. In addition, FUSE further integrates Cross-Band Guidance (CBG), which enforces mutual consistency between high- and low-frequency components, thereby further enhancing low-frequency unlearnability and regulating high-frequency perturbations to preserve the semantic fidelity of images. Extensive experiments across multiple datasets, architectures, and spectral filtering demonstrate the strong protection achieved by FUSE.
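The Random Spectral Masking step described in this abstract (dropping one contiguous frequency band so the remaining bands must carry the signal) can be sketched with a plain FFT. This is only an illustration under stated assumptions: the radial definition of a "band", the `band_width` fraction, and the function names `random_band_mask` and `apply_spectral_mask` are hypothetical and not taken from the paper.

```python
import numpy as np

def random_band_mask(shape, rng, band_width=0.2):
    """Boolean keep-mask that removes one contiguous radial frequency band.

    `band_width` is the band's share of the maximum radial frequency
    (an assumed parameterization).
    """
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)          # radial frequency of each FFT bin
    r_max = radius.max()
    lo = rng.uniform(0.0, 1.0 - band_width) * r_max
    hi = lo + band_width * r_max
    return ~((radius >= lo) & (radius < hi))     # True = keep this bin

def apply_spectral_mask(perturbation, keep):
    """Zero the masked bins in the Fourier domain and return to pixel space."""
    spectrum = np.fft.fft2(perturbation)
    return np.real(np.fft.ifft2(spectrum * keep))

rng = np.random.default_rng(0)
delta = rng.normal(size=(32, 32))                # stand-in for a perturbation
keep = random_band_mask(delta.shape, rng)
masked = apply_spectral_mask(delta, keep)
print("bins removed:", int((~keep).sum()), "of", keep.size)
```

Because the mask depends only on the radial frequency, it is symmetric under frequency negation, so the filtered perturbation stays real-valued.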
77. 【2606.26718】A Latent ODE Approach to Spatiotemporal Modeling of Cine Cardiac MRI
Link: https://arxiv.org/abs/2606.26718
Authors: David Brüggemann,Ekaterina Krymova,Firat Özdemir,Jochen von Spiczak,Sebastian Kozerke,Samia Mora,Robert Manka,Mathieu Salzmann,Olga V. Demler
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: magnetic resonance imaging, captures rich spatiotemporal, rich spatiotemporal information, selected cardiac phases, Cardiac magnetic resonance
Notes:
Abstract:Cardiac magnetic resonance imaging (CMR) captures rich spatiotemporal information about ventricular structure and motion, but conventional risk models use only a few image-derived indices from selected cardiac phases. We present a latent dynamical model that encodes bi-ventricular anatomy and full-cycle cine motion as a continuous latent trajectory, using heart-rate-aware neural ordinary differential equation (ODE) dynamics and a graph-based mesh autoencoder to reconstruct anatomically consistent 3D+t ventricular motion. A covariate-conditioned prior defines the expected end-diastolic latent state, and a Cox proportional hazards model tests whether deviations from this prior predict incident heart failure. We studied 72,386 UK Biobank participants without baseline cardiovascular disease, including 367 incident heart failure events. In a held-out evaluation subset, adding the latent score to refitted pooled cohort equations improved the stratified C-index from 0.704 to 0.785, compared with 0.764 for seven established cardiac markers. Compared with non-graph and non-ODE approaches, the proposed model gave the best trade-off between reconstruction fidelity, generative realism, and downstream prognostic performance. These results suggest that continuous full-cycle modeling of ventricular motion provides informative cardiac phenotypes beyond conventional CMR summaries, while external validation in more representative patient cohorts is required before clinical risk-prediction use.
78. 【2606.26715】Extracting Neural Materials from Multi-view Images
Link: https://arxiv.org/abs/2606.26715
Authors: Kim Youwang,Jon Hasselgren,Peter Kocsis,Andrea Weidlich,Tae-Hyun Oh,Jacob Munkberg
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: universal basis, reflections and scattering, Material Reconstruction Model, Neural materials, neural material
Notes: Project website: [this https URL](https://nvlabs.github.io/neumatex/)
Abstract:Neural materials can represent complex specular reflections and scattering effects in a compact, universal basis. However, acquiring and authoring such materials remains challenging. We present NeuMatEx, a differentiable inverse rendering method for extracting spatially varying neural materials from images. The nonlinear structure of neural material latent spaces makes optimization with naive inverse rendering infeasible. To address this, we train a Large Material Reconstruction Model (LMRM) that directly predicts initial base color, neural material latents, and aleatoric uncertainty guides from images. This material prior provides a good initialization and better constrains our subsequent optimization using inverse path tracing. The predicted uncertainty further helps by anchoring high-confidence regions more tightly to the LMRM prediction, preventing lighting and complex specular effects from being baked into materials. Experiments on synthetic and real assets show that NeuMatEx extracts complex materials with better visual quality and material decomposition than PBR-based methods.
79. 【2606.26711】Mask to Concept: Auto-Promptable SAM3 via Efficient Test-Time Concept Embedding Search for Few-Shot Annotation
Link: https://arxiv.org/abs/2606.26711
Authors: Quan Zhou,Shaoqing Zhai,Qiang Hu,Jia Chen,Qiang Li,Zhiwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Transforming foundation segmentation, Transforming foundation, foundation segmentation models, models from human-prompted, human-prompted tools
Notes: Accepted by MICCAI 2026
Abstract:Transforming foundation segmentation models from human-prompted tools into auto-promptable annotators is critical for scalable medical data annotation. Current methods commonly depend on external feature matchers or auxiliary networks to automate geometric prompting, which introduces architectural overhead and limits performance scalability. Although SAM3 natively supports concept segmentation via reusable text prompts, its direct use in medical imaging is hindered by a lack of fine-grained clinical knowledge and the ambiguity of human-written descriptions. In this work, we propose Mask to Concept (M2C), an efficient framework that adapts SAM3 for medical few-shot annotation without external modules, parameter retraining, or manual text engineering. Using only a few labeled images, M2C enables SAM3 to automatically search for transferable visual concepts entirely within its frozen architecture: it initializes a learnable concept embedding, uses it to prompt segmentation, and updates the embedding with gradients that minimize the concept segmentation error. We further introduce a Hybrid Uncertainty Estimation (HUE) module that calculates the prediction entropy and maps concept predictions back to the box prompts, measuring concept-geometry prompting inconsistency. Highly uncertain samples are actively flagged for human correction, and the corrected masks are then fed back to M2C to continuously search for more precise concept embeddings, forming a self-enhancing annotation loop with minimal expert effort. Experiments on medical segmentation benchmarks show that our method achieves SOTA few-shot segmentation performance and outstanding annotation efficiency, offering a practical and efficient pathway toward scalable medical image labeling. Code is available at this https URL.
80. 【2606.26706】Intracranial Aneurysm Classification and Segmentation via Tri-Axial ROI and Multi-Task Learning
Link: https://arxiv.org/abs/2606.26706
Authors: Pengcheng Shi,Kaiyuan Yang,Houjing Huang,Jiawei Chen,Yan Lu,Jiaqi Liu,Murong Xu,Bjoern Menze,Xinglin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: carries high mortality, high mortality, carries high, Intracranial Aneurysm Detection, Intracranial aneurysms
Notes:
Abstract:Intracranial aneurysms are often asymptomatic until rupture, which carries high mortality. Rupture risk assessment and treatment planning depend on both aneurysm morphology and anatomical location, yet existing automated methods remain limited to binary detection without fine-grained anatomical classification or multi-class segmentation. We present a multi-task framework that simultaneously performs multi-label classification, multi-class aneurysm segmentation, and multi-class vessel segmentation across 13 anatomical locations and four imaging modalities (CTA, MRA, T2, T1-post). Our two-stage approach combines a fast 2D tri-axial Region of Interest (ROI) extraction method with a 3D multi-task nnU-Net backbone. A dual-decoder design mitigates the extreme volume imbalance between aneurysm and vessel classes, while cross-attention pooling and modality-specific auxiliary heads improve feature learning across heterogeneous inputs. Our two-fold ensemble achieved 2nd place in the RSNA 2025 Intracranial Aneurysm Detection challenge. Code, model weights, and a 3D Slicer plugin are publicly available.
81. 【2606.26694】PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
Link: https://arxiv.org/abs/2606.26694
Authors: Bin Hu,Yanwen Ma,Jiehui Huang,Ziliang Zhang,Haoning Wu,Ruicheng Zhang,Yaokun Li,Zijun Wang,Yuechen Zhang,Chun-Mei Tseng,Hanhui Li,Shengju Qian,Jun Zhou,Kaipeng Zhang,Xiaodan Liang,Jiaya Jia,Xiu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: synthesize visually plausible, Recent game world, Recent game, visually plausible, synthesize visually
Notes: Project page: [this https URL](https://yizhiqianbi.github.io/physeditworld/)
Abstract:Recent game world models can synthesize visually plausible, action-conditioned rollouts. However, their interaction behaviors often remain limited to exploratory or wandering trajectories, and physical dynamics are typically learned as implicit correlations from data rather than as controllable variables. This limitation hinders their applicability to authored game environments, where physical rules are deliberately designed and require explicit manipulation. We introduce PhysEditWorld, a multimodal dataset with physical parameters, with a primary focus on gravity in this initial version. At its core, PhysEditWorld is built upon a replay paradigm implemented with a UE5 replay-and-rendering pipeline. Each scenario records a normalized action trace and replays the same initial state, character controller, action sequence, and camera policy under multiple gravity configurations, enabling controlled and attributable physical variation. PhysEditWorld contains 12 cinematic UE5 scenes, over 100 hours of gameplay interactions, and more than 60 million rendered rollout frames. Each sample provides synchronized multimodal signals, including RGB, depth, normals, audio, action traces, camera trajectory, engine states, semantic annotations, and explicit gravity labels. We further conduct initial utility studies on both generative video models and world understanding models, demonstrating that PhysEditWorld enables improved gravity-faithful dynamics modeling, enhances consistency under physical edits, and provides a scalable foundation for controllable world modeling research.
82. 【2606.26687】DeCoFlow: Structural Decomposition of Normalizing Flows for Continual Anomaly Detection
Link: https://arxiv.org/abs/2606.26687
Authors: Hun Im,Jungi Lee,Subeen Cha,Pilsung Kang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: categories arrive sequentially, requiring continual anomaly, product categories arrive, continual anomaly detection, industrial environments
Notes:
Abstract:In industrial environments, new product categories arrive sequentially, requiring continual anomaly detection without access to past data. Normalizing Flows (NFs) provide exact density estimation but suffer from catastrophic forgetting as parameter updates across tasks distort the density manifold. While parameter isolation can prevent interference, it must preserve the strict invertibility and Jacobian validity of NFs. To satisfy these requirements, we exploit the inherent property that affine coupling layers maintain transformation validity regardless of subnet parameterization. Based on this, we propose DeCoFlow, which decomposes subnets into a frozen universal base and task-specific low-rank adapters to isolate updates. We further introduce Task-Specific Alignment, Auxiliary Coupling Layers, and Tail-Aware Loss to compensate for frozen-base rigidity. DeCoFlow achieves state-of-the-art image-level AUROCs of 98.40% on MVTec-AD and 93.00% on VisA, while maintaining parameter-level zero forgetting (0.00% FM under correct routing) with only 2.27M parameters per task.
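The property this abstract relies on, that an affine coupling layer stays invertible no matter how its subnet is parameterized, can be checked with a small NumPy sketch. Here the subnet is a frozen base matrix plus a low-rank task adapter. The class names, the single linear layer used as the subnet, and the `tanh` scale bound are assumptions chosen for brevity, not DeCoFlow's actual architecture.

```python
import numpy as np

class LowRankAdaptedLinear:
    """Linear map whose weight is a frozen base W plus a low-rank update A @ B."""
    def __init__(self, d_in, d_out, rank, rng):
        self.W = rng.normal(scale=0.3, size=(d_out, d_in))   # shared, frozen base
        self.A = rng.normal(scale=0.3, size=(d_out, rank))   # task-specific adapter
        self.B = rng.normal(scale=0.3, size=(rank, d_in))    # task-specific adapter

    def __call__(self, x):
        return x @ (self.W + self.A @ self.B).T

class AffineCoupling:
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); invertible for any s and t."""
    def __init__(self, dim, rank, rng):
        self.half = dim // 2
        self.scale_net = LowRankAdaptedLinear(self.half, dim - self.half, rank, rng)
        self.shift_net = LowRankAdaptedLinear(self.half, dim - self.half, rank, rng)

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s = np.tanh(self.scale_net(x1))          # bounded log-scale
        t = self.shift_net(x1)
        y2 = x2 * np.exp(s) + t
        return np.concatenate([x1, y2], axis=1), s.sum(axis=1)  # output, log|det J|

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s = np.tanh(self.scale_net(y1))
        t = self.shift_net(y1)
        return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=1)

rng = np.random.default_rng(0)
layer = AffineCoupling(dim=6, rank=2, rng=rng)
x = rng.normal(size=(5, 6))
y, log_det = layer.forward(x)
x_rec = layer.inverse(y)

# Swap in a different task adapter; the frozen base W is untouched.
layer.scale_net.A = rng.normal(scale=0.3, size=layer.scale_net.A.shape)
y_new, _ = layer.forward(x)
x_rec_new = layer.inverse(y_new)
print("max reconstruction error:", np.abs(x_rec_new - x).max())
```

Replacing the adapter changes the density the layer models but not its invertibility, which is why updates can be isolated per task without breaking the flow.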
83. 【2606.26668】Disco-LoRA: Disentangled Composition of Content, Style, and Motion for Multi-concept Video Customization
Link: https://arxiv.org/abs/2606.26668
Authors: Xuancheng Xu,Gengyun Jia,Bing-Kun Bao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: learn specific features, Video customization based, models aims, Video customization, multi-concept video customization
Notes:
Abstract:Video customization based on Text-to-Video (T2V) models aims to learn specific features from reference data to generate controllable videos. While significant strides have been made in image stylization and video motion customization, simultaneously controlling multiple concepts, such as content, style, and motion, remains a major challenge. In this work, we systematically define the task of multi-concept video customization, which requires the joint control of content, style, and motion. To facilitate research in this area, we construct a comprehensive benchmark and propose Disco-LoRA, a unified framework designed to tackle this problem by disentangling and flexibly recombining different concepts in two stages: (1) We decompose the objective into two sub-tasks: Content-Style and Content-Motion. Each sub-task is addressed using our Iterative Dual-LoRA Disentanglement Framework, which effectively disentangles distinct concepts within the data. (2) We identify layer-wise weight trends as crucial for LoRA identity, while weight magnitudes dictate composability. To harmonize these scales, we propose a Z-score-based statistical regularization that aligns weight distributions, preserving layer-wise trends while minimizing interference between different LoRAs. Extensive experiments show that Disco-LoRA excels in multi-concept video customization, effectively preserving appearance, style, and motion for controllable text-to-video generation.
84. 【2606.26647】LayersReg: A Layer-by-Layer Progressive Regressor for Reliable Intraoperative 3D/2D Registration
Link: https://arxiv.org/abs/2606.26647
Authors: Xiyuan Wang,Zhenchao Wang,Xinran Chen,Junkai Liu,Chuan Chen,Feng Yin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: surgical navigation, cornerstone technique, technique in surgical, spatial pose, iterative optimization
Notes:
Abstract:3D/2D registration serves as a cornerstone technique in surgical navigation. Traditional iterative optimization algorithms suffer from low efficiency and high failure rates in intraoperative settings. Deep learning-based methods reformulate registration from iterative optimization to a regression problem that maps image appearance features to spatial pose, typically achieving improved real-time performance and accuracy. However, such learnable methods are confined to memory-driven retrieval of specific pose features rather than understanding the task of image alignment itself, which limits their generalization in complex scenarios. We propose LayersReg, a pioneering regression paradigm that endows the model with 3D anatomical awareness and searches for the correct pose in a progressive, layer-by-layer manner. Inspired by the iterative pose-searching optimization criterion of classical registration, LayersReg searches for correlations between the moving and fixed images in feature space, capturing the trend of pixel flow and thereby converging iteratively toward the correct spatial pose transformation. We further design a coupling of node-wise regression with the progressive registration framework to enhance the model's perception of spatial pose changes. Experimental results demonstrate that under large offsets and multimodality conditions, LayersReg achieves high accuracy on both X-ray/CT registration (0.68°, 1.41 mm) and slice localization (0.73°, 1.55 mm) tasks, outperforming existing state-of-the-art methods while meeting the intraoperative demands for precision and real-time capability.
85. 【2606.26636】FracEvent: Event-Camera Simulation via Fractional-Relaxation Pixel Dynamics
Link: https://arxiv.org/abs/2606.26636
Authors: Langyi Chen,Chuanzhi Xu,Haoxian Zhou,Pengfei Ye,Ziyu Luo,Haodong Chen,Qiang Qu,Xiaoming Chen,Weidong Cai
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords: cameras asynchronously report, asynchronously report brightness, data remain difficult, microsecond-level temporal resolution, Event cameras asynchronously
Notes:
Abstract:Event cameras asynchronously report brightness changes with microsecond-level temporal resolution, but real event data remain difficult to collect at scale because specialized sensors, careful synchronization, and task-specific annotations are required. Event-camera simulation is therefore important to event-based vision tasks. Most practical simulators build on contrast-threshold event generation, some with additional filtering, stochastic noise, or hand-tuned sensor parameters. While effective, such formulations often simplify the temporal structure produced by the lifecycle of each pixel, which can distort event timing and weaken downstream transfer. We introduce FracEvent, an event simulator that models this pixel-level lifecycle with fractional-relaxation voltage dynamics. Given a log-intensity trajectory, FracEvent drives a compact stack of relaxation modes, combines their responses into a voltage state, emits ON/OFF events by localizing threshold crossings on the continuous voltage trajectory, and updates the reference while retaining the underlying memory modes. This retained state links residual voltage response to later event timing. We evaluate FracEvent through event-stream comparison and downstream transfer on image reconstruction and optical flow estimation. Across multiple datasets, FracEvent improves the temporal structure of generated events and achieves stronger downstream-transfer results than competing simulator baselines, showing its practical value for event-camera simulation.
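For context, the plain contrast-threshold model that this abstract says most simulators build on can be written in a few lines for a single pixel. The sketch below covers only that baseline, with linear interpolation of crossing times between samples. It does not model FracEvent's fractional-relaxation voltage dynamics, and the function name and `threshold` value are assumptions.

```python
import numpy as np

def contrast_threshold_events(times, log_intensity, threshold=0.25):
    """Baseline single-pixel event generation.

    Emits (timestamp, polarity) whenever the log intensity moves `threshold`
    away from the reference level, then shifts the reference by one threshold.
    """
    events = []
    ref = log_intensity[0]
    for k in range(1, len(times)):
        t0, t1 = times[k - 1], times[k]
        l0, l1 = log_intensity[k - 1], log_intensity[k]
        # A large change inside one sample interval can trigger several events.
        while abs(l1 - ref) >= threshold:
            polarity = 1 if l1 > ref else -1
            level = ref + polarity * threshold
            frac = (level - l0) / (l1 - l0)      # linear crossing-time estimate
            events.append((t0 + frac * (t1 - t0), polarity))
            ref = level
    return events

times = np.array([0.0, 1.0, 2.0])
log_intensity = np.array([0.0, 0.9, 0.1])        # brighten, then darken
events = contrast_threshold_events(times, log_intensity)
for timestamp, polarity in events:
    print(f"t={timestamp:.4f}  {'ON' if polarity > 0 else 'OFF'}")
```

In this baseline the only state carried between events is the reference level, which is the kind of simplification of per-pixel temporal structure the abstract argues against.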
86. 【2606.26634】Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions
Link: https://arxiv.org/abs/2606.26634
Authors: Garam Kim,Juyoun Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: prohibitive labeling costs, Effective multi-task learning, annotation granularity mismatch, selected keyframes due, Effective multi-task
Notes: 17 pages, 16 figures
Abstract:Effective multi-task learning for surgical scene understanding is fundamentally hindered by annotation granularity mismatch: temporal workflow tasks such as phase recognition, step recognition, and anticipation benefit from dense frame-level supervision, whereas pixel-level spatial tasks including instrument segmentation and action recognition are only sparsely annotated on selected keyframes due to prohibitive labeling costs. This supervision imbalance undermines shared representation learning and limits joint optimization across heterogeneous surgical tasks. To address this, we propose Flow-guided Annotation for Robust Operating Scenes (FAROS), a flow-guided label interpolation framework that combines zero-shot segmentation-based mask propagation with optical flow estimation to overcome the limitations of appearance-based propagation under challenging surgical conditions such as occlusion, smoke, and motion blur, generating temporally consistent dense pseudo labels from sparse keyframe annotations. The densified instrument masks and action labels are integrated into a unified Transformer-based multi-task framework that jointly learns surgical phase recognition, step recognition, anticipation, instrument segmentation, and action recognition, enabling balanced optimization between dense temporal supervision and sparse spatial supervision. The label interpolation quality of FAROS is first validated on the DAVIS 2017 benchmark under a sparse ground-truth protocol, confirming robust propagation beyond the surgical domain. Extensive experiments on GraSP, MISAW, and AutoLaparo benchmarks further demonstrate that FAROS significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks.
87. 【2606.26631】Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning
Link: https://arxiv.org/abs/2606.26631
Authors: Mengzhao Wang,Yanli Ji,Wangmeng Zuo,Peng Ye,Chongjun Tu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: repeatedly forwarding selected, existing methods typically, methods typically rely, selected visual tokens, Interleaved multimodal reasoning
Notes:
Abstract:Interleaved multimodal reasoning improves visual grounding by revisiting visual evidence during multi-step generation, yet existing methods typically rely on token replay, repeatedly forwarding selected visual tokens. A natural shortcut is to reuse the historical visual key-value (KV) cache directly. However, we identify a critical failure mode of this strategy: cached visual keys are already bound to their original positional context. Such stale positional binding distorts attention under later decoding contexts and can trigger severe autoregressive decoding collapse. This failure suggests that effective cache reuse requires reconstructing visual evidence under positions compatible with the current decoding state, rather than directly copying position-bound historical cache entries. To this end, we propose Position Rebinding Cache Reuse (PRCR), a cache-level framework for replay-free visual revisiting. PRCR stores raw visual KV cache together with their original spatial coordinates, then reassigns position-compatible coordinates to select entries and rebinds their keys before injecting the reconstructed cache into the active decoder cache. This design reuses historical visual evidence while preserving textual positional continuity and relative visual structure. Experiments across multiple multimodal reasoning benchmarks show that PRCR achieves replay-level or better performance, improving average accuracy by 5 percent and reducing visual-revisiting computation by up to tens of thousands of times.
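The stale-binding problem in this abstract can be illustrated with rotary position embeddings (RoPE), where the attention score depends only on the offset between query and key positions. The sketch assumes a one-dimensional RoPE and hypothetical positions. The abstract mentions spatial coordinates, so the paper's positional scheme is likely richer, and this is not the PRCR implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector `x` (even dimension) to position `pos` with 1D RoPE."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q_raw = rng.normal(size=8)
k_raw = rng.normal(size=8)       # raw (un-rotated) visual key kept in the cache

# Original context: query at position 10 attends to the key at position 8.
score_reference = rope(q_raw, 10) @ rope(k_raw, 8)

# Later decoding step at position 100.
score_stale = rope(q_raw, 100) @ rope(k_raw, 8)      # key still bound to 8
score_rebound = rope(q_raw, 100) @ rope(k_raw, 98)   # key re-bound to offset 2

print("reference:", score_reference)
print("stale    :", score_stale)
print("rebound  :", score_rebound)
```

Keeping the raw key alongside its original coordinate is what makes the rebinding cheap here: only a rotation is recomputed, not the forward pass that produced the key.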
88. 【2606.26615】TaskTok: Delving into Task Tokens for Task-driven Image Restoration
Link: https://arxiv.org/abs/2606.26615
Authors: Hongjae Lee,Sojung Kang,Jaeseong Yu,Seung-Won Jung
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: image restoration focuses, traditional image restoration, downstream high-level vision, restoration focuses, image restoration
Notes: ECCV 2026
Abstract:While traditional image restoration focuses on perceptual quality, Task-Driven Image Restoration (TDIR) aims to maximize the performance of downstream high-level vision tasks. Recent approaches leveraging generative priors have shown promise for TDIR; however, they typically suffer from computational inefficiency and potential semantic alteration by indiscriminately updating all latent tokens. In this paper, we posit that not all visual information is equally important for machine perception. Through an analysis of the latent token space, we observe that task-relevant cues are unevenly distributed across the token sequence, exhibiting index-wise specialization. This suggests that selectively refining a subset of tokens can be sufficient for task-driven objectives. Leveraging this insight, we propose TaskTok, a novel framework that selectively restores only task-relevant tokens via a learnable token switch and a lightweight token refinement module. Extensive experiments across image classification, semantic segmentation, and object detection demonstrate that TaskTok significantly enhances task performance with high computational efficiency. The source code is available at this https URL
89. 【2606.26609】LogicIR: Logic Gate Networks for Image Restoration
Link: https://arxiv.org/abs/2606.26609
Authors: Hongjae Lee,Myungjun Son,Jaeseong Yu,Seung-Won Jung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: degraded low-quality inputs, reconstruct high-quality images, Image restoration, low-quality inputs, Image restoration aims
Comments: ECCV 2026
Click to view abstract
Abstract:Image restoration aims to reconstruct high-quality images from degraded low-quality inputs. As the computational demands of image restoration models continue to rise, there is growing interest in lightweight architectures optimized for fast and efficient inference. Logic gate networks (LGNs), which operate using fundamental logic operations such as NAND and XOR, have recently emerged as a promising direction for achieving highly efficient computation. However, their potential remains largely untapped in the domain of image restoration. In this work, we introduce LogicIR, the first LGN specifically designed for image restoration tasks. LogicIR incorporates a UNet-inspired architecture composed entirely of logic gates. In addition, we propose a differentiable bit decoding layer and an index shuffling mechanism that improves information propagation across logic gates. Experimental results across multiple image restoration benchmarks demonstrate that LogicIR achieves strong performance with significantly reduced computational cost, establishing LogicIR as a viable and efficient alternative for image restoration. The source code is available at this https URL
90. 【2606.26602】DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues
Link: https://arxiv.org/abs/2606.26602
Authors: Geng Li,Yuxin Peng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Recent advancements
Comments: Accepted by ECCV 2026. Project page with code: [this https URL](https://github.com/PKU-ICST-MIPL/DICO_Bench_ECCV2026)
Click to view abstract
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive fine-grained perception capabilities. However, existing benchmarks predominantly rely on explicit textual cues or low-resolution inputs, failing to evaluate a model's ability to autonomously perceive implicit visual cues in high-resolution. To bridge this gap, we introduce DiCoBench, a comprehensive, multi-image high-resolution benchmark designed for cross-image fine-grained perception. DiCoBench consists of 765 meticulously curated samples categorized into two progressive tracks: Differential Visual Cues and Commonality Visual Cues, covering 8 distinct perception tasks. By formulating the benchmark as a multiple-choice question task and utilizing high-resolution imagery (approaching 2K), we eliminate evaluation metric bias and pose a substantial challenge to current state-of-the-art MLLMs. Our extensive evaluation of 18 diverse MLLMs reveals a striking performance gap compared to human accuracy (98.3%), with top-performing models struggling significantly with micro-scale detail capture. We believe DiCoBench will serve as a challenging testbed to drive future research in autonomous, high-resolution multi-image perception.
91. 【2606.26559】SpaceRipple: Lightweight Semantic Delivery for Mission-Oriented LEO Earth Observation Satellite Networks
Link: https://arxiv.org/abs/2606.26559
Authors: Ziyi Yang,Hao Yuan,Yunxiang Yi,Wenbo Wang,Xing Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Keywords: generate massive volumes, networks generate massive, Earth observation satellite, resources remain limited, Earth observation
Comments:
Click to view abstract
Abstract:Earth observation satellite networks generate massive volumes of high-resolution imagery, whereas inter-satellite and downlink resources remain limited. In many time-sensitive missions, ground users require mission-relevant semantic information rather than a full raw-image downlink. This paper proposes SpaceRipple, a lightweight framework for mission-oriented semantic delivery and on-board processing in Earth observation satellite networks. A sensing satellite performs adaptive compression and metadata generation to reduce inter-satellite traffic, while an edge computing satellite restores the received representation and extracts task-relevant semantic information. Unlike fidelity-driven image transmission, SpaceRipple coordinates compression, forwarding, restoration, and semantic inference within a collaborative pipeline, enabling semantic-oriented delivery instead of pixel-level image delivery. A compression-aware MoE enhancement module is further introduced to improve robustness under degraded visual inputs. Experimental results show that SpaceRipple achieves favorable reconstruction quality, improved semantic detection performance, and substantial bandwidth savings, demonstrating its potential for efficient and reliable Earth observation under constrained satellite-network resources.
92. 【2606.26557】Coarse-to-Fine: A Hybrid Self-Supervised Method for Non-rigid 3D Shape Matching
Link: https://arxiv.org/abs/2606.26557
Authors: Feifan Luo,Ting Li,Zhao Li,Hongyang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: shape matching, vision and graphics, fundamental task, task in computer, computer vision
Comments:
Click to view abstract
Abstract:Non-rigid 3D shape matching is a fundamental task in computer vision and graphics. In this paper, we propose a hybrid self-supervised method based on a coarse-to-fine strategy, which ensures consistency between the coarse mapping and the refined correspondence produced by our refinement module. The architecture features a dual-branch design, consisting of two symmetric functional map learning streams: one based on the Laplacian basis and the other utilizing the elastic basis. Extensive experiments show that our approach not only maintains computational efficiency, but also achieves state-of-the-art performance across a variety of challenging scenarios, including non-isometric deformations and topological noise. Finally, we rigorously demonstrate that contrastive energies promote feature discrimination. Furthermore, integrating these energies with existing methods yields consistent improvements, validating the overall efficacy of our approach. Our code is available at this https URL.
93. 【2606.26552】Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection
Link: https://arxiv.org/abs/2606.26552
Authors: Yangjun Wu,Keyu Yan,Yu Liu,Jingren Zhou,Fei Huang,Rong Zhang,Zhou Zhao,Fei Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: generative models presents, Large Language Models, deepfake detection methods, Multimodal Large Language, highly realistic AI-generated
Comments: 10 pages
Click to view abstract
Abstract:The rapid advancement of generative models presents a significant challenge to existing deepfake detection methods, particularly given the widespread dissemination of highly realistic AI-generated images. Although Multimodal Large Language Models (MLLMs) show strong potential for this task, existing approaches suffer from two key limitations: insufficient sensitivity to fine-grained forensic artifacts and reliance on static synthetic supervision from frontier models, leading to limited flexibility and high-cost. To address these issues, we propose ForeAgent, an agentic forensics framework for AI-generated image detection with iterative self-evolution. First, ForeAgent adopts a Perception-Verdict architecture that aggregates multi-view cues spanning semantic, spatial, and frequency-domain features, and leverages an MLLM as a verdict module to fuse these signals for a logical-grounded verdict. Second, to enable continual self-improvement, we introduce a Hindsight-Driven Self-Refining strategy following a Sampling-Reflection-Evolution paradigm. The agent performs inference rollouts on training instances. Guided by ground-truth labels as hindsight, it reflects on failure cases and low-quality reasoning trajectories to regenerate higher-quality reasoning traces. These synthesized samples are then strictly filtered through a dual-expert quality gating module. ForeAgent continuously evolves via fine-tuning on self-curated high-quality samples. Extensive experiments demonstrate that ForeAgent achieves state-of-the-art performance on the Chameleon benchmark, reaching 82.18% accuracy (+16.41% over AIDE), and achieves 93.3% mean accuracy on AIGCDetect-Benchmark across 16 generators. In addition, external evaluation shows that ForeAgent produces more consistent and causally grounded reasoning compared to GPT-5 and GPT-5-mini.
94. 【2606.26551】PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
Link: https://arxiv.org/abs/2606.26551
Authors: Shengbin Guo,Shaokang He,Chaoyue Meng,Shengpeng Xiao,Xunzhi Xiang,Shaofeng Zhang,Qi Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing benchmarks lack, handling real-world scenarios, multi-modal generative models, enabled by multi-modal, advanced significantly
Comments: 19 pages, 6 figures, 2 tables. Accepted to ECCV 2026
Click to view abstract
Abstract:While instruction-based image editing, enabled by multi-modal generative models, has advanced significantly, existing benchmarks lack a comprehensive evaluation of physics-based reasoning, a critical capability for handling real-world scenarios. To address this, we introduce PhyEditBench, a benchmark designed to assess the physical understanding of editing models. Guided by a hierarchical taxonomy, we establish 4 primary classes and 12 subclasses. It comprises 238 high-quality, high-resolution, real-world instances meticulously extracted from videos to capture authentic physical dynamics, alongside 35 synthetic Anti-Physics instances. Our empirical analysis of current SOTA editing methods exposes substantial limitations in their physics-based reasoning. We further propose a training-free baseline named PhyWorld that uses test-time scaling and a latent reduction strategy. PhyWorld outperforms comparable models and suggests that the video generation process can effectively serve as a reasoning mechanism for image editing. The project page is available at this https URL.
95. 【2606.26535】From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP
Link: https://arxiv.org/abs/2606.26535
Authors: Zhixing Li,Yinan Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Current VLM evaluations, Current VLM, VLM evaluations, genuine spatial reasoning, Current
Comments: Accepted to ECCV 2026
Click to view abstract
Abstract:Current VLM evaluations often conflate language priors with genuine spatial reasoning. To address this, we introduce CRISP, a novel structural-diagnostic evaluation paradigm that assesses visual spatial intelligence through consistency, the alignment between implicit perception and explicit reasoning. Unlike traditional black-box QA, CRISP utilizes metric 3D Scene Graphs and an oracle intervention protocol to decouple latent reasoning capabilities from perceptual bottlenecks. This granular diagnosis uncovers a systematic perception-reasoning disconnect. Crucially, we reveal that while proprietary models possess robust latent reasoning engines, they suffer from inaccurate metric estimation and a critical failure to leverage their implicit structural representations. Conversely, open-source models remain fundamentally bottlenecked by their lack of multi-hop compositional reasoning. By shifting the focus from merely "guessing correctly" via language priors to genuinely "perceiving, verifying, and reasoning," CRISP offers a rigorous roadmap for multimodal alignment beyond end-to-end post-training. The code and dataset are available at this https URL.
96. 【2606.26529】The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report
Link: https://arxiv.org/abs/2606.26529
Authors: Kwan Soo Shin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: told to find, accidents often arise, detects the hazards, model detects, model
Comments: 20 pages, 8 figures. Reproducibility deposit: [this https URL](https://doi.org/10.5281/zenodo.20826824)
Click to view abstract
Abstract:AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified. We show that conditioning a language or vision model on a narrow task suppresses its reporting of co-present, safety-critical signals it can otherwise report, a machine analogue of human inattentional blindness arising from a different mechanism. Across radiology and driving text scenarios and chest-radiograph vision tasks, suppression appeared in every model tested, did not diminish with scale, persisted in a reasoning model, and varied more by model family than by size, while the same models reported these signals at substantially higher rates when unconstrained. We name this dissociation the Inattentional Gap and argue that it decouples measured benchmark safety from real-world safety: a system can score near-perfectly on the hazards an evaluation specifies while remaining blind to those that cause harm.
97. 【2606.26515】Forget, Anticipate and Adapt: Test Time Training for Long Videos
Link: https://arxiv.org/abs/2606.26515
Authors: Rajat Modi,Sebastian Noel,Xin Liang,Yogesh Singh Rawat
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Test Time Training, Test Time, Time Training, performing some self-supervised, test-sample by performing
Comments: ECCV 2026. GLOM/APM's temporal binding now works for long videos
Click to view abstract
Abstract:Test Time Training (TTT) is a mechanism in which a model adapts to an incoming test sample by performing a self-supervised (SSL) task and updating its weights during inference. This procedure does not require labels at test time. This paper focuses on TTT for long videos. Existing approaches have two major problems: 1) they perform TTT updates using a sliding window of past frames, whose compute increases linearly with the window size, which becomes computationally intractable when the videos are hours long; 2) they perform TTT even when temporally close frames look similar, thereby consuming a lot of compute. We present the Frame Forgetting Network (FFN), which: 1) operates on only three frames within the sliding window, namely the frame that exits, the current frame, and the frame after that, yet still manages to retain temporal context and work on hours-long videos; 2) uses a mathematically defined surprise metric: how much new information the incoming frame contains with respect to the previously seen frame. This metric determines how to modify the effective window size during TTT and constitutes the core mechanism of an adaptive windowing algorithm. Additionally, we curate a dataset, EpicTours, containing videos of walking city tours up to 3 hours long, whereas earlier datasets on this problem were only 5 minutes long. We demonstrate FFN's empirical effectiveness on dense segmentation, video classification tasks, generalization to depth estimation, and multi-hour-long videos.
98. 【2606.26508】Budget-Aware Keyboardless Interaction
Link: https://arxiv.org/abs/2606.26508
Authors: Quang-Thang Nguyen,Gia-Phuc Song-Dong,Minh-Triet Tran,Trung-Nghia Le
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: seeking greater mobility, computers typically relies, traditional input devices, users seeking greater, greater mobility
Comments: SOICT 2024
Click to view abstract
Abstract:Interacting with computers typically relies on traditional input devices such as keyboards, mice, and monitors, which can be cumbersome for users seeking greater mobility. Virtual keyboards have been explored to address these limitations, but they often involve complex setups or expensive equipment. This paper proposes a novel virtual keyboard system that leverages only a standard camera and a sheet of paper with a printed keyboard layout. Unlike previous methods requiring complex calibration or special lighting conditions, our approach works in standard environments using modern computer vision technologies. Combining modern segmentation and detection models with traditional image processing algorithms, we efficiently identify the keyboard region. Touch detection is performed using an algorithm that analyzes the color of the user's fingernail. Experiments demonstrate promising results for our proposed keyboard and keystroke detection solution in practical applications. Participants in our user study also found the proposed system interesting.
99. 【2606.26507】DanceDuo: Bridging Human Movement and AI Choreography
Link: https://arxiv.org/abs/2606.26507
Authors: Gia-Cat Bui-Le,Tuong-Vy Truong-Thuy,Hai-Dang Nguyen,Trung-Nghia Le
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: revolutionized music-driven dance, recent years, advancements in deep, deep learning, learning and generative
Comments: SOICT 2024
Click to view abstract
Abstract:In recent years, advancements in deep learning and generative models have revolutionized music-driven dance generation. This paper introduces a novel platform, namely DanceDuo, leveraging diffusion models to generate AI-choreographed dance sequences synchronized with a variety of music genres, to encourage dancing practice. The system allows users to interact with AI by selecting music tracks, humanoid models, and importing personal dance videos for comparison, fostering a rich and engaging user experience. DanceDuo not only offers dance generation but also integrates human pose estimation models to provide users with insightful comparisons of their own performances with AI-generated sequences. We conducted a comprehensive user study, revealing that users found the interface intuitive, with particular praise for the dance comparison feature. Our DanceDuo contributes significantly to the integration of AI in dance choreography, offering novel avenues for both recreational and professional applications.
100. 【2606.26455】Active Adversarial Perturbation-driven Associative Memory Retrieval for RGB-Event Visual Object Tracking
Link: https://arxiv.org/abs/2606.26455
Authors: Xiao Wang,Xufeng Lou,Zikang Yan,Lan Chen,Sibao Chen,Yaowei Wang,Yonghong Tian,Jin Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: fusing RGB appearance, RGB appearance textures, dense temporal motion, temporal motion cues, tracking improves localization
Comments:
Click to view abstract
Abstract:RGB-Event tracking improves localization robustness by fusing RGB appearance textures and dense temporal motion cues from event sensors. While this multi-modal scheme broadens tracking applicability, real-world scenes suffer diverse structured signal degradations that hinder traditional multi-modal fusion. In harsh environments, either modality can lose reliability drastically, and targets frequently appear incomplete due to occlusion, edge truncation and foreground [...]. To tackle the above challenges, we present a hierarchical perturbation and retrieval framework tailored for RGB-Event tracking with robustness against partial target missing and modal degradation, termed APRTrack. To mimic real-world signal corruption, APRTrack constructs structured degradation via two adversarial perturbation branches at the modality and spatial levels, which separately simulate full-modal failure and localized target region absence. A hierarchical routing mechanism is designed to disentangle the training pipelines of the two perturbation types, effectively eliminating feature collapse induced by superimposed degradation constraints. Furthermore, we devise Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) for reliable historical information compensation. This module evaluates retrieval confidence based on association footprints between queries and memory banks, and calibrates the retrieval metric space prior to Hopfield matching, realizing controllable historical feature compensation bounded to target regions. Extensive experiments on FE108, COESOT, VisEvent, and FELT datasets demonstrate the effectiveness of our proposed strategies for RGB-Event visual object tracking. The source code and pre-trained models will be released at this https URL
101. 【2606.26443】WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation
Link: https://arxiv.org/abs/2606.26443
Authors: Baiqi Li,Ce Zhang,Yu Fang,Yue Yang,Shangzhe Li,Mingyu Ding,Gedas Bertasius
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: working alongside people, robot working alongside, observed human behavior, working alongside, alongside people
Comments:
Click to view abstract
Abstract:A robot working alongside people must reason about what they have done, in what order, and with what intent. Video carries the spatial layouts, object histories, and gestures that language leaves underspecified, yet today's manipulation benchmarks pair an instruction with a single current image, offering no way to evaluate reasoning over observed human behavior. We introduce WatchAct, a benchmark for robot manipulation grounded in observed human behavior. Each instance pairs a real-world human-action video and a language instruction with an aligned simulator scene and an executable LIBERO task, enabling scalable and reproducible evaluation. WatchAct comprises 3,000 long-horizon instances across 14 tasks in four capability domains drawn from the cognitive demands of watching another agent: parsing events (Event Grounding), recovering procedural structure (Procedural Reasoning), inferring unstated intent (Implicit Intent Inference), and tracking how the scene was changed (Episodic Reasoning). We further propose a disentangled evaluation protocol that separately measures (i) video-to-plan reasoning by vision-language models, (ii) policy execution under oracle plans, and (iii) full task completion by integrated planner-policy pipelines. In both simulation and on a Franka Research 3 robot, current systems remain far from solving WatchAct. The best pipeline, Gemini-3.1-Pro with $\pi_{0.5}$, reaches only 16.3% Success Rate (SR) in simulation and 14.0% on the real robot. Gemini-3.1-Pro attains just 36.8% Plan SR (vs. 97.1% for humans), while $\pi_{0.5}$ reaches only 21.5% Task SR under oracle plans and drops to 10.6% on out-of-domain scenarios. Dataset and code are available at this https URL.
102. 【2606.26424】Rethinking Training Inference for Forecasting: Linking Winner-Take-All back to GMMs
Link: https://arxiv.org/abs/2606.26424
Authors: Qiyuan Wu,Katie Z Luo,Bharath Hariharan,Wei-Lun Chao,Mark Campbell
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: advanced rapidly, causing problems, Gaussian mixture models, forecasting for autonomous, autonomous driving
Comments: Accepted by ECCV 2026
Click to view abstract
Abstract:Trajectory forecasting for autonomous driving has advanced rapidly, yet representative models often produce uninformative posteriors over forecast modes, causing problems for mode pruning. We trace this to a modeling-training mismatch: forecasters are typically modeled as conditional Gaussian mixture models (GMMs) but trained with a winner-take-all (WTA) loss that assigns each sample to its nearest mode. We argue that this K-means-like hard assignment (one-hot), while preventing mode collapse, is the source of uninformative mode probabilities: it over-segments the trajectory space, ignores relatedness among nearby modes, and yields assignment instability under small perturbations. Guided by this lens, we introduce two post-hoc treatments: (1) test-time posterior-weighted merging that aggregates nearby candidate trajectories; and (2) a one-step expectation-maximization (EM) update that replaces hard labels with soft responsibilities, sharing probability mass across neighboring modes. Across several WTA-trained architectures, these lightweight steps produce more informative, faithfully ranked mode posteriors and strengthen final forecasts on popular displacement metrics -- without retraining. Our analysis unifies recent design choices through a GMM-vs-K-means perspective and offers principled, practical corrections that better align training objectives with inference.
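The contrast the abstract draws between a winner-take-all hard assignment and soft EM responsibilities can be illustrated with a small sketch. This is not the paper's code: the function names, the isotropic shared variance `sigma`, and the toy 2D "modes" are assumptions made only for illustration.

```python
import math

def hard_assignment(x, means):
    """WTA-style one-hot label: index of the mode nearest to x."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
    return dists.index(min(dists))

def soft_responsibilities(x, means, weights, sigma=1.0):
    """E-step of EM for an isotropic GMM with a shared sigma (an assumption):
    r_k is proportional to weights[k] * exp(-||x - means[k]||^2 / (2 sigma^2)).
    The Gaussian normalizer is identical for every mode, so it cancels."""
    logs = []
    for m, w in zip(means, weights):
        sq = sum((a - b) ** 2 for a, b in zip(x, m))
        logs.append(math.log(w) - sq / (2 * sigma ** 2))
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy modes: two nearby and one far away.
means = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
weights = [1 / 3, 1 / 3, 1 / 3]
x = [0.4, 0.0]
print(hard_assignment(x, means))
print(soft_responsibilities(x, means, weights))
```

In this toy setup the hard label credits only the nearest mode, while the soft responsibilities split the probability mass between the two nearby modes, which is the sharing behavior the abstract describes.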
103. 【2606.26416】Methane-Plume Segmentation From Hyperspectral Satellite Imagery Via Multimodal Deep Learning
Link: https://arxiv.org/abs/2606.26416
Authors: Brayan Quintero,Jeferson Acevedo,Samuel Traslaviña,Hoover Rueda-Chacón
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: mitigating global warming, earth observation imagery, observation imagery remain, imagery remain essential, global warming
Comments: Accepted at IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2026
Click to view abstract
Abstract:Efficient detection of methane plumes is crucial for understanding and mitigating global warming, as accurately identifying and segmenting them in earth observation imagery remain essential for large-scale monitoring. In this work, we propose a multimodal deep learning model that integrates a feature-guided methane enhancement (FGME) mechanism which injects physically meaningful methane cues into transformer-based RGB representations at multiple semantic scales. Our method is evaluated on the MPDataset, where it outperforms the state-of-the-art with improvements of +0.92 in MIoU, +0.87 in MPrecision and +1.01 in Recall. Notably, these gains are obtained with a substantially lower computational cost than other high-performing architectures, resulting in a favorable accuracy-efficiency trade-off for large-scale methane monitoring. These results highlight the potential of efficient multimodal fusion strategies for accurate and scalable methane plume segmentation in real-world remote sensing applications.
104. 【2606.26410】Neural Voxel Dynamics: Learning Implicit 3D Physics via Volumetric Feature Advection
Link: https://arxiv.org/abs/2606.26410
Authors: Zican Wang,Niloy Mitra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Volumetric Latent Space, present a self-supervised, self-supervised framework, framework for learning, physical dynamics directly
Comments:
Click to view abstract
Abstract:We present a self-supervised framework for learning implicit 3D physical dynamics directly from video-derived supervisory signals. While current generative video models achieve high visual fidelity, they lack a 3D geometric foundation, often resulting in physical inconsistencies and a failure to maintain object permanence. We address this by shifting the predictive bottleneck from 2D image space to a 'lifted' 3D Volumetric Latent Space. Our method unprojects semantic features from a Video Joint-Embedding Predictive Architecture (V-JEPA) into a voxelized grid, grounded by monocular depth priors. This lifting enables a Volumetric Feature Advection to learn an action-conditioned transition operator that treats physics as a spatio-temporal state advection problem, i.e., learn implicit 3D physics. Unlike state-of-the-art hybrid models that rely on explicit classical simulators for training and/or inference, our architecture tracks material states implicitly within high-dimensional V-JEPA features. This allows for the emergent simulation of heterogeneous phenomena (e.g., rigid body motion in fluid flow) within a single, unified pipeline. Supervised solely via end-to-end video-derived signal plus action conditions, without access to physics engine internal states, labels, or surrogate models, our model demonstrates good long-term structural stability and physical plausibility on multiple benchmarks (CLEVERER, PhysInOne, PhysGaia). We believe that this work opens a scalable pathway toward general-purpose dynamic world models that internalize the 3D invariants of the physical world solely through passive observation of monocular videos.
105. 【2606.26398】DinoLink: A Token-Centric Representation Compression Framework for Bandwidth-Constrained Collaborative V2X Perception
Link: https://arxiv.org/abs/2606.26398
Authors: Tianle Zhu,Haohua Que,Handong Yao,Hongyi Xu,Zhipeng Bao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe bandwidth constraints, High-precision remote perception, Residual Vector Quantization, High-precision remote, severe bandwidth
Comments:
Click to view abstract
Abstract:High-precision remote perception is often hindered by the severe bandwidth constraints of Vehicle-to-Everything (V2X) networks. We propose DinoLink, a token-centric compression framework that replaces raw pixel streaming with discrete semantic communication for vehicle-cloud collaborative inference. DinoLink employs a dual-sparsity architecture: a saliency-aware selector prunes redundant background tokens, while a Residual Vector Quantization (RVQ) module collapses features into compact codebook indices. By transmitting only lightweight indices and positional priors, DinoLink achieves a $139\times$ bitrate reduction compared to uncompressed transmission while maintaining a competitive 32.8% mAP on the nuScenes dataset. Deployment simulations further demonstrate a $34.5\times$ acceleration in narrow-band environments, such as LoRa. Our results substantiate DinoLink as a robust, bandwidth-efficient frontend for high-fidelity remote perception in constrained V2X scenarios. The code is publicly available at this https URL.
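For readers unfamiliar with Residual Vector Quantization, the following is a minimal sketch of the generic technique: each stage quantizes the residual left by the previous stage, and only the chosen indices need to be transmitted. It is not DinoLink's implementation; the function names and the tiny hand-written codebooks are hypothetical, and real codebooks would be learned.

```python
def nearest_index(codebook, vec):
    """Index of the codeword with the smallest squared distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)),
    )

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest_index(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical two-stage codebooks for 2D features.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],   # refinement stage
]
codes = rvq_encode([1.2, 0.8], codebooks)
print(codes, rvq_decode(codes, codebooks))
```

The receiver only needs the per-stage indices plus a shared copy of the codebooks, which is why this kind of scheme suits bandwidth-constrained links.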
106. 【2606.26387】Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs
Link: https://arxiv.org/abs/2606.26387
Authors: Xi Xiao,Chen Liu,Chih-Ting Liao,Yunbei Zhang,Qizhen Lan,Yuxiang Wei,Lin Zhao,Janet Wang,Jianyang Gu,Muchao Ye,Tianyang Wang,Hao Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Multimodal large language, extend large language, Multimodal large, large language models, enabling joint reasoning
Comments: ECCV 2026
Click to view abstract
Abstract:Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual inputs. Mechanistic studies indicate that this weakness stems from visual laziness: MLLMs encode the correct visual evidence internally, but overly rely on strong language priors during response. Existing alignment methods, such as direct preference optimization, primarily optimize outcome-level rewards based on text. This introduces an optimization bias toward linguistic shortcuts, leading to responses that often contradict the visual evidence. To address this, we propose Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the focus from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. We achieve this by penalizing "blind confidence" instances where the model remains improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Extensive experiments show that VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks without compromising text-only capabilities. Our approach matches the full-data performance of state-of-the-art methods using only 25% of the preference data and even demonstrates emergent spatial grounding capabilities without explicit bounding box supervision.
107. 【2606.26384】What Do Deepfake Benchmarks Measure? An Audit Using Frozen Self-Supervised Representations
Link:https://arxiv.org/abs/2606.26384
Authors:Samuel Pagon,Yixuan Shen,Vishal Asnani,Feng Liu
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:approach perceptual indistinguishability, generators approach perceptual, perceptual indistinguishability, deepfake generators approach, benchmarks
Comments: 14 pages, 9 figures
Abstract:As deepfake generators approach perceptual indistinguishability, reliable detection becomes critical. Yet, detectors that score well on benchmarks routinely fail in the wild. A concerning feedback loop has emerged: benchmarks drive increasingly complex, engineered detectors, yet if those benchmarks do not reflect real-world deepfakes, this complexity may be solving the wrong problem entirely. This raises a prior question: what are these benchmarks actually measuring? We conduct an audit of video, image, and audio deepfake benchmarks using a deliberately simple diagnostic. If a linear probe on frozen, general-purpose self-supervised representations can approximate the performance of a bespoke detector, the benchmark is largely rewarding general modality understanding rather than forensic understanding. This has two implications: the benchmark may not reflect realistic threat models, and it raises the question of whether the bespoke detectors the probe approaches are truly learning forensic understanding. We observe, across three modalities, linear probes on general-purpose self-supervised representations closely approach the performance of bespoke detectors. We further show that generator-level difficulty is partly explained by Frechet geometry in the same representation space. Together, these results support a benchmark-audit view of deepfake detection: before high scores are read as evidence of forensic understanding, it is worth asking how much of the benchmark is already solved by general-purpose representations.
108. 【2606.26379】Layer-Specific Prompt Fusion Discovery via Differentiable Search in Vision Foundation Models
Link:https://arxiv.org/abs/2606.26379
Authors:Xi Xiao,Xingjian Li,Yunbei Zhang,Cheng Han,Tianming Liu,Tianyang Wang,Runmin Jiang,Jihun Hamm,Xiao Wang,Min Xu
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:large-scale Vision Transformers, adapting large-scale Vision, parameter-efficient fine-tuning approach, Vision Transformers, Visual prompt tuning
Comments: ECCV 2026
Abstract:Visual prompt tuning has emerged as a parameter-efficient fine-tuning approach for adapting large-scale Vision Transformers (ViTs) to downstream tasks. As its learnable prompts are applied in input and feature spaces, prior to jointly going through attention in transformer layers, the most commonly used scheme for fusing image and prompt tokens is concatenation or addition. In this paper, we study a fundamental problem in visual prompt tuning: whether a single fusion scheme tends to yield better results, and whether it would be beneficial to develop a hybrid fusion scheme. To this end, we formulate the task as a bi-level optimization problem and solve it with differentiable architecture search. In this context, the learnable prompts and their fusion schemes are jointly optimized. To enrich the search space in the architecture search, we propose two additional fusion schemes, namely, affine transformation and cross-attention, in addition to concatenation and addition. Extensive experiments on 34 datasets spanning VTAB-1k, FGVC, and HTA show consistent gains over prompt-tuning baselines. With a frozen ViT backbone, our method delivers a favorable accuracy-latency-parameter trade-off compared with VPT-Deep and recent variants. Our findings reveal that how prompts fuse with image tokens plays a significant role in visual prompt tuning, and that a hybrid fusion scheme can more effectively leverage layer semantics of ViTs, contributing a novel perspective for visual prompt-tuning research.
109. 【2606.26295】Beyond Aesthetics: Quantifying Information Loss in Turbid Scenes
Link:https://arxiv.org/abs/2606.26295
Authors:Vasiliki Ismiroglou,Stefan H. Bengtson,Tasos Benos,Thomas B. Moeslund,Malte Pedersen
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:environments degrades rapidly, underwater environments degrades, models remain unclear, computer-vision models remain, remain unclear
Comments:
Abstract:Visibility in underwater environments degrades rapidly under turbid conditions, yet the effects on computer-vision models remain unclear. This issue is compounded by reliance on synthetic turbidity datasets, which may misrepresent real-world information loss. To address this gap, we introduce the Turbid Underwater Baseline (TUB) dataset, comprising 1,320 images captured under extreme turbidity and over 16,000 high-confidence ground-truth segmentation masks. We additionally propose PCD, a metric derived from phase congruency maps that is invariant to contrast and aims to capture the loss of structural information in real turbidity. We show that PCD correlates strongly with the performance of instance segmentation models on both real and synthetic turbid images, whereas common metrics in the field show weak to no correlation at all. The dataset and relevant code can be found on the project page: this https URL
110. 【2606.26287】GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models
Link:https://arxiv.org/abs/2606.26287
Authors:Chaoxiang Cai,Minghe Weng,Jie Li,Yibo Jiang,Longrong Yang,Zequn Qin,Xi Li
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Large Vision-Language Models, capabilities of Large, Large Vision-Language, training data, significantly improved
Comments:
Abstract:With the increase in model parameters and training data, the instruction-following and generalization capabilities of Large Vision-Language Models (LVLMs) have improved significantly. Based on the Mixture of Experts (MoE) architecture, LVLMs expand their parameter capacity while maintaining the inference cost. However, traditional MoE methods employ a static Top-k routing strategy, which fails to account for variations in the input and cannot adaptively select the number of experts, resulting in suboptimal resource utilization. In this paper, we propose viewing token routing as an information encoding task, framing dynamic routing as a Minimum Description Length (MDL) problem in encoding. By validating the connection between MDL and gating entropy in the MoE scenario, we introduce Gating Entropy-based Uncertainty-aware Adaptive Routing (GeMoE) for MoE. Unlike traditional static or heuristic-based dynamic routing methods, GeMoE explicitly models the trade-off between model complexity and performance. By using gating entropy to assess the complexity of tokens, GeMoE adaptively determines the number of experts each token should engage. On a wide range of backbones and benchmarks, our method achieves 99.5% average performance retention compared to the original static routing, while improving average expert activation sparsity by 36.5%.
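As a rough illustration of entropy-driven routing, the sketch below picks more experts for tokens whose gate distribution is uncertain. It is not the GeMoE algorithm itself: the linear mapping from normalized entropy to an expert count, and the names `adaptive_num_experts`, `route`, `k_min`, and `k_max`, are assumptions made for this example only.

```python
import math

def softmax(logits):
    """Convert gate logits to a probability distribution over experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gating_entropy(probs):
    """Shannon entropy (in nats) of a routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_num_experts(logits, k_min=1, k_max=4):
    """Hypothetical rule: scale the expert count linearly with normalized entropy."""
    probs = softmax(logits)
    h_norm = gating_entropy(probs) / math.log(len(probs))  # roughly in [0, 1]
    k = k_min + round(h_norm * (k_max - k_min))
    return max(k_min, min(k_max, k))

def route(logits, k_min=1, k_max=4):
    """Return indices of the selected experts, highest gate probability first."""
    probs = softmax(logits)
    k = adaptive_num_experts(logits, k_min, k_max)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

# A token with a sharply peaked gate gets one expert; a flat gate gets k_max.
confident_token = [8.0] + [0.0] * 7
uncertain_token = [1.0] * 8
print(route(confident_token), route(uncertain_token))
```

A static Top-k router would return the same number of experts for both tokens. Here the count varies with the entropy of the gate, which is the trade-off the abstract describes.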
111. 【2606.26279】Beyond Single-Source Cognitive Taskonomy: Multi-Source Task Relations through fMRI Transfer Learning
Link:https://arxiv.org/abs/2606.26279
Authors:Junfeng Xia,Wendu Li,Mengjiao Zhang,Jie Guo
Categories:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Keywords:Human Connectome Project, Boolean Integer Programming, specialized neural processes, specialized neural, Connectome Project task
Comments:
Abstract:Cognitive tasks are organized by shared and specialized neural processes. Masked fMRI reconstruction provides a common self-supervised objective for quantifying transfer relations among task states, but existing reconstruction-based taskonomies mainly study one-to-one transfer from a single source task to a target. Here, we extend an fMRI cognitive taskonomy from single-source to multi-source transfer across 23 Human Connectome Project task states and use Boolean Integer Programming (BIP) to analyze budget-constrained task allocation. We train 1,127 task-specific and transfer models. Single-source transfer is directional and paradigm structured: motor states transfer well within the motor paradigm but provide limited support to most non-motor targets, consistent with a shared sensorimotor execution system and effector-specific representations. Multi-source transfer depends on the composition of the source set, suggesting that many-to-one task relations are not fully captured by pairwise taskonomy alone. Across supervision budgets, BIP repeatedly allocates direct supervision to several 0-back and 2-back working-memory states, although these states are not consistently the strongest individual sources. This pattern may reflect the integration of perceptual, attentional, and executive processes in working-memory tasks. Together, these findings reveal a cross-paradigm-limited motor cluster and working-memory states with high priority under the specified global allocation objective. Our study extends reconstruction-based fMRI taskonomy from one-to-one transfer to many-to-one task relations and budget-constrained task dependencies.
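To make the budget-constrained allocation idea concrete, here is a toy sketch that enumerates every binary choice of source tasks by brute force. The objective (each target is scored by its best selected source), the 4x4 `transfer` matrix, and the task labels in the comments are all made up for illustration; the abstract does not spell out the paper's BIP formulation, and it covers 23 task states rather than four.

```python
from itertools import combinations

def best_allocation(transfer, budget):
    """Pick at most `budget` source tasks to receive direct supervision.

    transfer[s][t] is a hypothetical score for target t when source s is
    supervised. Each target is served by its best selected source.
    """
    n = len(transfer)
    best_score, best_sources = float("-inf"), ()
    for size in range(1, budget + 1):
        for sources in combinations(range(n), size):
            score = sum(max(transfer[s][t] for s in sources) for t in range(n))
            if score > best_score:
                best_score, best_sources = score, sources
    return best_sources, best_score

# Toy transfer matrix (rows: sources, columns: targets). Values are invented.
transfer = [
    [1.0, 0.2, 0.1, 0.1],  # task 0: a motor state
    [0.3, 1.0, 0.7, 0.6],  # task 1: a working-memory state
    [0.2, 0.7, 1.0, 0.5],  # task 2: another working-memory state
    [0.1, 0.5, 0.4, 1.0],  # task 3: a language state
]
print(best_allocation(transfer, budget=1))
print(best_allocation(transfer, budget=2))
```

Brute force is only practical for a handful of tasks; a real integer-programming solver would replace the enumeration at larger scale.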
112. 【2606.26260】A multi-task spatiotemporal deep neural network for predicting penetration depth and morphology in laser welding
Link:https://arxiv.org/abs/2606.26260
Authors:Sen Li,Haichao Cui,Chendong Shao,Yaqi Wang,Xinhua Tang
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:weld seam morphology, seam morphology plays, seam morphology, laser penetration welding, weld seam
Comments:
Abstract:In laser penetration welding, assessing the penetration state and weld seam morphology plays a crucial role in determining weld quality. This paper introduces a multi-task deep learning model that predicts penetration state, penetration depth, and weld seam morphology with high accuracy. The monitoring platform relies on weld pool images captured during the laser welding process using a complementary metal-oxide-semiconductor camera. The proposed model integrates spatiotemporal features extracted from top weld pool images along with welding parameters, establishing a deep learning framework based on convolutional neural networks and state space models for more efficient extraction and processing of spatial-temporal information. Furthermore, a reliable method for constructing the dataset is proposed to enhance both the robustness and the generalization capability of the developed model. Validation results on the test set demonstrate that prediction accuracy for penetration state reaches 99.35%, the prediction error for penetration depth is 1.79 millimeters, and the accuracy of reconstructing the weld cross-section is 95.65%. This study provides new insights and methodologies for in-situ quality control strategies in laser penetration welding systems.
113. 【2606.26217】Fast LeWorldModel
Link:https://arxiv.org/abs/2606.26217
Authors:Yuntian Gao,Xiangyu Xu
Categories:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords:Joint-Embedding Predictive Architectures, Predictive Architectures, Joint-Embedding Predictive, including recent LeWorldModel, reconstruction-free visual world
Comments:
Abstract:Joint-Embedding Predictive Architectures (JEPAs), including recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition model. This autoregressive rollout makes planning computationally expensive and exposes the predicted trajectory to accumulated latent errors as the horizon grows. We propose Fast LeWorldModel (Fast-LeWM), a fast latent world model that replaces repeated local rollout with action-prefix prediction. Given the current latent and a candidate action sequence, Fast-LeWM encodes its prefixes and predicts the future latents reached after executing those prefixes in parallel. By making action prefixes the basic prediction unit, Fast-LeWM directly models action effects accumulated to different extents over multiple horizons. This prefix-level supervision forces the model to learn how states continuously evolve under different action prefixes, rather than only fitting one-step state transitions. During planning, the predictor can use the last prefix token from the encoded action sequence to evaluate the corresponding future latent without explicitly rolling through each intermediate imagined state. Across multiple tasks, Fast-LeWM improves average success over LeWM while substantially reducing planning time, achieving lower open-loop latent loss whose growth becomes significantly slower as the rollout horizon increases.
114. 【2606.26215】TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes
Link:https://arxiv.org/abs/2606.26215
Authors:Blake Werner,Ilona Demler,Pietro Perona,Aaron D. Ames
Categories:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords:dynamic skills, tennis backhand, tennis tournaments, teaching dynamic skills, tennis
Comments:
Abstract:How do we learn to hit a tennis backhand? Not from a thousand hours of tennis tournaments on TV - we work with a coach and practice. We argue this is also the right recipe for teaching dynamic skills to humanoid robots. This follows from a structural property of dynamic skills: the outcome is decided by a short, crucial portion of the trajectory - for a backhand, the ~20cm of racket travel around ball contact. Getting this interaction window right requires coordinating the whole motion, so that control, physics, and morphology act in concert. Learning thus reduces to mastering a handful of distinct actions and, for each, practicing until the window comes out right. To this end, we introduce TaskNPoint, a training protocol which makes the coach-learner division of labor explicit. The human coach contributes four inputs: a discrete set of skills (e.g. different shots), one demonstration per skill, identification of the interaction window, and the goal. Learning in a physically realistic simulation environment fills in each action trajectory and provides robustness to unmodeled events. Crucially, randomized target sampling during training lets a single demonstration generalize zero-shot to unseen goal locations. We test this approach on a Unitree G1 humanoid that hits forehands and backhands against balls thrown by a human, kicks incoming soccer balls, and picks and places boxes from novel locations. We find that learning is successful from short human video demonstrations and under an hour of training on a single GPU, with no per-task reward tuning.
115. 【2606.26196】From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
Link:https://arxiv.org/abs/2606.26196
Authors:Haoxiang Sun,Tao Wang,Li Yuan,Jian Zhao,Jiancheng Lv
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords:Large Language Models, Multimodal Large Language, recently made remarkable, made remarkable progress, DeepSeek R-series
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective, one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we (1) formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, (2) introduce a five-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and (3) identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence (AGI).
116. 【2606.26194】Self-Supervised Tree-level Biomass Estimation in Urban Environments From Airborne LiDAR and Optical Observations
Link:https://arxiv.org/abs/2606.26194
Authors:Jose Bermudez(1),Zilong Zhong(1),Dominic Cyr(2),Camile Sothe(3),Alemu Gonsamo(1) ((1) McMaster University, Hamilton, Ontario, Canada, (2) Environment and Climate Change Canada, Montreal, Quebec, Canada, (3) Planet Labs PBC, San Francisco, California, USA)
Categories:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords:spatially explicitly quantified, Urban tree biomass, resolve individual crowns, tree biomass remains, Urban tree
Comments:
Abstract:Urban tree biomass remains less spatially explicitly quantified than biomass in managed forests because many estimates rely on inventories or coarse products that cannot resolve individual crowns or fine-scale heterogeneity. We present a crown-level above-ground biomass (AGB) framework for an 810 km$^2$ landscape in Ontario, Canada, using leaf-off airborne LiDAR (8-10 pulses m$^{-2}$) and near-infrared RGB orthophotography (0.16-0.20 m) from 2018 and 2023. A dual-stream cross-attention network trained on rule-based pseudo-labels produced semantic marks for buildings, needleleaf trees, and deciduous trees, supporting crown delineation and functional-type assignment. On independently annotated withheld tiles, global/mean precision, recall, and Dice scores were 0.86, 0.83, and 0.84. Crowns were delineated with multiscale watershed segmentation in mapped tree areas, and AGB was estimated from a crown area-height power-law proxy calibrated to species-specific allometry (Lambert et al., 2005) for 21,921 inventory trees. For 18,713 inventory-segment matched pairs from a 90,726-tree held-out test set, AGB prediction achieved $R^2=0.609$ using inventory crown geometry and $R^2=0.570$ under operational segmentation, identifying crown delineation as the remaining uncertainty source. Aggregated to 30 m, estimates yielded total AGB stocks of 1.73 Tg in 2018 and 1.81 Tg in 2023 (811-850 Gg C), local densities up to ~140 Mg ha$^{-1}$ along the Niagara Escarpment, and a net carbon gain of 39 Gg C over five years. Deep-ensemble uncertainty maps highlighted high-epistemic-uncertainty areas linked to underrepresented land covers and guided assignment of uncertain crowns to a pooled allometric equation. The framework uses standard provincial data, requires no manual annotation, and produces a public bitemporal crown-level AGB database for trees outside forests at management-relevant resolution.
117. 【2606.26171】LCG: Long-Context Consistent Image Generation with Sparse Relational Attention
Link:https://arxiv.org/abs/2606.26171
Authors:Zihao Wang,Yijia Xu,Haoze Zheng,Xuran Ma,Haokun Gui,Harry Yang
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:models achieve impressive, achieve impressive quality, Recent image generation, generation models achieve, Sparse Relational Attention
Comments:
Abstract:Recent image generation models achieve impressive quality in single-image synthesis, but often fail to maintain consistency across sequential outputs, as required in comics, storyboards, and visual narratives. We propose Long-Context Generation (LCG), a framework for long-context multi-image text-to-image generation, to improve consistency and scalability in long-context multi-image generation. LCG employs the Sparse Relational Attention (SRA) mechanism to selectively attend to core features across extended visual contexts, ensuring that the propagation of semantic and layout information remains computationally tractable. To enforce semantic alignment, we introduce the Routing Consistency Constraint (RCC), which leverages identity-aware masks to align structural patterns across generation branches, effectively mitigating drift in appearance even in complex multi-character scenes. To support training and evaluation in this setting, we construct the Long-Context Consistency Dataset (LCCD), a large-scale synthetic dataset comprising character-centric multi-image sequences spanning varied situational contexts. LCCD contains 600K training sequences and a separate 1K test set, with each sequence containing 6 to 20 images. The experiments demonstrate that LCG outperforms the compared baselines in prompt alignment and character consistency for long-context image generation, including multi-character scenes.
118. 【2606.26165】Predicting Fruit Quality with a Hybrid Machine Learning and Image Processing Approach
Link:https://arxiv.org/abs/2606.26165
Authors:Amir Reza Hashemi,Shahram Amiri
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:substantial economic losses, issue in agriculture, leading to substantial, economic losses, image processing
Comments: 22 pages, 13 figures, 2 tables
Abstract:Fruit spoilage is a significant issue in agriculture, leading to substantial economic losses. Addressing this, our study introduces a hybrid approach combining image processing and deep learning to assess fruit freshness. We developed an image processing algorithm that quantifies spoilage on a scale from 0 (fully fresh) to 100 (fully rotten). Alongside, we trained a convolutional neural network (CNN) to perform binary classification (fresh or rotten) using a large dataset of fruit images. The outcomes of both methods were synthesized using logistic regression to enhance the accuracy of freshness predictions. Subsequently, this logistic regression model was utilized to enable the image processing algorithm to provide binary classification based on its percentage output, thus eliminating the need for the CNN in real-time applications. Our approach, which does not require high computational resources, achieved real-time performance and was validated with over 90% accuracy on a dataset comprising apples and oranges. The primary limitation lies in the requirement for fruits to be isolated on a background that must be either white or transparent, suggesting future improvements could include advanced segmentation models to automate background removal. This study's results highlight the potential of integrating simple image processing techniques with machine learning to provide practical solutions in the agricultural sector.
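The step where a logistic regression turns the 0-100 spoilage score into a fresh/rotten decision can be sketched as follows. This is a minimal one-feature version under stated assumptions: the training scores and labels are invented toy data (in the paper the labels would come from the CNN-based pipeline), and `fit_logistic` and `is_rotten` are hypothetical helper names.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(scores, labels, lr=1.0, steps=2000):
    """Fit p(rotten) = sigmoid(w * score / 100 + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for s, y in zip(scores, labels):
            x = s / 100.0  # rescale the 0-100 spoilage score to [0, 1]
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def is_rotten(score, w, b):
    """Binary decision from the spoilage percentage alone (no CNN at inference)."""
    return sigmoid(w * score / 100.0 + b) >= 0.5

# Invented toy data: spoilage scores with fresh (0) / rotten (1) labels.
scores = [5, 10, 20, 30, 40, 60, 70, 80, 90, 95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
w, b = fit_logistic(scores, labels)
print(is_rotten(25, w, b), is_rotten(75, w, b))
```

Once `w` and `b` are fitted, inference is a single multiply-add and a comparison. This is consistent with the abstract's claim that the CNN is not needed in real-time use.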
119. 【2606.26122】DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents
Link:https://arxiv.org/abs/2606.26122
Authors:Jiamian Wang,Ruiyi Zhang,Tong Yu,Jing Shi,Samyadeep Basu,Rajiv Jain,Zhiqiang Tao,Tong Sun
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Recent methods train, requiring expert trajectories, Recent methods, methods train search, expert trajectories
Comments: search agent for documents
Abstract:Recent methods train search agents via reinforcement learning from (question, answer, evidence) tuples without requiring expert trajectories. The tuples serve as the training environment, whose properties directly shape what search strategies and generalization abilities the agent can develop. While prior work has made encouraging progress in improving training data quality, existing environments remain predominantly text-based, and existing approaches can struggle to construct training environments that are controllable, scalable, and account for multimodal data. To address this, we propose DocArena, a fully automated data curation pipeline motivated by the practical need for multimodal document search and question answering. It transforms raw document collections into training environments for search agents without any human annotation. The pipeline first structures and indexes documents through MLLM-based visual perception, then profiles and leverages the cross-page information distribution to construct reasoning-intensive QA pairs, and finally performs cascaded quality assurance operations via an MLLM. We introduce DocArena-79K with QA pairs from 8,336 documents spanning 16 domains and 49 languages. We further design a Doc-Search agent infrastructure that decouples visual perception from the policy model, allowing text-based LLMs to serve as the reasoning backbone for multimodal document retrieval and QA. Under a unified evaluation framework where only the policy model differs, experiments on six multimodal document scenarios and seven text-based QA benchmarks show that agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality. Further analysis of agent search behaviors confirms the effectiveness and controllability of the constructed training environment.
120. 【2606.26121】Dot-Flik: A Scalable Edge AI Architecture for Distributed Insect Monitoring
Link:https://arxiv.org/abs/2606.26121
Authors:Mattia Consani,Denisa-Andreea Constantinescu,Åse Håtveit,Titus Venverloo,Fabio Duarte,Carlo Ratti,David Atienza
Categories:Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:Global insect population, declines necessitate scalable, population declines necessitate, existing vision-based solutions, vision-based solutions remain
Comments:
Abstract:Global insect population declines necessitate scalable, continuous monitoring systems, yet existing vision-based solutions remain constrained by high hardware costs, energy demands, and reliance on centralized processing or cloud connectivity. This article presents three contributions to address these limitations. First, we propose a motion-informed frame filtering algorithm based on temporal differencing, gamma-corrected motion amplification, and block-based motion density analysis that discards irrelevant frames at the edge while preserving insect activity, without requiring deep learning inference on the sensing device. Second, we introduce a distributed, hierarchical IoT architecture that decouples data acquisition from AI classification through this edge-level preprocessing, projecting fractional scaling of central processing requirements and significantly increasing monitoring coverage compared to monolithic single-stream approaches. Third, we validate the complete system through real-world outdoor deployments on low-cost commodity hardware along four axes: real-time performance, network scalability, hardware cost, and energy efficiency under varying wind conditions. Results demonstrate 60-80% frame reduction under light-wind conditions, sustained real-time 30 FPS operation with 12.8 ms of computational headroom, up to 22.6% energy savings, and support for 5-6 concurrent edge streams per central node. These findings establish a practical foundation for dense, low-cost biodiversity monitoring networks in urban environments.
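The three filtering stages named in the abstract (temporal differencing, gamma-corrected amplification, block-based motion density) might be sketched roughly as below. Frames are plain nested lists of 8-bit grayscale values. The thresholds, the block size, and the rule "keep the frame if any block is dense enough" are assumptions for illustration, not the paper's actual parameters.

```python
def motion_map(prev, curr, gamma=0.5):
    """Absolute temporal difference, normalized to [0, 1], then gamma-amplified.

    With gamma < 1, small differences are boosted (x ** 0.5 > x for 0 < x < 1).
    """
    return [[(abs(c - p) / 255.0) ** gamma for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]

def block_density(motion, block=4, pixel_thresh=0.2):
    """Fraction of 'moving' pixels in each non-overlapping block x block tile."""
    h, w = len(motion), len(motion[0])
    densities = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            moving = sum(motion[y + dy][x + dx] > pixel_thresh
                         for dy in range(block) for dx in range(block))
            densities.append(moving / (block * block))
    return densities

def keep_frame(prev, curr, gamma=0.5, block=4,
               pixel_thresh=0.2, density_thresh=0.25):
    """Hypothetical decision rule: keep the frame if any block is dense enough."""
    densities = block_density(motion_map(prev, curr, gamma), block, pixel_thresh)
    return max(densities, default=0.0) >= density_thresh

# Toy 8x8 frames: a static background, and one with a bright 3x3 blob.
background = [[100] * 8 for _ in range(8)]
with_insect = [row[:] for row in background]
for y in range(3):
    for x in range(3):
        with_insect[y][x] = 160
print(keep_frame(background, background), keep_frame(background, with_insect))
```

Everything here is integer differences and comparisons per pixel, which matches the abstract's point that no deep learning inference runs on the sensing device.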
121. 【2606.26716】Dual-Prior Guided Null-Space Learning with Mixture-of-Splines for Arbitrary Medical Slice Super-Resolution
Link:https://arxiv.org/abs/2606.26716
Authors:Haofei Song,Siyuan Xu,Xintian Mao,Shaojie Guo,Qingli Li,Yan Wang
Categories:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:super-resolution reconstructs isotropic, reconstructs isotropic volumes, anisotropic clinical acquisitions, Arbitrary slice super-resolution, slice super-resolution reconstructs
Comments: Accepted to ECCV 2026! Project page: [this https URL](https://github.com/DeepMed-Lab-ECNU/Medical-Image-Reconstruction)
Abstract:Arbitrary slice super-resolution reconstructs isotropic volumes from anisotropic clinical acquisitions by synthesizing intermediate slices at arbitrary scales. However, treating this ill-posed inverse problem as unconstrained residual-based regression risks hallucinating anatomically implausible structures or altering the originally observed data. To address both concerns, this paper presents the Dual-Prior Null-space Learning (DP-NSL) framework, which reformulates the task as a constrained recovery process guided by two complementary priors. A Measurement-Consistent Projection (MCP) enforces a Deterministic Observation Prior: the reconstruction undergoes an exact orthogonal projection that reproduces every acquired slice with zero error, confining all learned details to the unobservable null space. Within this null space, a Mixture-of-Splines (MoS) module imposes a Geometric Continuity Prior by dynamically mixing B-spline experts of different analytic orders, allowing each anatomical region to be modeled with a content-aware level of continuity. To promote spatial coherence, a Local Spatial Consistency Decoder (LSCD) further injects local inductive bias. Experiments on three CT and one MRI benchmark show that DP-NSL outperforms existing approaches while strictly preserving measurement consistency. Code is available at this https URL.
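The measurement-consistent projection can be illustrated with the simplest possible observation model. Assume, purely for this sketch, that acquisition is a slice-selection operator $A$ that keeps every `stride`-th slice of the volume. Then $A A^\top = I$, and the orthogonal projection $x_{\text{proj}} = \hat{x} + A^\top (y - A\hat{x})$ reduces to overwriting the acquired positions with the measured slices and leaving all other (null-space) slices as predicted. The paper's actual operator may differ, and the function name below is hypothetical.

```python
def measurement_consistent_projection(prediction, acquired, stride):
    """Project a predicted volume onto the set {x : A x = y}.

    prediction: list of slices (each a list of floats), the network output x_hat.
    acquired:   list of measured slices y, located at indices 0, stride, 2*stride, ...
    For a pure slice-selection A, x_hat + A^T (y - A x_hat) equals x_hat with
    the acquired positions replaced by y.
    """
    expected = len(range(0, len(prediction), stride))
    if len(acquired) != expected:
        raise ValueError(f"expected {expected} acquired slices, got {len(acquired)}")
    projected = [list(s) for s in prediction]
    for k, measured in enumerate(acquired):
        projected[k * stride] = list(measured)  # range-space part comes from y
    return projected

# Toy volume: 5 predicted slices of 2 voxels, with slices 0, 2, 4 acquired.
x_hat = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
y = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(measurement_consistent_projection(x_hat, y, stride=2))
```

The acquired slices are reproduced exactly, and only the intermediate slices carry learned content, which is the property the abstract describes as confining learned details to the null space.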
122. 【2606.26712】MLFFM-SegDiff: A Multi-Level Feature Fusion Diffusion Model for Skin Lesion Segmentation
Link:https://arxiv.org/abs/2606.26712
Authors:Jingjun Gu,Chaojie Shen,Yifeng Cao,Wei Zhang,Yiliu Li,Aobo Fan
Categories:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:computer-aided dermatological diagnosis, directly impacts downstream, impacts downstream analysis, accuracy directly impacts, multi-level feature fusion
Comments:
Abstract:Skin lesion segmentation is a key task in computer-aided dermatological diagnosis, where accuracy directly impacts downstream analysis and disease classification. However, dermoscopic images are challenging due to blurred boundaries, low contrast, large shape variations, and artifacts such as hair and shadows. Recently, diffusion models have shown strong performance in medical image segmentation thanks to their progressive denoising and distribution modeling capabilities. Nevertheless, existing diffusion-based methods still suffer from limited cross-level feature interaction and insufficient boundary detail recovery. To address these issues, we propose MLFFM-SegDiff, a multi-level feature fusion diffusion model for skin lesion segmentation. Built on a diffusion framework, the method introduces a dual-path U-Net encoder, a Multi-Level Feature Fusion Module (MLFFM), and a boundary-sensitive loss function. The dual-path encoder enhances interaction between noisy mask features and dermoscopic image features. MLFFM improves skip connections via attention, scale alignment, and adaptive cross-level fusion. These designs enable the decoder to jointly leverage shallow boundary cues and deep semantic representations, improving mask reconstruction quality. Experiments on ISIC2018, PH2, and HAM10000 demonstrate that MLFFM-SegDiff outperforms representative methods including DermoSegDiff, U-Net, and SwinUNETR across Accuracy, F1-score, Jaccard index, Recall, and Dice. In particular, it achieves an average Jaccard index of 0.8546 and Dice coefficient of 0.9207. These results validate the effectiveness of the proposed multi-level feature fusion strategy for improving lesion segmentation performance. The code will be released at this https URL after publication.
123. 【2606.26431】Revealing Mammographic Phenotypes in Deep Learning Breast Cancer Risk Models
Link: https://arxiv.org/abs/2606.26431
Authors: Ruiyu Jia, Yanqi Xu, Yuxuan Chen, Yiqiu Shen, Laura Heacock
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Mammogram-based deep learning, Mammogram-based deep, improved breast cancer, patterns remain underexplored, deep learning models
Comments:
Click to view abstract
Abstract:Mammogram-based deep learning models have improved breast cancer risk prediction, but the learned imaging patterns remain underexplored. Existing interpretability methods rely on single-image saliency maps, failing to identify recurring mammographic phenotypes across large patient cohorts. By clustering patch embeddings from a pre-trained model, Mirai, we isolate recurring phenotypes linked to 5-year cancer risk. Analyses show risk-increasing phenotypes capture complex structures (e.g., dense tissue, microcalcifications) and shortcut artifacts (e.g., clips). These phenotypes correlate strongly with older age and higher BI-RADS density. Our framework connects tissue patterns to AI risk scores, revealing clinical signatures and potential latent model confounders.
124. 【2606.26312】Tailor Made Embeddings for Quantum Machine Learning
Link: https://arxiv.org/abs/2606.26312
Authors: Aldo Lamarre, Dominik Šafránek
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: enabling principled weight, principled weight initialization, Autoencoders transformed classical, transformed classical machine, classical machine learning
Comments: 17 pages, 17 figures
Click to view abstract
Abstract:Autoencoders transformed classical machine learning by solving the curse of dimensionality, enabling principled weight initialization and learning compact, structured representations. In this work, we extend this paradigm to quantum machine learning by introducing a variational autoencoder framework that learns task-specific quantum embeddings of classical data. We demonstrate that high-dimensional datasets, including ImageNet, can be compressed into a 13-qubit quantum representation while remaining reconstructable through a learned decoder. On MNIST (3 vs 5), our approach achieves 98.5% validation accuracy using a circuit-centric quantum classifier, within 1.2 percentage points of a classical neural network baseline (99.7%) and more than 30 percentage points above a naive amplitude-embedding approach. Unlike amplitude embeddings, which require full quantum state tomography for recovery, or angle embeddings, which generally rely on circuit inversion under restrictive assumptions, the proposed framework reconstructs the original data from only a polynomial number of measurements. The framework was further validated on IBM quantum hardware, confirming that the learned embeddings remain stable and reconstructable under real device noise.
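For context on the baseline mentioned in this abstract, here is a minimal sketch of textbook amplitude embedding: a classical vector is zero-padded to length 2^n and L2-normalized so that its entries can serve as the amplitudes of an n-qubit state. This is not the paper's learned embedding or its code; the function name `amplitude_embed` and the zero-padding choice are assumptions made for illustration.

```python
import math


def amplitude_embed(values, n_qubits):
    """Zero-pad `values` to length 2**n_qubits and normalize to unit L2 norm."""
    dim = 2 ** n_qubits
    if len(values) > dim:
        raise ValueError("vector does not fit into the given number of qubits")
    padded = list(values) + [0.0] * (dim - len(values))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("the zero vector cannot be normalized into a state")
    return [v / norm for v in padded]


# Toy example: a 3-component vector embedded into 2 qubits (4 amplitudes).
state = amplitude_embed([3.0, 4.0, 0.0], n_qubits=2)

# Number of amplitudes available in a 13-qubit register.
amplitudes_13_qubits = 2 ** 13
```

The squared amplitudes sum to 1, as required of a quantum state. A 13-qubit register has 2^13 = 8192 amplitudes, which gives a sense of the compression needed to fit high-dimensional images into it.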
125. 【2606.26236】Rendering Novel Views of MRI Using 3D Gaussian Splatting
Link: https://arxiv.org/abs/2606.26236
Authors: Robin Y. Park, Mark C. Eid, Rhydian Windsor, Amir Jamaludin, Ana I.L. Namburete, João F. Henriques, Andrew Zisserman
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: improve radiological gradings, radiological gradings measured, imaging planes aligned, Gaussian Splatting, sparse anisotropic MRIs
Comments:
Click to view abstract
Abstract:The objective of this paper is to improve radiological gradings measured on MRIs of spines, by resampling scans so that the new view planes are better aligned with the target anatomy than the original sparse images. To this end, we adapt 3D Gaussian Splatting to form a volumetric reconstruction starting from sparse anisotropic MRIs, and imaging planes aligned with the anatomy relevant for clinical evaluation are then sampled and rendered. The novel view plane is optimal for diagnostic radiological grading of the target anatomy, whereas the original MRI is not. The resampled scans are then used to predict ordinal severity grades of localised stenosis conditions in spinal MRIs. We compare our method against Voxel Interpolation resampling, which takes the average of inverse-distance weighted nearest neighbour intensities for each target coordinate. Experiments show that across all stenosis conditions, resampled scans using Gaussian Splatting produce more accurate stenosis gradings compared to the raw scans which do not include the complete anatomy in-plane, as well as images resampled using Voxel Interpolation.
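The Voxel Interpolation baseline is described above as an average of inverse-distance weighted nearest-neighbour intensities for each target coordinate. A minimal sketch of that idea follows. It is not the authors' implementation: the function name `idw_intensity`, the neighbour count `k`, the plain 1/d weighting, and the sample format are all assumptions made for illustration.

```python
import math


def idw_intensity(target, samples, k=4, eps=1e-12):
    """Estimate the intensity at `target` from its k nearest samples.

    `samples` is a list of ((x, y, z), intensity) pairs. Each neighbour is
    weighted by the inverse of its Euclidean distance to `target`.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    # Sort samples by distance to the target and keep the k closest.
    by_distance = sorted((math.dist(target, coord), value) for coord, value in samples)
    nearest = by_distance[:k]
    if nearest[0][0] < eps:
        # The target coincides with a sample: return its intensity directly.
        return nearest[0][1]
    weights = [1.0 / d for d, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Toy data: two voxels along one axis (hypothetical intensities).
samples = [((0.0, 0.0, 0.0), 10.0), ((2.0, 0.0, 0.0), 30.0)]
midpoint = idw_intensity((1.0, 0.0, 0.0), samples, k=2)
off_centre = idw_intensity((0.5, 0.0, 0.0), samples, k=2)
on_sample = idw_intensity((0.0, 0.0, 0.0), samples, k=2)
```

At the midpoint both neighbours get equal weight, so the estimate is the plain mean. Closer to one voxel, that voxel dominates the weighted average.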

