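This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

553 papers were updated today, including:

Natural language processing: 77
Information retrieval: 13
Computer vision: 83

Natural Language Processing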

1. 【2606.14703】Gaze Heads: How VLMs Look at What They Describe

Link: https://arxiv.org/abs/2606.14703

Authors: Rohit Gandikota, David Bau

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: vision-language model internally, model internally solves, internally solves, solves the task, heads

Abstract:How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at this https URL
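
The abstract does not spell out the correlation score, so the following is only a minimal sketch of one plausible reading: for each head, correlate the attention mass it places on each panel with a one-hot indicator of the panel being described at that step. The function name `gaze_scores`, the array layout, and the synthetic data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def gaze_scores(attn, described):
    """Pearson correlation, per head, between attention mass and the described panel.

    attn:      (heads, steps, panels) attention mass each head puts on each panel
    described: (steps,) index of the panel being described at each step
    """
    heads, steps, panels = attn.shape
    target = np.eye(panels)[described].ravel()      # one-hot indicator, flattened
    target = target - target.mean()
    flat = attn.reshape(heads, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(flat, axis=1) * np.linalg.norm(target)
    return (flat @ target) / np.where(denom == 0, 1.0, denom)

# Synthetic stand-in for attention maps: 8 heads, 40 steps, 4 comic panels.
rng = np.random.default_rng(0)
steps, panels = 40, 4
described = np.repeat(np.arange(panels), steps // panels)
attn = rng.dirichlet(np.ones(panels), size=(8, steps))
# Hypothetical "gaze head": head 3 mostly attends to the panel being described.
attn[3] = 0.9 * np.eye(panels)[described] + 0.1 / panels

scores = gaze_scores(attn, described)
top = int(np.argmax(scores))
print(top, np.round(scores, 2))
```

In this toy setup the planted head is the only one whose attention is an affine function of the indicator, so it receives the highest score; real attention maps would be far noisier.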

2. 【2606.14697】ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Link: https://arxiv.org/abs/2606.14697

Authors: Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Building trustworthy medical, large language models, clinical decision support, multimodal large language, reliable clinical decision

Comments: Code and datasets: [this https URL](https://github.com/alibaba-damo-academy/ClinHallu)

Abstract:Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at this https URL.

3. 【2606.14695】Persona-Pruner: Sculpting Lightweight Models for Role-Playing

Link: https://arxiv.org/abs/2606.14695

Authors: Jinsu Kim, Jihoon Tack, Noah Lee, Jongheon Jeong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: shown remarkable potential, Language Models, delivering consistent, stylized interactions, shown remarkable

Comments: 25 pages; ICML 2026; Code is available at [this https URL](https://github.com/jsu-kim/Persona-Pruner)

Abstract:Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model's total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at this https URL.

4. 【2606.14694】AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Link: https://arxiv.org/abs/2606.14694

Authors: Junlong Tong, Wenqi Xu, Yingqi Fan, Anhao Zhao, Xuan Lu, Yang Tan, Xiaoyu Shen

Categories: Computation and Language (cs.CL)

Keywords: Large reasoning models, models typically follow, Large reasoning, static context, produce the answer

Abstract:Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with a supervised fine-tuning baseline. We release our code at this https URL.
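
To make the contrast with a single sequence-level advantage concrete, here is a rough sketch of phase-wise advantage assignment. It assumes each rollout in a group receives one scalar reward for its streaming phase and one for its final deliberation phase, and that each is normalized within the group in the usual group-relative way. The function names, the reward split, and the numbers are illustrative assumptions; the abstract does not give HRPO's actual formulas.

```python
import numpy as np

def group_relative(rewards, eps=1e-8):
    """Standardize rewards within a rollout group (zero mean, unit variance)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def phase_token_advantages(stream_rewards, final_rewards, stream_lens, final_lens):
    """Give every token the advantage of the phase it belongs to.

    A sequence-level scheme would broadcast one number over all tokens of a
    rollout; here streaming tokens and final-deliberation tokens are scored
    separately (an assumed simplification).
    """
    a_stream = group_relative(stream_rewards)
    a_final = group_relative(final_rewards)
    per_rollout = []
    for a_s, a_f, n_s, n_f in zip(a_stream, a_final, stream_lens, final_lens):
        per_rollout.append(np.concatenate([np.full(n_s, a_s), np.full(n_f, a_f)]))
    return per_rollout

# Four hypothetical rollouts for the same prompt.
adv = phase_token_advantages(
    stream_rewards=[1.0, 0.0, 1.0, 0.0],
    final_rewards=[0.0, 0.0, 1.0, 1.0],
    stream_lens=[3, 2, 4, 1],
    final_lens=[2, 2, 2, 2],
)
for a in adv:
    print(np.round(a, 2))
```

The first rollout illustrates the point: its streaming tokens get a positive advantage while its final-phase tokens get a negative one, a distinction that a single sequence-level value cannot express.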

5. 【2606.14691】CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Link: https://arxiv.org/abs/2606.14691

Authors: Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement learning, large language models, Group Relative Policy, motivating its extension, Relative Policy Optimization

Comments: Submitted to EMNLP 2026

Abstract:Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present during inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.

6. 【2606.14688】Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

Link: https://arxiv.org/abs/2606.14688

Authors: Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)

Keywords: generate formal mathematics, binding constraint, systems coupled, assistants now generate, verifiable formal language

Abstract:AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.
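
The gap quoted in the abstract is simply the difference between the two optimal coverage values. As a quick check, with an illustrative density of $\alpha = 0.4$ chosen here only as an example:

```latex
\[
\underbrace{\left(1 - \frac{\alpha}{2}\right)}_{\text{infinitely many trivia}}
\;-\;
\underbrace{\frac{\alpha}{2}}_{\text{finitely many trivia}}
\;=\; 1 - \alpha,
\qquad
\alpha = 0.4:\quad 0.8 - 0.2 = 0.6 .
\]
```

So the lower the density of the recorded core, the more coverage hinges on allowing an unbounded stream of trivia.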

7. 【2606.14674】AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

Link: https://arxiv.org/abs/2606.14674

Authors: Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin

Categories: Computation and Language (cs.CL)

Keywords: single model calls, increasingly built, scaffolded systems, systems that combine, LLM agents

Abstract:LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at this https URL.

8. 【2606.14672】Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Link: https://arxiv.org/abs/2606.14672

Authors: Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, language models increasingly, models increasingly serve, Large language, sequential text interface

Abstract:Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

9. 【2606.14654】Abstracting Cross-Domain Action Sequences into Interpretable Workflows

Link: https://arxiv.org/abs/2606.14654

Authors: Gaurav Verma, Scott Counts

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: provide objective records, obscure meaningful insights, logs provide objective, digital application usage, time-stamped interaction logs

Comments: preprint; 9 pages, 5 figures

Abstract:Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

10. 【2606.14626】Characterizing Cultural Localization in AI-Generated Stories

Link: https://arxiv.org/abs/2606.14626

Authors: Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer, Fernando Diaz

Categories: Computation and Language (cs.CL)

Keywords: generate culturally localized, culturally localized content, artificial intelligence, intelligence has increased, increased interest

Comments: Accepted to the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP) Co-located with ACL 2026, San Diego, USA (non-archival)

Abstract:The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization -- the use of cultural markers (e.g., names, locations) in a generic narrative -- or holistic localization -- the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.
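
The measurement idea (remove the tokens that distinguish nationalities, then compare what is left) can be sketched on toy data. Everything below is an assumption for illustration: the three stories are invented, "distinguishing token" is crudely approximated as "appears in only one nationality's story", and Jaccard overlap stands in for whatever similarity measure the paper uses.

```python
from itertools import combinations

# Invented toy stories that share one narrative template.
stories = {
    "Japan": "Yuki walked through Kyoto to find the lost kite before sunset",
    "Brazil": "Lucas walked through Recife to find the lost kite before sunset",
    "Kenya": "Amina walked through Nairobi to find the lost kite before sunset",
}

def tokens(text):
    return text.lower().split()

def marker_vocab(stories):
    """Crude proxy: a token is a cultural marker if only one nationality uses it."""
    seen = {}
    for nationality, text in stories.items():
        for tok in set(tokens(text)):
            seen.setdefault(tok, set()).add(nationality)
    return {tok for tok, nats in seen.items() if len(nats) == 1}

def residual(text, markers):
    """The narrative that remains once marker tokens are removed."""
    return [t for t in tokens(text) if t not in markers]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

markers = marker_vocab(stories)
residuals = {nat: residual(text, markers) for nat, text in stories.items()}
pairs = list(combinations(residuals, 2))
mean_sim = sum(jaccard(residuals[a], residuals[b]) for a, b in pairs) / len(pairs)
vocab = {t for text in stories.values() for t in tokens(text)}
marker_share = len(markers) / len(vocab)

print(sorted(markers), round(marker_share, 2), mean_sim)
```

A small marker share combined with near-identical residual narratives is the signature of templated localization described in the abstract; holistically localized stories would leave residuals that differ in plot and theme.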

11. 【2606.14600】LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

Link: https://arxiv.org/abs/2606.14600

Authors: Mateusz Winiarek, Maksymilian Bilski, Mateusz Jacniacki

Categories: Computation and Language (cs.CL)

Keywords: Online group chats, Online group, rarely stated explicitly, rarely stated, Online

Abstract:Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate a hidden local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Naive prompting remains limited for most models; explicit norm-aware prompting helps unevenly, with Gemini 3.1 Pro reaching $84.2\%$ and Claude Fable 5 reaching $81.6\%$, while several other models show small gains or regressions. LoSoNA contributes to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and use them in a one-turn group-chat response.

12. 【2606.14580】Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

Link: https://arxiv.org/abs/2606.14580

Authors: Liancheng Gong, Zhiyang Wang, Yiwei Xu, Julia Mendelsohn

Categories: Computation and Language (cs.CL)

Keywords: Identifying persuasive rhetorical, detecting information manipulation, Identifying persuasive, persuasive rhetorical cues, advancing public health

Abstract:Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

13. 【2606.14574】SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

Link: https://arxiv.org/abs/2606.14574

Authors: Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, household environments, increasingly deployed, autonomous agents

Abstract:Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
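
The difference between immediate and latent failures is easy to see in a toy symbolic executor. The sketch below is not SIMMER's world model: the four actions, the state variables, and the single hazard rule ("stove left on") are invented here purely to illustrate how a state machine can flag a plan that runs to completion yet leaves the world unsafe.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    pre: dict   # state values required before the action
    eff: dict   # state values set by the action

# Hypothetical miniature kitchen domain.
ACTIONS = {
    "place_pan": Action("place_pan", {}, {"pan_on_stove": True}),
    "turn_on_stove": Action("turn_on_stove", {"stove_on": False}, {"stove_on": True}),
    "fry_egg": Action("fry_egg", {"stove_on": True, "pan_on_stove": True}, {"egg_cooked": True}),
    "turn_off_stove": Action("turn_off_stove", {"stove_on": True}, {"stove_on": False}),
}

def execute(plan, goal):
    """Run a plan step by step and report immediate and latent failures."""
    state = {"stove_on": False, "pan_on_stove": False, "egg_cooked": False}
    report = {"immediate": [], "latent": [], "goal_met": False}
    for step, name in enumerate(plan):
        action = ACTIONS[name]
        unmet = [k for k, v in action.pre.items() if state.get(k) != v]
        if unmet:
            # Immediate failure: the violated precondition halts execution.
            report["immediate"].append((step, name, unmet))
            break
        state.update(action.eff)
    # Latent hazard: every step executed, but the final state is unsafe.
    if not report["immediate"] and state["stove_on"]:
        report["latent"].append("stove left on")
    report["goal_met"] = all(state.get(k) == v for k, v in goal.items())
    return report

goal = {"egg_cooked": True}
clean = execute(["place_pan", "turn_on_stove", "fry_egg", "turn_off_stove"], goal)
hazard = execute(["place_pan", "turn_on_stove", "fry_egg"], goal)
broken = execute(["fry_egg"], goal)
print(clean, hazard, broken, sep="\n")
```

The second plan is the interesting case: a success-only check would pass it, because the egg is cooked and no step failed, while the state-based check still reports a hazard.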

14. 【2606.14528】BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

Link: https://arxiv.org/abs/2606.14528

Authors: Qingkai Fang, Shoutao Guo, Yang Feng

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: next-generation spoken chatbots, handle natural phenomena, Voice Activity Detection, full-duplex speech interaction, external Voice Activity

Comments: Code: [this https URL](https://github.com/BayLing-Models/BayLing-Duplex)

Abstract:Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while improving the speech-response score from 2.17 to 3.39 over Moshi. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.

15. 【2606.14516】Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Link: https://arxiv.org/abs/2606.14516

Authors: Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: understanding progress, testing and understanding, evaluation, schema, results

Abstract:AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

16. 【2606.14512】Fodor and Pylyshyn's Systematicity Challenge Still Stands

Link: https://arxiv.org/abs/2606.14512

Authors: Michael Goodale, Salvador Mascarenhas

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: caused significant stir, networks producing human-like, producing human-like language, neural networks, cognitive science

Comments: Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

Abstract:The recent successes of neural networks producing human-like language have caused significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which argues that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

17. 【2606.14470】GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Link: https://arxiv.org/abs/2606.14470

Authors: Pavan C Shekar, Abhishek H S, Aswanth Krishnan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: pruned search branches, search branches leave, Large language model, Large language, context window

Comments: 10 pages, 1 figure, 9 tables

Abstract:Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity ~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
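
The copyability threshold can be pictured as a gate on retrieval: a stored case is reused only when it is nearly identical to the current problem. The sketch below takes the 0.8 figure from the abstract but assumes everything else, including the bag-of-words cosine similarity, the `retrieve` helper, and the two stored cases, none of which come from the paper.

```python
import math
from collections import Counter

COPYABILITY_THRESHOLD = 0.8  # figure from the abstract; the metric below is an assumption

def cosine(a, b):
    """Bag-of-words cosine similarity between two problem statements."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def retrieve(problem, memory):
    """Return (stored answer or None, similarity of the best-matching case)."""
    best = max(memory, key=lambda case: cosine(problem, case["problem"]), default=None)
    if best is None:
        return None, 0.0
    sim = cosine(problem, best["problem"])
    return (best["answer"] if sim >= COPYABILITY_THRESHOLD else None), sim

# Hypothetical memory of previously solved problems.
memory = [
    {"problem": "sum of the first 10 positive integers", "answer": "55"},
    {"problem": "area of a circle with radius 2", "answer": "4*pi"},
]

hit, hit_sim = retrieve("the sum of the first 10 positive integers", memory)
miss, miss_sim = retrieve("sum of the interior angles of a hexagon", memory)
print(hit, round(hit_sim, 3))
print(miss, round(miss_sim, 3))
```

The near-duplicate query clears the gate and copies the stored answer, while the merely related query falls below it and gets nothing, which mirrors the abstract's point that the gain is answer retrieval rather than method transfer.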

ACM classes: I.2.7; I.2.6; D.2.7


18. 【2606.14460】A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

Link: https://arxiv.org/abs/2606.14460

Authors: Kehinde Temitayo Soetan

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: decision support pipelines, remain empirically underspecified, Transformer-based clinical language, demographic associations encoded, distributions remain empirically

Comments: 17 pages, 4 tables, appendices A-E, preprint

Abstract: Transformer-based clinical language models are increasingly integrated into high-stakes clinical decision support pipelines, yet the computational mechanisms through which demographic associations encoded in medical documentation propagate into model probability distributions remain empirically underspecified. We present a systematic computational audit of representational bias in ClinicalBERT (Alsentzer et al., 2019), a BERT-based model pretrained on MIMIC-III discharge summaries, employing two complementary probing methodologies: Log Probability Bias Analysis (LPBA), which quantifies demographic descriptor-induced shifts in masked token probability distributions across behavioral and evaluative semantic categories, and Masked Language Model-based analysis (MLM), which probes internal representational structure for demographic agency attribution encoding across 98 real clinical sentence templates and eight intersectional race-gender combinations. Corpus frequency analysis operationalizes the distinction between statistical disparity and bias amplification by benchmarking model outputs against empirical term frequencies in the MIMIC-III training corpus. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing, providing direct empirical evidence that representational bias in ClinicalBERT operates predominantly through model-internal amplification rather than training data inheritance. Keywords: natural language processing, clinical documentation, algorithmic auditing, representational bias, health equity.

19. 【2606.14459】MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

Link: https://arxiv.org/abs/2606.14459

Authors: Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: Modern Automatic Speech, Automatic Speech Recognition, Modern Automatic, made remarkable progress, Speech Recognition

Comments: Accepted at Interspeech 2026

Abstract:Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.

20. 【2606.14420】Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake

Link: https://arxiv.org/abs/2606.14420

Authors: Şevval Çakıcı

Subjects: Computation and Language (cs.CL)

Keywords: real time, people cope, cope when disaster, disaster strikes, million Turkish-language tweets

Comments: 20 pages, 5 figures, 3 tables. To be submitted to Social Science Computer Review

Abstract:How do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman's (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.

21. 【2606.14391】Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

Link: https://arxiv.org/abs/2606.14391

Authors: Henri-Leon Kordt, Theresa Pekarek Rosin, Jae Hee Lee, Stefan Wermter

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: Automatic Speech Recognition, large-scale Automatic Speech, disfluent speech remains, speech remains challenging, Speech Recognition

Comments: Accepted at Interspeech 2026

Abstract:Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

22. 【2606.14368】Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Link: https://arxiv.org/abs/2606.14368

Authors: Woohyeon Byeon, Jiwon Jeon, Jeonghye Kim, Youngchul Sung

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: study multi-domain LLM, multi-domain LLM training, multi-domain LLM, LLM training, co-evolve by tutoring

Abstract: We study multi-domain LLM training in which two models, each stronger in a different domain, co-evolve by tutoring each other through on-policy feedback. Unlike one-way distillation or single-model fine-tuning, our goal is mutual Pareto improvement: each model improves across domains without losing its original strength. To this end, we propose On-Policy Co-Distillation (OPCoD), where each student's self-distillation is conditioned on its own correct rollout and feedback from its peer. To make feedback exchange effective, OPCoD uses cognizance-based gating to decide when to give feedback and feedback anchoring to ground feedback in the problem. On Science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students.

23. 【2606.14325】Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

Link: https://arxiv.org/abs/2606.14325

Authors: Francesco Cazzaro, Jessica Lennon, Ariadna Quattoni

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Property Graphs, heterogeneous data sources, Graphs are rapidly, representing heterogeneous data, rapidly being adopted

Abstract:Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained in them we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that with our synthetic data generation approach we can significantly increase the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings in which models must be locally deployed we can ensure data-sovereignty without sacrificing accuracy and without costly annotation campaigns.

24. 【2606.14302】Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Link: https://arxiv.org/abs/2606.14302

Authors: Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

Subjects: Computation and Language (cs.CL)

Keywords: hinders long-horizon scaling, reinforcement learning optimize, lack metacognitive awareness, LLM-based agents trained, learning optimize step-wise

Abstract: LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family's performance, with up to 12% absolute success rate gains.

25. 【2606.14278】Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

Link: https://arxiv.org/abs/2606.14278

Authors: Shaojie Yin

Subjects: Computation and Language (cs.CL)

Keywords: open-ended instruction-following evaluation, Large language models, Large language, instruction-following evaluation, open-ended instruction-following

Abstract: Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7–14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
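A preference-flip rate of the kind reported above can be computed as the share of items whose preferred response changes between the English presentation and a transformed one. The sketch below is only an illustration of that arithmetic; the data layout (item ID mapped to the preferred label) and the function name are assumptions, not the paper's code.

```python
from typing import Dict


def flip_rate(reference: Dict[str, str], variant: Dict[str, str]) -> float:
    """Fraction of shared items whose preferred response differs between two presentations."""
    shared = [item for item in reference if item in variant]
    if not shared:
        raise ValueError("no items in common")
    flips = sum(1 for item in shared if reference[item] != variant[item])
    return flips / len(shared)


# Toy judgments: which response ("A" or "B") the judge preferred for each item.
english = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
chinese = {"q1": "A", "q2": "A", "q3": "A", "q4": "B"}

print(f"{flip_rate(english, chinese):.1%}")  # prints "25.0%" (one flip out of four items)
```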

26. 【2606.14269】ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

Link: https://arxiv.org/abs/2606.14269

Authors: Karamvir Singh, Arvind Jain

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Fixed-cardinality retrieval injects, constant top-K chunks, causing over-retrieval, Fixed-cardinality retrieval, injects a constant

Comments: 20 pages, 6 figures, 14 tables

Abstract: Fixed-cardinality retrieval injects a constant top-K chunks into the generator regardless of query complexity, causing over-retrieval for narrow queries and under-retrieval for compositional ones. We describe ScoreGate, a lightweight score-space decision mechanism that controls retrieval cardinality at inference time using two scores already produced by the standard pipeline: bi-encoder similarity s_i and cross-encoder reranker score r_i, with no additional model inference calls required. Its core insight is that cross-encoder affirmation can rescue semantically relevant chunks that bi-encoder retrieval ranks poorly due to vocabulary mismatch, a failure mode unaddressed by fixed-K or single-score thresholding. On MS MARCO (200 dev queries), ScoreGate achieves MRR@10 = 0.401 with 35% fewer retained chunks than Standard Top-K. On an internal benchmark (n=300, Fleiss' kappa=0.87), ScoreGate observed zero false positives (95% CI [96.4%, 100%]) at 97.77–99.34% recall, with 34.8% fewer tokens per query and only 31ms added latency. Results on both MS MARCO and real-world production traffic suggest that adaptive retrieval cardinality can improve retrieval efficiency without degrading retrieval quality.

27. 【2606.14257】The Linguistics Olympiads: Towards a New Corpus for Linguistics Research?

Link: https://arxiv.org/abs/2606.14257

Authors: Vlad A. Neacsu

Subjects: Computation and Language (cs.CL)

Keywords: International Linguistics Olympiad, self-sufficient puzzles consisting, scaled-down corpus representative, Linguistics olympiad, Linguistics

Comments: Accepted for publication in LingBaW. Linguistics Beyond and Within (Volume 12, 2026)

Abstract:Linguistics olympiad problems (LOPs) are a category of self-sufficient puzzles consisting of a scaled-down corpus representative of certain linguistic phenomena, from which the solver must deduce a primitive set of rules of the language and then translate a new set of elements. The linguistics olympiads (LOs) have become a worldwide phenomenon with 43 different territories taking part in the International Linguistics Olympiad (IOL) 2025. While the typology and solving strategies of LOPs have been analysed, their scientific facet and connections to academic linguistics have yet to be explored. LOPs are directly connected to many linguistic fields, e.g., linguistic typology, linguistic relativity, and linguistics fieldwork. Recently, LOPs have become a research focus as benchmarks for large language models, thus highlighting their usefulness in computational linguistics. Nevertheless, they have not yet been integrated into mainstream linguistics research. This paper attempts to open new directions of including this particular type of puzzle in academic research by offering a structured evaluation of LOPs as linguistic data sources and proposes criteria for their responsible use in academic research. Starting from a set of over 1800 LOPs, this study critically examines the potential of LOPs as a novel corpus for linguistics research by discussing their strengths and limitations as tools, as well as the areas of linguistics into which these problems could fit. This work forms the foundation for a broader initiative aimed at bridging the gap between LOs and academic linguistics, by establishing a robust theoretical framework for LOPs.

28. 【2606.14243】Decoupled Mixture-of-Experts for Parametric Knowledge Injection

Link: https://arxiv.org/abs/2606.14243

Authors: Baoqing Yue, Weihang Su, Qingyao Ai, Yichen Tang, Changyue Wang, Jiacheng Kang, Jingtao Zhan, Yiqun Liu

Subjects: Computation and Language (cs.CL)

Keywords: equip large language, large language models, Knowledge injection aims, aims to equip, equip large

Abstract:Knowledge injection aims to equip large language models (LLMs) with external, domain-specific, or time-sensitive knowledge. Existing approaches typically face a trade-off between flexibility and integration: retrieval-augmented generation keeps knowledge outside the model but only provides prompt-level augmentation, whereas post-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates. In this paper, we propose Decoupled Mixture-of-Experts (DMoE), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation. To support efficient auto-regressive inference, DMoE attaches experts only to the final-layer feed-forward network, preserving KV-cache reuse while enabling parameter-level knowledge augmentation. Experiments on knowledge-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter-based baselines.

29. 【2606.14230】A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

Link: https://arxiv.org/abs/2606.14230

Authors: Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Generative Adversarial Networks, artificially generated images, threaten privacy, information integrity, videos that threaten

Abstract: Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on Generative Adversarial Networks (GANs)-based generated deepfakes, they often struggle with recent diffusion model-generated images. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that the SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, the SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.

30. 【2606.14209】Detecting undisclosed LLM-generated content in parliamentary texts

Link: https://arxiv.org/abs/2606.14209

Authors: Minerva Suvanto, Andrea McGlinchey, Peter J. Barclay, Mattias Wahde

Subjects: Computation and Language (cs.CL)

Keywords: Kingdom and Sweden, United Kingdom, evaluate the extent, undisclosed LLM-generated content, Sweden

Abstract:In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

31. 【2606.14199】OdysSim: Building Foundation Models for Human Behavior Simulation

Link: https://arxiv.org/abs/2606.14199

Authors: Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu, Yiming Yang, Maarten Sap

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, Large language, increasingly deployed, simulators for interactive, interactive evaluation

Comments: 34 pages. Code: [this https URL](https://github.com/sunnweiwei/OdysSim); Models and data: [this https URL](https://huggingface.co/collections/cmu-lti/odyssim)

Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on τ-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.

32. 【2606.14179】CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Link: https://arxiv.org/abs/2606.14179

Authors: Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim

Subjects: Computation and Language (cs.CL)

Keywords: multi-step tool-calling tasks, percent process accuracy, times less compute, process accuracy, accuracy on multi-step

Abstract:We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.

33. 【2606.14155】Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

Link: https://arxiv.org/abs/2606.14155

Authors: Tan Zhu, Tong Yao, Kananart Kuwaranancharoen, Amit Singh, Yushang Lai, Deepa Mohan, Shankara Bhargava

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: modifying model weights, iteratively revising tunable, Context adaptation automates, task feedback, model weights

Abstract: Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: existing methods suffer from inaccurate credit assignment and lack convergence guarantees. We propose Graph-based Target Back-Propagation (GTBP), a context adaptation framework for agentic workflows modeled as directed acyclic graphs. GTBP propagates local target outputs backward through the workflow graph and uses target-output discrepancies to guide a stage-wise prompt update mechanism. Theoretically, we show that GTBP's stage-wise prompt updates become stable over iterations, and that a sufficiently capable LLM optimizer can decrease the overall objective. Empirically, GTBP consistently outperforms strong baselines across three benchmarks while maintaining comparable computational cost.

34. 【2606.14150】Small LLMs: Pruning vs. Training from Scratch

Link: https://arxiv.org/abs/2606.14150

Authors: Yufeng Xu, Taiming Lu, Kunjun Li, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: strong small language, small language models, training token budget, token budget, Pruning

Comments: Our code is available at [this https URL](https://github.com/zlab-princeton/llm-pruning-collection)

Abstract: Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5–0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

35. 【2606.14145】Personal Care Utility: Health as Everyday Infrastructure

Link: https://arxiv.org/abs/2606.14145

Authors: Mahyar Abbasian, Elahe Khatibi, Saba A. Farahani, Nitish Nagesh, Arshia Ilaty, Hooman Sajjadi, Amir Rahmani, Ramesh Jain

Subjects: Computation and Language (cs.CL)

Keywords: Healthcare is essential, episodic by design, year a person, person spends, Healthcare

Comments: 12 pages, 2 figures, 3 tables

Abstract:Healthcare is essential, expert, and episodic by design - built around the roughly one hour per year a person spends with a clinician. The 8,759 hours outside clinical settings, where eating, sleeping, movement, medication, and stress actually shape long-term health, have no comparable infrastructure. The bottleneck for personalized health is not raw data or reasoning capability; it is the absence of that infrastructure layer. This paper introduces the Personal Care Utility (PCU): a layered, event-driven architecture proposed as the missing utility for everyday health, in the way that payments, networks, and power are utilities for their domains. PCU organizes continuous personal signals into semantically meaningful life events through a Personicle, estimates dynamic health state against personal baselines, reasons about cause and context, and routes guidance through an orchestrator that separates clinical decision logic, behavioral strategy selection, and natural-language expression. This separation lets large language models support reasoning and communication while keeping safety-critical clinical decisions grounded in validated evidence. We instantiate PCU for Type 2 Diabetes - turning CGM, meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A day-in-the-life scenario shows the same infrastructure producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk. We close with how PCU generalizes to other chronic conditions and the governance questions any always-on personal health utility must address. The result is a blueprint that treats personalization not as a final messaging layer, but as an architectural property of everyday health guidance.

36. 【2606.14142】Implicit Reasoning for Large Language Model-based Generative Recommendation

Link: https://arxiv.org/abs/2606.14142

Authors: Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah, Donald Loveland

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, Generative Recommendation, pretrained world knowledge

Abstract:Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

37. 【2606.14141】Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

链接：https://arxiv.org/abs/2606.14141

作者：Oh Hyun-Bin,Kazuki Shimada,Yuhta Takida,Kim Sung-Bin,Toshimitsu Uesaka,Takashi Shibuya,Kyeongyoon Lee,Tae-Hyun Oh,Yuki Mitsufuji

类目：Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：global event content, current audio-language models, current audio-language, reason about clips, clips as global

备注：

点击查看摘要

Abstract:Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.

38. 【2606.14127】CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

链接：https://arxiv.org/abs/2606.14127

作者：Yilin Wen,Rong Yang,Xiaojia Chang,Hong Sun,Gefu Tang,Chunhui Liu,Jeffrey Chen,Zeyu Ma,Lisong Qiu,Xiaochuan Fan,Congjia Yu,Quan Zhou,Yuheng Chen,Zian Wang

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：support continuous redeployment, LLM-based query rewriters, training procedure, LLM-based query, face a tension

备注： 12 pages, 3 figures

点击查看摘要

Abstract:LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), such a system, redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, bounding rewriter-failure blast radius. Online A/B from two sequential production launches, first deploying the rewriter at finerank, then extending consumption to recall and rawrank, delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.

39. 【2606.14122】Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

链接：https://arxiv.org/abs/2606.14122

作者：Sangwhan Moon,Daisuke Oba,Youmi Ma,Tatsuya Hiraoka,Naoaki Okazaki

类目：Computation and Language (cs.CL)

关键词：Byte-level tokenization enables, Unicode input, Byte-level tokenization, handle any Unicode, tokenization enables language

备注：

点击查看摘要

Abstract:Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
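
To make "UTF-8 structural validity" concrete, here is a minimal sketch of a validity check on generated byte sequences. It is not the paper's evaluation protocol; the function names `is_valid_utf8` and `validity_rate` are hypothetical, and the check simply relies on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def validity_rate(samples: list) -> float:
    """Fraction of byte samples that are structurally valid UTF-8."""
    if not samples:
        return 0.0
    return sum(is_valid_utf8(s) for s in samples) / len(samples)
```

A truncated multi-byte character such as `b"\xe6\x97"` (the first two bytes of a three-byte sequence) fails this check, which is the kind of error a byte-level model can produce while still assigning reasonable likelihoods.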

40. 【2606.14113】Simulating Students' Java Programming Errors with Large Language Models

链接：https://arxiv.org/abs/2606.14113

作者：Ali Keramati,Jie Cao,Iman Mohammadi,Mark Warschauer,Yang Shi

类目：Software Engineering (cs.SE); Computation and Language (cs.CL)

关键词：extensive classroom deployment, Understanding student errors, newly designed task, designed task remains, task remains slow

备注：

点击查看摘要

Abstract:Understanding student errors in programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (agreement with authentic student mistakes), and examine how these vary by struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.

41. 【2606.14072】Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

链接：https://arxiv.org/abs/2606.14072

作者：Wentao Ke,Jianche Liu

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：Accurate pediatric brain, heterogeneous imaging phenotypes, limited annotated data, remains challenging due, diffuse tumor boundaries

备注：

点击查看摘要

Abstract:Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.

42. 【2606.14068】Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

链接：https://arxiv.org/abs/2606.14068

作者：Guangzong Si,Dong Wang,Zhenhao Li,Yifan Yu,Panwang Pan,Wentao Zhu

类目：Computation and Language (cs.CL)

关键词：Existing studies, explicit harmful outputs, occupational associations, focused on stereotypes, harmful outputs

备注： under review

点击查看摘要

Abstract:Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions. We introduce GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. It constructs gender-neutral misconduct templates through controlled grids and cross-model review, then compiles them into paired first-person prompts with matched actor-gender and role-reference variations. We further design a structured response-framing protocol to measure how models allocate punishment, empathy, escalation, instruction, and blame. Experiments on 10 representative LLMs reveal a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, whereas female actors receive more therapeutic and empathy-oriented framing for the same misconduct. Further analyses show that this pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning. The official code is available at this https URL.

43. 【2606.14060】Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

链接：https://arxiv.org/abs/2606.14060

作者：Aleem Khan,Nicholas Andrews

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：targeted style transfer, style transfer sharply, transfer sharply degrade, Adversarial conditions, machine text detectors

备注：

点击查看摘要

Abstract:Adversarial conditions such as paraphrasing and targeted style transfer sharply degrade the accuracy of machine text detectors. A document, however, carries multiple complementary signals (e.g., stylistic features, likelihood and rank-order features, and structural features), and an attack that suppresses one may leave others intact. While a parametric classifier can learn to combine these features given sufficient supervision, classifiers are prone to making confidently incorrect predictions when the distribution shifts (e.g., novel attacks or unseen language models). To address this, we propose a multi-view, non-parametric detection framework that extracts complementary feature views from the same document and aggregates per-view evidence through a Gaussian process ensemble. By aggregating evidence across views, an adversary must simultaneously defeat multiple independent axes of detection, substantially raising the cost of evasion. The Gaussian process formulation additionally provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting reliable deployment in high-stakes settings. We evaluate on three benchmarks spanning diverse generators and attacks (the DetectRL and RAID benchmarks, and the PAN2025 shared task) and demonstrate that our multi-view detector maintains strong performance under the considered attacks, outperforming existing approaches against held-out attacks.

44. 【2606.14047】Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

链接：https://arxiv.org/abs/2606.14047

作者：Ghadir Alselwi,Basem Suleiman,Hao Xue,Shoaib Jameel,Hakim Hacid,Flora D. Salim,Imran Razzak

类目：Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：maintaining coherent understanding, Long-context language modeling, extending context windows, Long-context language, windows but maintaining

备注：

点击查看摘要

Abstract:Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens -- a challenge that semantic similarity alone cannot address. KGERMAR addresses this by constructing dynamic, context-specific knowledge graphs from input text during inference, enabling domain-adaptive retrieval that leverages both semantic similarity and explicit entity relationships. The framework performs real-time entity and relation extraction to build contextual knowledge graphs, then integrates graph-structural embeddings with textual semantics through a multi-component memory architecture. Three memory banks -- contextual, semantic, and structural -- are maintained with retrieval signals fused via learned weights to capture both surface-level semantics and deeper relational patterns. Evaluated on SlimPajama (84.7K training examples), WikiText-103 (4,358 examples), PG-19 (100 examples), and Proof-pile (46.3K examples), KGERMAR achieves up to 8.5% lower perplexity and 2-2.5x better memory efficiency than memory-augmented baselines across context lengths from 1K to 32K tokens, with superior in-context learning performance across five NLU tasks. The dynamic knowledge graph construction approach advances memory-augmented language modeling by enabling domain-specific knowledge representation that adapts to input contexts rather than relying on fixed knowledge bases.

45. 【2606.14037】Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

链接：https://arxiv.org/abs/2606.14037

作者：Jihye Kim,Jeffrey Flanigan

类目：Computation and Language (cs.CL)

关键词：critical alignment property, integrated roles, user pushback, compliance, alignment property

备注：

点击查看摘要

Abstract:As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
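
The ratio A = BCR/HCR can be illustrated with a short sketch. This assumes BCR and HCR denote the rates of beneficial output change under helpful nudges and harmful output change under misleading nudges, which the abstract implies but does not spell out; the function name and the counts in the usage note are hypothetical, not the paper's data.

```python
def compliance_asymmetry(beneficial_changes: int, helpful_trials: int,
                         harmful_changes: int, misleading_trials: int) -> float:
    """Compute A = BCR / HCR from raw counts (assumed definitions, see lead-in)."""
    if helpful_trials <= 0 or misleading_trials <= 0:
        raise ValueError("trial counts must be positive")
    bcr = beneficial_changes / helpful_trials    # beneficial change rate
    hcr = harmful_changes / misleading_trials    # harmful change rate
    if hcr == 0:
        raise ValueError("HCR is zero; the ratio is undefined")
    return bcr / hcr
```

With made-up counts, `compliance_asymmetry(30, 100, 20, 100)` gives a value of about 1.5, meaning helpful nudges are followed more often than harmful ones. A value near 1, as the abstract reports for moral questions, means the model follows both directions at about the same rate.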

46. 【2606.14030】Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

链接：https://arxiv.org/abs/2606.14030

作者：Rishit Chatterjee,Tahiya Chowdhury

类目：Sound (cs.SD); Computation and Language (cs.CL)

关键词：hardware requires smaller, resource-constrained hardware requires, Streaming speaker diarization, time-critical medical dispatch, medical dispatch

备注： 6 pages, 3 figures, preprint

点击查看摘要

Abstract:Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streaming behavior before compressing the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Our study shows that model compression trades performance for memory footprint, and we highlight an operating point where FP16 reduces model size by half with essentially unchanged real-time factor, at a cost of a 40% relative DER increase against the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that can enable reliable human communication in time-critical contexts.

47. 【2606.14027】Same-Origin Policy for Agentic Browsers

链接：https://arxiv.org/abs/2606.14027

作者：Xilong Wang,Xiaoxing Chen,Patrick Li,Dawn Song,Neil Gong

类目：Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)

关键词：accomplish web tasks, browsers integrate autonomous, Agentic browsers, SOP, accomplish web

备注：

点击查看摘要

Abstract:Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at this https URL.
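
For readers unfamiliar with the classic rule, two URLs share an origin when their scheme, host, and port all match. The sketch below shows only that comparison; it is not SOPGuard or SOPBench, the helper names `origin` and `same_origin` are hypothetical, and the default-port table covers only `http` and `https`.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Assumption: only the two common web schemes get a default port here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that defines a web origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """True if both URLs have an identical scheme, host, and port."""
    return origin(url_a) == origin(url_b)
```

Under this rule `https://example.com/a` and `https://example.com:443/b` are same-origin, while `https://example.com` and `https://sub.example.com` are not. The abstract's observation is that an agent reading one page and acting on another can move data across that boundary without any script doing so.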

48. 【2606.13995】Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

链接：https://arxiv.org/abs/2606.13995

作者：Brendan King,Jeffrey Flanigan

类目：Computation and Language (cs.CL)

关键词：rapidly transformed software, powering widely, interactive coding assistants, transformed software engineering, rapidly transformed

备注： 22 pages, 13 figures

点击查看摘要

Abstract:AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.

49. 【2606.13993】The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

链接：https://arxiv.org/abs/2606.13993

作者：Zachary Nicholas Houghton,Yu Zhou,Dan Pluth,Vijay K. Gurbani

类目：Computation and Language (cs.CL)

关键词：applying productive rules, retrieve learned representations, productive rules, crucial aspect, aspect of linguistic

备注：

点击查看摘要

Abstract:A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules. While recent work has examined abstract knowledge in language models, holistic storage of multi-word units has received far less attention. We probe internal representations in text-based LLMs and an ASR model, testing whether V+up phrasal verbs develop distinct representations as a function of frequency and predictability. All models show evidence of holistic storage driven by frequency and predictability, further supporting usage-based theories of language.

50. 【2606.13991】Fusing Stylometric and Embedding Systems to Estimate Authorship Likelihood Ratios in Japanese

链接：https://arxiv.org/abs/2606.13991

作者：Praju Ghatpande,Satoru Tsuge,Shunichi Ishihara,Wataru Zaitsu,Mitsuyuki Inaba

类目：Computation and Language (cs.CL)

关键词：legally sound basis, likelihood ratio, likelihood ratio magnitudes, likelihood ratio framework, textual evidence

备注：

点击查看摘要

Abstract:The likelihood ratio framework is widely recognized as the logically and legally sound basis for evidential analysis across forensic sciences, and its importance is increasingly acknowledged in analyses of authorship in textual evidence. To date, however, its application has been confined to English-language texts. Meanwhile, authorship attribution has traditionally relied on a diverse array of stylometric features, even as the rise of pre-trained large language models enables new contextual-embedding approaches. Combining these diverse approaches through fusion promises enhanced performance, yet it has not been applied to integrate stylometric-feature systems with embedding-based systems within the likelihood ratio paradigm. This study is the first to apply likelihood ratio-based forensic text comparison to Japanese digital texts, using ~1,000-character excerpts from blogs, to 1) evaluate system performance and likelihood ratio magnitudes and 2) assess the impact of fusing stylometric-feature systems with embedding-based systems. The results demonstrate that the fused system maintains excellent calibration while 1) increasing consistent-with-fact likelihood ratio magnitudes; 2) decreasing contrary-to-fact likelihood ratio magnitudes; and 3) improving overall discriminability. The best-performing fusion achieved a log-likelihood-ratio cost of 0.32484, illustrating both the feasibility of the likelihood ratio framework for Japanese and the benefits of fusion across heterogeneous systems.
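
The log-likelihood-ratio cost reported above is commonly computed with the standard Cllr definition: the average of a penalty on same-author likelihood ratios and a penalty on different-author ones. The sketch below assumes the paper uses this standard form; the function name `cllr` and the example values are illustrative only.

```python
import math
from typing import Sequence


def cllr(same_author_lrs: Sequence[float], diff_author_lrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost (standard definition, assumed here).

    Same-author comparisons are penalized for small LRs, different-author
    comparisons for large LRs. Lower is better; a system that always
    outputs LR = 1 scores exactly 1.0.
    """
    if not same_author_lrs or not diff_author_lrs:
        raise ValueError("both LR lists must be non-empty")
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs) / len(diff_author_lrs)
    return 0.5 * (same_term + diff_term)
```

An uninformative system, `cllr([1.0, 1.0], [1.0])`, scores 1.0, so values well below 1 (such as the 0.32484 in the abstract) indicate that the likelihood ratios carry useful, reasonably calibrated evidence.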

51. 【2606.13977】Creative Integration: A Decidable Criterion of Creativity

链接：https://arxiv.org/abs/2606.13977

作者：Yoshinori Nomura

类目：Computation and Language (cs.CL)

关键词：

备注： 18 pages, 1 figure

点击查看摘要

Abstract:"Integrative" solutions are widely praised but rarely defined: we lack an operational way to tell a genuine integration -- one that makes the world cheaper to describe -- from a tidy re-description. Building on the lineage that treats creativity and intelligence as compression, we give such a criterion for creative integration (CI): the resolution of a real conflict between A and B is CI if and only if, under a fixed description language, the description length strictly shrinks (C = L_pre/L_post > 1), with the reduction located in the conflict itself. We make the judgment decidable through four binary, conjunctive gates, and we fix its extension through a taxonomy of pseudo-integration that names and rejects the look-alikes. We back the criterion with a curated, multi-domain corpus and -- crucially -- validate it not by human inter-rater agreement but by four falsifiable tests it could fail: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness; all pass with margin. The contribution is not "creativity is compression" but its decidability, discrimination, and corpus: on this account, what makes a move genuinely creative -- rather than merely novel -- is that it compresses a conflict, with novelty and value as downstream symptoms; whether all creativity is so constituted we state as an explicit conjecture. We claim only the sign of C-1; we judge, not generate. The result is a citable primitive for a broader program.

信息检索

1. 【2606.14557】Private Information Retrieval for Large-Scale DNA-Based Data Storage

链接：https://arxiv.org/abs/2606.14557

作者：Gökberk Erdoğan,Daniella Bar-Lev,Rawad Bitar,Antonia Wachter-Zeh,Zohar Yakhini

类目：Information Retrieval (cs.IR)

关键词：Private Information Retrieval, investigate Private Information, Information Retrieval, Private Information, investigate Private

备注： 9 pages, 6 figures

点击查看摘要

Abstract:We investigate Private Information Retrieval (PIR) in the context of synthetic DNA-based data storage. While PIR is a well-studied primitive for digital databases, extending it to DNA-based databases presents unique challenges arising from biochemical query mechanisms and their complexity. We propose two approaches for adapting two-server PIR protocols to DNA-based storage, balancing privacy, efficiency, and feasibility. These approaches illustrate how information-theoretic privacy trade-offs manifest in DNA-based storage systems.

2. 【2606.14474】Verifiable User Simulation for Search and Recommendation Systems

链接：https://arxiv.org/abs/2606.14474

作者：Chenglong Ma,Xinye Wanyan,Danula Hettiachchi,Ziqi Xu,Yongli Ren,Jeffrey Chan

类目：Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

关键词：evaluating search engines, retrieval-augmented generation pipelines, intended user profile, simulators remain opaque, simulated user made

备注： Presented as a half-day tutorial at SIGIR 2026, 4 pages

点击查看摘要

Abstract:Large-language-model (LLM) based user simulation is increasingly adopted for evaluating search engines, recommender systems, and retrieval-augmented generation pipelines, yet most simulators remain opaque: it is difficult to determine why a simulated user made a particular choice or whether that choice is consistent with the intended user profile. Compounding this, recent research shows that LLMs can produce biased or discriminatory responses depending on user background characteristics such as language, education level, and cultural context, raising concerns about the equitable treatment of minority and disadvantaged groups. This half-day, in-person tutorial introduces a proposed design-and-audit framework that treats a user simulator as a verifiable engineering artefact composed of seven auditable components - structured Persona, task-aware Contract, matched human-vs-agent Execution, auditable Trace, persona-aligned Verification, structured Feedback, and a Refinement loop that updates personas and contracts. Through two hands-on mini-labs on recommendation-list evaluation and search-query formulation, participants will inspect simulator behaviour end-to-end, distinguish diagnostic discrepancy analysis from statistical validation, and apply checks for fidelity, credibility, and demographic bias. The tutorial targets information retrieval and recommender systems researchers and practitioners interested in user behaviour simulation and responsible AI.

3. 【2606.14269】ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

链接：https://arxiv.org/abs/2606.14269

作者：Karamvir Singh,Arvind Jain

类目：Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词：Fixed-cardinality retrieval injects, constant top-K chunks, causing over-retrieval, Fixed-cardinality retrieval, injects a constant

备注： 20 pages, 6 figures, 14 tables

点击查看摘要

4. 【2606.14260】ChronoID: Infusing Explicit Temporal Signals into Semantic IDs for Generative Recommendation

Link: https://arxiv.org/abs/2606.14260

Authors: Dongdong Nian, Dongqi Fu, Chenliang Xu, Yinglong Xia, Hong Li, Hong Yan, Jian Kang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Semantic IDs, fundamental limitation, Semantic, IDs are crucial, IDs

Comments:

Abstract:Semantic IDs are crucial in generative recommendation, but they have a fundamental limitation: temporal information is not well incorporated into them. Instead, time influences recommendation only implicitly (e.g., through session construction heuristics, preference alignment, or sequence order), while existing semantic ID learning remains entirely time-agnostic. This design conflates interactions occurring under distinct temporal contexts into identical semantic representations, implicitly assuming that item semantics and user intent are temporally stationary. Such an assumption is misaligned with real-world recommendation scenarios, where evolving interaction rhythms play a central role. In this work, we investigate where and how explicit time should be incorporated into semantic IDs for generative recommendation. First, we systematically characterize the design space along three orthogonal dimensions of temporal signals and present a unified framework, ChronoID, for time-aware semantic ID learning. Then, by contributing a new time-explicit generative recommendation benchmark, ChronoID answers three questions: what is an effective way of infusing time, how should the architecture be designed, and where does the gain come from.

5. 【2606.14127】CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

Link: https://arxiv.org/abs/2606.14127

Authors: Yilin Wen, Rong Yang, Xiaojia Chang, Hong Sun, Gefu Tang, Chunhui Liu, Jeffrey Chen, Zeyu Ma, Lisong Qiu, Xiaochuan Fan, Congjia Yu, Quan Zhou, Yuheng Chen, Zian Wang

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: support continuous redeployment, LLM-based query rewriters, training procedure, LLM-based query, face a tension

Comments: 12 pages, 3 figures

6. 【2606.14047】Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

Link: https://arxiv.org/abs/2606.14047

Authors: Ghadir Alselwi, Basem Suleiman, Hao Xue, Shoaib Jameel, Hakim Hacid, Flora D. Salim, Imran Razzak

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: maintaining coherent understanding, Long-context language modeling, extending context windows, Long-context language, windows but maintaining

Comments:

7. 【2606.14046】When Recommendation Denoising Meets Popularity Bias: Understanding and Mitigating Their Interaction

Link: https://arxiv.org/abs/2606.14046

Authors: Guohang Zeng, Jie Lu, Guangquan Zhang

Categories: Information Retrieval (cs.IR)

Keywords: dominant data source, Implicit feedback, false-positive interactions caused, biased exposure, caused by mis-clicks

Comments:

Abstract:Implicit feedback is the dominant data source for recommender systems, but behavioral logs are often contaminated by false-positive interactions caused by mis-clicks, biased exposure, and interface effects. Denoising recommendation methods improve robustness by down-weighting or filtering interactions suspected to be noisy, often relying on the small-loss heuristic. We revisit this heuristic through the lens of popularity bias. Tail-item positives can be harder to fit because they are sparsely observed, and thus may receive larger losses even when they reflect genuine user preference. Under such popularity-dependent loss patterns, monotone loss-based reweighting can suppress clean-but-hard tail signals and increase the head-tail imbalance in effective supervision. We formalize this interaction through the effective head-tail signal ratio induced by denoising weights and derive a conditional reallocation result: when the loss distribution of tail positives is right-shifted relative to that of head positives, small-loss reweighting increases the effective head-tail signal ratio compared with ERM. Motivated by this analysis, we propose Popularity-Aware Denoising (PAD), a lightweight plug-in framework that modulates denoising strength by item popularity. PAD applies stronger denoising to highly exposed items while being more conservative on tail items, preserving more clean-but-hard long-tail signals. Experiments on three datasets and three backbones show that PAD generally improves over representative denoising baselines and provides favorable accuracy-diversity tradeoffs, especially on MF-style recommenders.
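
The interaction described in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' PAD implementation: the exponential weight function, the linear popularity scaling, the helper names, and the four example interactions are all assumptions chosen for illustration. It only shows how a monotone small-loss weighting shifts the head-tail signal ratio away from the ERM value when tail losses are right-shifted, and how scaling the denoising strength by popularity counteracts that shift.

```python
import math


def signal_ratio(weights, is_head):
    """Ratio of total weight on head items to total weight on tail items."""
    head = sum(w for w, h in zip(weights, is_head) if h)
    tail = sum(w for w, h in zip(weights, is_head) if not h)
    return head / tail


def small_loss_weights(losses, strength=1.0):
    """Monotone small-loss reweighting: larger loss -> smaller weight (assumed form)."""
    return [math.exp(-strength * loss) for loss in losses]


def popularity_aware_weights(losses, popularity, max_strength=1.0):
    """Hypothetical popularity modulation: denoising strength grows with item popularity."""
    top = max(popularity)
    return [
        math.exp(-max_strength * (pop / top) * loss)
        for loss, pop in zip(losses, popularity)
    ]


# Toy data (assumed): two head items with small losses, two tail items with larger losses.
losses = [0.2, 0.3, 0.9, 1.1]
popularity = [1000, 800, 20, 10]
is_head = [True, True, False, False]

erm_ratio = signal_ratio([1.0] * len(losses), is_head)  # uniform weights
small_loss_ratio = signal_ratio(small_loss_weights(losses), is_head)
pad_like_ratio = signal_ratio(popularity_aware_weights(losses, popularity), is_head)

print(erm_ratio, small_loss_ratio, pad_like_ratio)
```

On this toy data the small-loss ratio exceeds the ERM ratio, while the popularity-scaled variant keeps more weight on the tail items. The exact numbers depend entirely on the assumed weight functions.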

8. 【2606.13905】ADORE: Iterative Query Expansion with Retrieval-Grounded Relevance Feedback

Link: https://arxiv.org/abs/2606.13905

Authors: Amin Bigdeli, Negar Arabzadeh, Radin Hamidi Rad, Sajad Ebrahimi, Charles L. A. Clarke, Ebrahim Bagheri

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: additional context, LLM-based query expansion, expansion improves retrieval, query expansion, expansion

Comments:

9. 【2606.13858】Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems

Link: https://arxiv.org/abs/2606.13858

Authors: Terence Zeng, Abhishek K. Umrawal

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: streaming platforms due, modern music streaming, music streaming platforms, essential in modern, streaming platforms

Comments: 13 pages, 4 figures, and 1 table

Abstract:Recommendation systems are essential in modern music streaming platforms due to the vast amount of available content. While collaborative filtering is widely used to suggest items based on the preferences of others with similar patterns, it performs poorly in domains where user-item interactions are sparse, such as music. Content-based filtering is an alternative approach that examines the qualities of the items themselves. Genre, instrumentation, and lyrics have been explored; however, relatively little attention has been given to emotion recognition. Since a user's emotional state strongly influences their music choice, incorporating mood signals offers a promising direction for personalization. In this work, we propose a mood-conditioned ranking framework that integrates user affective signals into the recommendation process via softmax-based sampling in the energy-valence space. We evaluate the approach via single-blind experiments in which participants compare recommendations from the proposed system against a baseline. The results indicate improved perceived recommendation quality, providing preliminary evidence for the effectiveness of incorporating mood-based inputs into music recommendations.
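
The abstract mentions softmax-based sampling in the energy-valence space without giving details. One possible reading is sketched below. It assumes each track and the user's mood are points in a 2-D (energy, valence) space, and that the logits are negative Euclidean distances divided by a temperature. The function names, the temperature value, and the example tracks are hypothetical, and the paper's actual scoring may differ.

```python
import math
import random


def mood_sampling_probs(tracks, mood, temperature=0.2):
    """Softmax over negative distances between each track and the mood point."""
    names = list(tracks)
    logits = [-math.dist(tracks[name], mood) / temperature for name in names]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


def sample_tracks(tracks, mood, k=1, temperature=0.2, seed=None):
    """Draw k tracks (with replacement) according to the softmax probabilities."""
    rng = random.Random(seed)
    probs = mood_sampling_probs(tracks, mood, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=k)


# Hypothetical catalogue: track name -> (energy, valence), both in [0, 1].
catalogue = {
    "calm_piano": (0.2, 0.6),
    "upbeat_pop": (0.8, 0.9),
    "dark_ambient": (0.3, 0.1),
}
user_mood = (0.75, 0.85)  # assumed mood estimate: energetic and positive

probs = mood_sampling_probs(catalogue, user_mood)
picks = sample_tracks(catalogue, user_mood, k=5, seed=0)
print(probs, picks)
```

A lower temperature concentrates probability on the tracks nearest the mood point, while a higher one makes the sampling closer to uniform.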

10. 【2606.13837】Hybrid Neural Retrieval with Generative Query Refinement for Quranic Passage Retrieval

Link: https://arxiv.org/abs/2606.13837

Authors: Mohamed G. Salman, Mohammad E. Moftah, Ali Hamdi

Categories: Information Retrieval (cs.IR)

Keywords: Modern Standard Arabic, Standard Arabic, Classical Arabic, Modern Standard, challenging task due

Comments: Accepted for presentation at the Intelligent Methods, Systems, and Applications (IMSA) 2026 conference. © 2026 IEEE

Abstract:Quranic Passage Retrieval (PR) is a challenging task due to the linguistic complexity and the semantic gap between the Modern Standard Arabic (MSA) used in daily queries and the Classical Arabic (CA) of the Holy Quran. These factors hinder conventional retrieval methods. To address these limitations, improve multi-verse retrieval, and filter zero-answer queries, this paper proposes a four-phase neural architecture designed to enhance retrieval accuracy and contextual understanding. The methodology combines hybrid candidate retrieval using AraColBERT dense indexing and BM25 sparse retrieval, followed by semantic reranking with a CAMeLBERTmix cross-encoder. A confidence gating mechanism is then applied to filter zero-answer queries, and an AraT5-based refinement module handles multi-verse aggregation. The system is evaluated on an expanded version of the Quran QA 2022 dataset. Results show improved performance compared to the baseline models, achieving a Recall@10 of 0.7024 and a Mean Average Precision (MAP@10) of 0.4947. While the system exhibits a marginal tradeoff in absolute top-rank precision (MRR = 0.5807) compared to heavily optimised single models, the proposed architecture provides a substantially more comprehensive, reliable, and context-aware solution for multi-verse Quranic passage retrieval.

11. 【2606.13814】TASR: Training-Free Adaptive Stopping for Iterative Retrieval

Link: https://arxiv.org/abs/2606.13814

Authors: Adrian Kieback, Uyiosa Philip Amadasun, Aman Chadha, Aaron Elkins

Categories: Information Retrieval (cs.IR)

Keywords: Iterative retrieval-augmented generation, retrieval-augmented generation agents, generation agents commonly, agents commonly overspend, Iterative retrieval-augmented

Comments: 9 pages, 5 figures. Accepted at Agent4IR Workshop, KDD 2026

Abstract:Iterative retrieval-augmented generation agents commonly overspend by continuing to retrieve after the model has converged on an answer, incurring calls that change neither the prediction nor the supporting evidence. Existing remedies learn a stopping policy from labeled trajectories, tying the decision to a trained component that requires retraining for each new model or task. We propose TASR (Training-Free Adaptive Stopping Rule), a one-line predicate that fires when the model repeats its previous-round normalized answer and the isotonically calibrated logit margin exceeds 0.25. No classifier or value head is learned; the threshold is fixed across all twenty-four (model, retriever, corpus) configurations we evaluate. On a 3-model x 2-dataset distractor grid, TASR retains 94.8% of fixed-k=5's macro F1 at 62.6% of its calls and exceeds fixed-k=3 by +3.42 F1. The pattern holds on nine open-domain BM25 cells (55.01 F1 at 2.98 calls vs. 54.33 at 3.00 for fixed-k=3) and, with calibration locked from the distractor split, on nine dense-retrieval cells across two retriever families, with zero significant regressions in either extension. The rule was selected from an exhaustive enumeration of 381 candidate stopping rules; no alternative Pareto-dominates it on any evaluated configuration. A signal-quality analysis shows that verbalized 1-5 confidence collapses on RLHF-tuned models (96.5% of values equal 5, entropy 0.182 nats), while the logit margin achieves 44x better class-conditional separation, grounding the design in a measurable model pathology. TASR is an auditable, training-free Pareto baseline against which learned stopping controllers can be compared. Code is publicly available.
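
The stopping predicate described in this abstract is simple enough to sketch. The code below is an illustration under assumptions, not the authors' released code. The names `normalize_answer` and `should_stop` are hypothetical, the normalization steps (lowercasing, removing punctuation and articles) are a common convention assumed here, and the margin is assumed to have been isotonically calibrated already. Only the two conditions and the 0.25 threshold come from the abstract.

```python
import re
import string

MARGIN_THRESHOLD = 0.25  # fixed threshold reported in the abstract


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def should_stop(prev_answer, curr_answer, calibrated_margin, threshold=MARGIN_THRESHOLD):
    """Stop when the answer repeats the previous round and the margin exceeds the threshold.

    `calibrated_margin` is assumed to be the already-calibrated logit margin;
    the calibration step itself is not shown here.
    """
    if prev_answer is None:  # first round: nothing to compare against
        return False
    repeated = normalize_answer(prev_answer) == normalize_answer(curr_answer)
    return repeated and calibrated_margin > threshold


# Usage: the answer repeats (after normalization) and the margin is high enough.
print(should_stop("The Eiffel Tower.", "eiffel tower", 0.40))
```

In an iterative retrieval loop, this check would run after each round, and a true result would end retrieval before the fixed call budget is spent.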

12. 【2606.13719】Nomenclature Ontology for Medical And Disease names (NOMAD): taxonomy of types and origins of disease names

Link: https://arxiv.org/abs/2606.13719

Authors: Spiros Denaxas, Cai Ytsma, Giannos Louloudis, Jackie MacArthur, Harry Hemingway

Categories: Information Retrieval (cs.IR)

Keywords: centuries using Greek, Arabic terminology, past centuries, terminology and reflects, reflects the idiosyncrasies

Comments:

Abstract:The nomenclature of human disease has developed organically over the past centuries using Greek, Latin, and Arabic terminology and reflects the idiosyncrasies of different eras of medical discovery. Despite evident heterogeneity in naming practices, no systematic framework exists for characterising these conventions across all diseases. In this paper, we describe the Nomenclature Ontology for Medical And Disease names (NOMAD), a meta-taxonomy that classifies disease names according to their naming conventions. We developed a two-level taxonomy comprising 9 top-level categories and 20 subcategories and applied it to 22,548 index entries from the ICD-10-CM 2026 Alphabetical Index in a scalable three-stage machine learning-driven classification pipeline. Classification was multi-label, reflecting the compositional nature of medical nomenclature. We classified 99.1% of terms with a mean of 2.12 labels per entry. Anatomical categories were the most prevalent (63.8% of entries), followed by Descriptive (48.4%) and Pathophysiological (40.2%), while Eponymous and Geographical labels were less common than their cultural prominence might suggest (9.7% and 1.9% respectively). Among all Eponymous diseases, we identified only 57 (2.6%) named after a female person. We manually reviewed a random sample of n=2,255 entries (10%) for accuracy and calculated a full agreement rate of 70% and a partial agreement rate of 26% (macro-averaged Cohen's Kappa score 0.832). Naming convention profiles varied substantially across ICD-10-CM chapters, reflecting specialty-specific epistemological traditions: infectious disease chapters were dominated by etiological labels and showed the highest proportion of labels related to geographical regions, the circulatory chapter was dominated by anatomical and pathophysiological labels, and mental and behavioural disorders showed the highest prevalence of socio-behavioral labels.

13. 【2606.13717】Personalization and Evaluation of Conversational Information Access

Link: https://arxiv.org/abs/2606.13717

Authors: Hideaki Joko

Categories: Information Retrieval (cs.IR)

Keywords: users increasingly favour, increasingly favour direct, favour direct answers, reshaped information retrieval, Conversational Information Access

Comments: PhD Thesis of Hideaki Joko (Radboud University, the Netherlands)

Abstract:Conversational interactions have reshaped information retrieval systems, as users increasingly favour direct answers over traditional hyperlinks. To build reliable Conversational Information Access (CIA) systems that account for personal context, this thesis addresses three challenges: (1) personal context extraction, (2) personalized response generation, and (3) effective and interpretable system evaluation. First, we tackle personal context extraction by studying what Entity Linking (EL) in conversations entails, introducing a dataset for conversational entity linking (ConEL), and proposing CREL, a novel EL method tailored for conversational settings. Second, we focus on personalized response generation by proposing LAPS, a method for efficiently constructing large-scale, human-written, personalized conversational datasets, and using them to study how users' preferences can be utilized to generate personalized responses. Finally, we address the need for effective and interpretable system evaluation by introducing FACE, an automatic, reference-free method that assesses entire conversations and aligns closely with human judgments.

Computer Vision

1. 【2606.14703】Gaze Heads: How VLMs Look at What They Describe

Link: https://arxiv.org/abs/2606.14703

Authors: Rohit Gandikota, David Bau

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: vision-language model internally, model internally solves, internally solves, solves the task, heads

Comments:

2. 【2606.14702】OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Link: https://arxiv.org/abs/2606.14702

Authors: Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: audio-visual Question Answering, Current automated pipelines, Question Answering, generally adopt, Current automated

Comments: Project page: [this https URL](https://github.com/MiG-NJU/OmniVideo-100K)

Abstract:Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

3. 【2606.14701】RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

Link: https://arxiv.org/abs/2606.14701

Authors: Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: assembly of reusable, bird, Register Attention Transformers, reusable parts, RATS

Comments:

Abstract:When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L-N-N-L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.

4. 【2606.14700】RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Link: https://arxiv.org/abs/2606.14700

Authors: Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, trained generative backbones, Large language, newly trained generative, language models

Comments: Project Page: [this https URL](https://xichenpan.com/repfusion)

Abstract:Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.

5. 【2606.14699】Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Link: https://arxiv.org/abs/2606.14699

Authors: Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: Reconstructing articulated, important for animation, robotic simulations, Reconstructing, kinematic specification

Comments: Project page: [this https URL](https://instruct-particulate.github.io/)

Abstract:Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

6. 【2606.14697】ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Link: https://arxiv.org/abs/2606.14697

Authors: Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Building trustworthy medical, large language models, clinical decision support, multimodal large language, reliable clinical decision

Comments: Code and datasets: [this https URL](https://github.com/alibaba-damo-academy/ClinHallu)

7. 【2606.14686】CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

Link: https://arxiv.org/abs/2606.14686

Authors: Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: economically beneficial crop, highly economically beneficial, textile industry heavily, industry heavily depends, cotton leaf disease

Comments: This paper contains 11 figures and 4 tables. Presented at the 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

Abstract:Cotton is an economically important crop worldwide, as the textile industry depends heavily on it. Precise identification and detection of cotton leaf disease is therefore crucial for economic stability. The goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. To this end, we evaluated multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset includes seven classes (six disease classes and one healthy class), collected under various field conditions reflecting real-world challenges. Among these pretrained models, DenseNet201 achieved the highest classification accuracy of 98%. To enhance model reliability and interpretability, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and used adversarial training to increase the noise resistance of the model. Finally, we developed a prototype to apply the model's capabilities in real-life agriculture. This paper demonstrates the deep learning model's ability to classify disease in real-life cotton disease management situations.

8. 【2606.14684】HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification

Link: https://arxiv.org/abs/2606.14684

Authors: Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid, Riasat Khan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: classification systems require, fire classification systems, Progressive Knowledge Distillation, efficient fire classification, Hybrid Uncertainty-aware Multi-stage

Comments:

Abstract:Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images, are used. Various CNN and transformer baselines are applied under standard preprocessing, online augmentation, Gaussian noise and motion blur robustness conditions. The proposed HumP-KD model distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder that generates a fused spatial attention mask to guide distillation selectively toward discriminative regions. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of $0.9876 \pm 0.0063$ across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation ($0.9537 \pm 0.0351$), with statistical significance confirmed by both independent t-test ($p = 0.0195$) and Wilcoxon signed-rank test ($W = 1$, $p = 0.0039$). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and a 19.01 MB model size, representing a $5.7\times$ parameter reduction over Swin-Tiny and a $17.5\times$ reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.

9. 【2606.14667】Memento: Reconstruct to Remember for Consistent Long Video Generation

Link: https://arxiv.org/abs/2606.14667

Authors: Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Qingqi Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Long-form video generation, Long-form video, scene transitions, video generation requires, generating videos shot

Comments: Project page: [this https URL](https://ernie-research.github.io/Memento/)

Abstract:Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

10. 【2606.14658】Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

Link: https://arxiv.org/abs/2606.14658

Authors: Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous vehicle control, real-world computer vision, vehicle control, Artificial Intelligence, automate a variety

Comments: 9 pages, 7 figures, SPIE Defense + Security

Abstract:Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz) to perform attacks, which limits them to short distances due to the attenuation exhibited by high frequencies. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

Journal reference: Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026)

Related DOI: https://doi.org/10.1117/12.3093699

11. 【2606.14657】HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

Link: https://arxiv.org/abs/2606.14657

Authors: Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reward models guide, reward model, systems toward outputs, Reward models, Reward

Comments:

Abstract:Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discriminative shifts arising from evolving model capabilities and reinforcement learning (RL) iterations, limiting their broader applicability. In this work, we propose HPSv3++, a reward model framework that elevates the HPSv3 model for varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-range capabilities. The code is available at this https URL.
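
The abstract does not spell out how the "data-aware" orthogonal gradient projection in Stage 1 works. As a rough illustration only, the plain projection step that such methods build on can be sketched as below. The function name `project_orthogonal` and the toy gradients are hypothetical and are not taken from the HPSv3++ code.

```python
import numpy as np

def project_orthogonal(grad_new, grad_old):
    """Remove from grad_new its component along grad_old.

    The result is orthogonal to grad_old, so a small step along it does
    not change, to first order, the loss that produced grad_old.
    """
    denom = float(grad_old @ grad_old)
    if denom == 0.0:
        # Nothing to protect: return the new gradient unchanged.
        return grad_new.copy()
    return grad_new - (grad_new @ grad_old) / denom * grad_old

# Toy gradients (illustrative values only).
grad_old = np.array([1.0, 0.0])   # direction tied to existing knowledge
grad_new = np.array([1.0, 1.0])   # gradient from the new data
projected = project_orthogonal(grad_new, grad_old)
```

In this toy case the projected update keeps only the part of the new gradient that does not interfere with the old direction.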

12. 【2606.14638】Improving Lunar Topography with Deep Learning Schrödinger Bridges

Link: https://arxiv.org/abs/2606.14638

Authors: Matthew Repasky,Erwan Mazarico,Michael K. Barker,Stefano Bertone,Terence J. Sabaka,Yao Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Earth and Planetary Astrophysics (astro-ph.EP)

Keywords: processes and geomorphology, planetary topography models, understanding of surface, surface processes, expensive and difficult

Comments:

Abstract:Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

13. 【2606.14631】SED: Lightweight Saliency prediction for Event-based data via Distillation

Link: https://arxiv.org/abs/2606.14631

Authors: Romaric Mazna,Jean Martinet,Michele Magno

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained attention recently, downstream eventbased perception, Event-based saliency prediction, combining event cameras, event-based saliency benchmarks

Comments:

Abstract:Event-based saliency prediction has gained attention recently, as combining event cameras with saliency estimation can act as an upstream stage that naturally improves the efficiency of downstream event-based perception at the edge. However, current approaches are either neuromorphic, underperforming on event-based saliency benchmarks, or too heavy for resource-constrained edge applications due to their reliance on transformers or 3D convolutions. Drawing inspiration from efficient convolutional modules and aiming to exploit the temporal information in event data, we propose SED, a lightweight network, trained through knowledge distillation, built on a Depthwise Spatio-Temporal Block (DSTconv) -- a factorization of the 3D depthwise separable convolution. Relative to its teacher, our model reduces the model size from 180 MB to 0.32 MB (562x) and the parameter count from 45M to 81k (554x), while matching or outperforming it on the N-DHF1K and N-UCF Sports datasets. Moreover, it generalizes strongly beyond its training distribution, transferring from synthetic to real event data where a model trained from scratch fails.

14. 【2606.14619】StereoGeo: an end-to-end stereo camera calibration method

Link: https://arxiv.org/abs/2606.14619

Authors: Imane Meddour,Andréa Macario Barros,Cédric Gouy-Pailler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: network-based approach, extrinsic estimation, Abstract, propose StereoGeo, extrinsic

Comments: 5 pages, 1 figure, accepted at the 34th European Signal Processing Conference (EUSIPCO 2026)

Abstract:In this work, we propose StereoGeo, an end-to-end network-based approach for stereo camera calibration. Our method estimates the focal lengths and gravity directions of the left and right cameras, as well as the relative extrinsic transformation relating them. Existing methods often rely on calibration patterns in structured environments or address only a single camera configuration, being limited to either intrinsic or extrinsic estimation and depending on multi-view setups. StereoGeo extends the GeoCalib algorithm, integrating deep neural network feature extraction with a differentiable optimizer. Extensive experiments on real-world benchmarks demonstrate that StereoGeo achieves competitive performance for intrinsic calibration and provides accurate stereo extrinsic estimation, outperforming existing methods that are limited to monocular settings. The dataset used in this work is partially publicly available at this https URL.

15. 【2606.14586】S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning

Link: https://arxiv.org/abs/2606.14586

Authors: Shilong Xiang,Zirui Zhang,Chengzhi Mao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current representation learning, Current representation, representation learning paradigms, learning paradigms force, yield opaque features

Comments:

Abstract:Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S$^2$COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S$^2$COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S$^2$COPE successfully extracts domain-specific concepts that standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective -- rather than relying on static generation and disjoint filtering -- we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.

16. 【2606.14578】A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications

Link: https://arxiv.org/abs/2606.14578

Authors: Paul Koch,Paul Hofmann,Ferdinand Waßelewsky,Adem Karakurt,Andre Sérs,Jörg Krüger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ensure predictable behaviors, predictable behaviors, require a profound, ensure predictable, vision applications require

Comments: Accepted to Computing Conference 2026

Abstract:AI-driven computer vision applications require a profound database to ensure predictable behaviors and performance. Such predictable behaviors are especially important for industrial applications in gaining trust from users. However, such a database is not readily available in industrial applications, and its acquisition is not trivial either. Active learning methods can be applied to ramp up data within a project deployment to iteratively increase the database, and thus the application predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. This leads to a "chicken-and-egg" dilemma in which neither the database nor the application is developed. In this work, we review state-of-the-art methods and approaches to further boost the database during the initial active data ramp-up phase. Here, we focus on recent advancements in GenAI-based data generation and augmentation methods and review their adaptability on an industrial computer vision classification use case. Although we observe a potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and the target (industrial use case) with regard to context defined in natural language and object characteristics.

17. 【2606.14562】NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

Link: https://arxiv.org/abs/2606.14562

Authors: Constanza A. Molina Catricheo,Simon Boeder,Ting-Jia Guo,Giacomo May,Clément Berthelot,Devis Tuia,Friedrich Fedor Reinhard,Fabio Remondino,Benjamin Risse

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Sociable weaver nests, studies lack fine-grained, structures offering thermoregulatory, offering thermoregulatory microhabitats, prior studies lack

Comments: 14 pages, 4 figures. Dataset available at [this https URL](https://huggingface.co/NEST3D)

Abstract:Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.
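
For readers unfamiliar with the metric, the sketch below shows one common way to compute mean intersection-over-union (mIoU) from per-point labels, and why it is stricter than plain accuracy under class imbalance. The function name `mean_iou` and the toy labels are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both arrays, so it is skipped
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy example: class 1 is rare and the model never predicts it.
target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
pred = np.zeros(10, dtype=int)

accuracy = float((pred == target).mean())
miou = mean_iou(pred, target, num_classes=2)
```

Here the accuracy looks acceptable because the majority class dominates, while the mIoU is pulled down by the rare class that is missed entirely.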

18. 【2606.14556】Visual Quality Score Assessment of Large White Goods in Remanufacture with Multi-View Deformable-DETR

Link: https://arxiv.org/abs/2606.14556

Authors: Paul Koch,Vivek Chavan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large white goods, Remanufacturing large white, circular economy, training and pricing, large white

Comments: Accepted to GCSM 2026

Abstract:Remanufacturing large white goods is essential for a circular economy, yet visual quality assessment remains a manual bottleneck for training and pricing. Conventional detection methods require extensive annotation and struggle with small defects in high-resolution multi-view data. We present a multi-view framework based on Deformable-DETR for automated quality scoring that aggregates information across redundant views to extract fine-grained features. To enhance robustness with limited labels, we employ self-supervised pretraining followed by supervised fine-tuning on expert-annotated scores. Additionally, a linear projection over frozen feature maps identifies regions of interest to explain model decisions. Evaluated on an industrial multi-view dataset, our approach delivers precise quality assessments while reducing reliance on manual annotation and per-part customization, enabling scalable and transparent inspection for remanufacturing lines.

19. 【2606.14555】Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

Link: https://arxiv.org/abs/2606.14555

Authors: Aray Karjauv

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: widely adopt global, global average pooling, adopt global average, linear classification head, Modern image classifiers

Comments:

Abstract:Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
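
The linearity argument in this abstract is easy to check numerically. The sketch below uses random features and a random linear head (all sizes and variable names are made up for illustration) to show that pooling and then classifying gives the same logits as classifying every grid location and then averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C channels, an H x W feature grid, K classes.
C, H, W, K = 8, 4, 4, 3
features = rng.normal(size=(C, H, W))   # feature grid before pooling
weight = rng.normal(size=(K, C))        # linear head weights
bias = rng.normal(size=K)               # linear head bias

# Standard path: global average pooling, then the linear head.
pooled = features.mean(axis=(1, 2))     # shape (C,)
image_logits = weight @ pooled + bias   # shape (K,)

# Pointwise path: apply the same head at every grid location.
prediction_grid = (
    np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]
)                                       # shape (K, H, W)

# Averaging the prediction grid recovers the image-level logits.
grid_average = prediction_grid.mean(axis=(1, 2))

# Per-location class decisions, which pooling would otherwise hide.
location_classes = prediction_grid.argmax(axis=0)   # shape (H, W)
```

Reading `location_classes` as a bag of per-location predictions is the multiple-instance view the abstract describes, with the mean acting as the bag aggregator.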

20. 【2606.14534】A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

Link: https://arxiv.org/abs/2606.14534

Authors: Anna Bicchi,Alberto Rota,Leonardo Passoni,Nicola Ancellotti,Andrea Peroni,Lorenzo Vinco,Dario Polli,Elena De Momi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinical translation requires, translation requires aligning, Hyperspectral Imaging, aligning the inherently, spectral information

Comments:

Abstract:Hyperspectral Imaging (HSI) is a promising modality for intraoperative assessment of resection margins in Breast-Conserving Surgery (BCS), but its clinical translation requires aligning the inherently 2D spectral information onto the 3D shape of the excised tissue so that suspicious regions can be precisely localized for targeted follow-up. We present a fully automated, calibration-free pipeline that produces a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from a set of consumer-camera RGB images and a single top-down HSI acquisition. The 3D geometry is reconstructed with a deep-learning Structure-from-Motion backbone, stabilized in a metric reference frame by a custom bundle adjustment that enforces consistency on the corners of four ArUco markers placed around the specimen. The HSI cube is then registered to the reconstruction without recovering the HSI camera pose: the markers, visible in both modalities, define 16 corner correspondences that drive a planar homography, and 3D coordinates are recovered by lookup on an orthographically rendered depth map. Evaluated on two ex-vivo lumpectomy specimens, the pipeline achieves a median 3D registration error below 1 mm and a 2D reprojection error below 0.02 mm, with a total per-specimen processing time under 4 minutes on accelerated hardware. These results support the feasibility of integrating HSI-guided spatial localization into intraoperative margin assessment workflows for breast-conserving surgery.
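
The homography step in this abstract can be illustrated with a standard direct linear transform (DLT) fit. The sketch below is not the authors' implementation: the marker layout, the ground-truth matrix, and the function names are all hypothetical, and the data is noise-free synthetic data.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate a 3x3 H with dst ~ H @ src via the direct linear transform."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def apply_homography(h, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

# Four square markers with four corners each: 16 correspondences.
centers = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 6.0], [0.0, 6.0]])
offsets = np.array([[-0.5, -0.5], [0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]])
src = (centers[:, None, :] + offsets[None, :, :]).reshape(-1, 2)

# Made-up ground-truth homography used to synthesize the second view.
h_true = np.array([[1.2, 0.1, 3.0],
                   [-0.05, 1.1, 2.0],
                   [0.001, 0.002, 1.0]])
dst = apply_homography(h_true, src)

h_est = fit_homography(src, dst)
reprojection_error = np.linalg.norm(apply_homography(h_est, src) - dst, axis=1)
```

With real detections, the corner coordinates would usually be normalized before the SVD and the fit refined robustly. Those steps are left out here to keep the sketch short.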

21. 【2606.14504】Scratched Lenses, Shifted Depth: Passive Camera-Side Optical Attacks

Link: https://arxiv.org/abs/2606.14504

Authors: Qinlin He,Zeming Zhuang,Yongji Wu,Lan Zhang,Xiaoyong(Brian)Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: patches or projections, typically studied, adversary controls, Streak Hijacking SLASH, depth

Comments:

Abstract:Physical adversarial attacks on vision systems are typically studied through scene manipulation, such as adversarial patches or projections, where the adversary controls what the camera observes. Camera-side attacks using stickers or auxiliary optics have also been explored, but they treat attacks as image-space perturbations from designed patterns. This misses how physical imperfections interact with scene-dependent lighting and optics. We identify a threat: passive lens-side damage that is persistent yet trigger-conditioned, producing optical artifacts that bias geometric inference under particular visual conditions. We instantiate this threat through Scratch-induced Lens Adversarial Streak Hijacking (SLASH), a physical-world attack caused by small scratches on a camera lens or protective cover. Scratches interact with bright light sources and specular reflections to create structured streak artifacts that distort depth cues. Since the perturbation is fixed in the optical path but triggered by the scene, it is both persistent and selective. We formulate the attack in optical space, model the scratch pattern as a trigger-conditioned optical channel, and optimize one fixed configuration across diverse viewing conditions. We evaluate SLASH on monocular depth estimation and monocular 3D object detection in digital and real-world settings. Under the fixed-scratch constraint, directional depth shifts reach up to 32% relative error for monocular depth estimation, with consistent effects on monocular 3D object detection. Physical experiments confirm transfer to real camera recordings, inducing depth shifts above the model's natural prediction baseline. These findings reveal an attack surface where benign-looking hardware imperfections act as latent, scene-triggered adversarial mechanisms, challenging assumptions about physical robustness and motivating defenses for secure vision systems.

</p>

22. 【2606.14475】Value-order Decomposition for Generalist Anomaly Detection

Link: https://arxiv.org/abs/2606.14475

Authors: Miaoyun Zhao,Jing Chen,Miaoni Zhao,Qiang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anomaly detection suffers, anomaly detection, Industrial anomaly detection, making cross-domain generalization, defect types

Comments:

Abstract:Industrial anomaly detection suffers from limited data, making cross-domain generalization particularly challenging. Generalist Anomaly Detection (GAD) aims to train a unified model on a source domain that can effectively detect anomalies in unseen target domains. In the initial semantic feature space, strong entanglement between anomalies and object categories or defect types hinders effective generalization across domains. Recent works address this issue by projecting features into a residual space; however, such methods primarily increase cross-domain overlap for normal features, while anomalous features remain specific to object categories, defect types and data domains, leading to poor alignment and generalization. To address this limitation, we propose Value-order Decomposition (VOD), a simple yet effective technique that bridges three types of generalization gaps across object categories, defect types (including real and synthetic defects), and data domains. VOD disentangles and suppresses object-category-, defect-type-, and domain-specific information, promoting alignment within normal and abnormal samples while preserving their separability, thereby enabling robust generalization across the three gaps. Leveraging the strong alignment between real and synthetic defects within the same object, we perform anomaly detection using only normal and synthetic-abnormal reference, and effectively generalize to unseen real defect types. Experiments on diverse industrial and medical benchmarks demonstrate that our method, using a simple cut-and-paste anomaly simulation strategy, achieves strong generalization across the three gaps.

23. 【2606.14389】MooMIns -- Monocular 3D Reconstruction and Object Pose Estimation from Multiple Instances

Link: https://arxiv.org/abs/2606.14389

Authors: Robert Langendörfer,Markus Hillemann,Markus Ulrich

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherently ill-posed problem, ill-posed problem, inherently ill-posed, Simultaneous, single

Comments:

Abstract:Simultaneous 3D reconstruction and 6D object pose estimation from a single monocular image is an inherently ill-posed problem. In industrial settings, however, multiple instances of an object are often randomly arranged in bins, implicitly providing several views of the same object within a single image. We show that this implicit multi-view geometry can be exploited to simultaneously reconstruct the object in 3D and estimate the 6D pose of each visible object instance. We present MooMIns, a new Gaussian-splatting-based approach that inverts the original Gaussian splatting formulation: instead of rendering a single scene from multiple cameras, we render multiple object instances from a single camera. Our method is initialized with SAM3 instance segmentation masks and a modified Structure from Motion (SfM) pipeline. In contrast to learned monocular depth estimation, we perform true geometry-based reconstruction from image evidence, avoiding hallucinations caused by training data priors. We evaluate MooMIns on synthetic and real bin-picking scenarios, and demonstrate accurate reconstruction of previously unseen objects as well as reliable pose estimation of individual instances.

24. 【2606.14383】IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Link: https://arxiv.org/abs/2606.14383

Authors: Haonan Qi,Jin Cao,Yongqi Zhang,Xintong Wang,Weidong Tang,Bin Chen,Chengfu Huo,Haojun Pan,Hengyu You,Jing Li,Yingde Wang,Liang Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, dense technical specifications, govern procurement, supply chains

Comments:

Abstract:Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction: recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15-34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
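
The gap between precision and completeness described above can be made concrete with a small calculation. The sketch below assumes exact-match scoring over property-value pairs, which may not be the benchmark's actual matching rule. The function name and the example attributes are hypothetical.

```python
def precision_recall(predicted, reference):
    """Score predicted (property, value) pairs against reference pairs."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

# Made-up product with four reference attributes spread over several images.
reference = {
    ("nominal diameter", "DN50"),
    ("pressure rating", "PN16"),
    ("body material", "WCB"),
    ("connection", "flange"),
}
# A model that reads only the nameplate returns two pairs, both correct.
predicted = {
    ("nominal diameter", "DN50"),
    ("pressure rating", "PN16"),
}

precision, recall = precision_recall(predicted, reference)
```

Everything the model returned is right, so precision is perfect, yet half of the product's attributes are still missing, which is the pattern the abstract reports at scale.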

25. 【2606.14380】FLaRA: Predicting Future Latent Representations for Accident Anticipation

Link: https://arxiv.org/abs/2606.14380

Authors: Lorenzo Caselli,Tomaso Trinci,Tommaso Bianconcini,Simone Magistri,Leonardo Taccari,Francesco Sambo,Andrew D. Bagdanov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Anticipating traffic accidents, intelligent transportation systems, Anticipating traffic, transportation systems, Accident Anticipation

Comments: Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Abstract:Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that shifts this paradigm by forecasting future latent representations for accident anticipation. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To ensure these forecasts remain grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validations on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early warning capabilities.

26. 【2606.14355】Point Cloud Upsampling through Patch-based Frequency Superposition

Link: https://arxiv.org/abs/2606.14355

Authors: Marina Ritthaler,Azhar Hussian,Vasileios Belagiannis,André Kaup

Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: recent years, neural networks, point cloud upsampling, dominant models, Patch-based Frequency Superposition

Comments:

Abstract:In recent years, neural networks have become the dominant models in most point cloud upsampling methods. Although these approaches achieve good results, they have drawbacks, such as a lack of interpretability and a dependence on data. Moreover, they have to be trained on a dataset that is similar to the test data in order to perform well. To avoid these disadvantages, we propose Point Cloud Upsampling through Patch-based Frequency Superposition (PUtPFS), an optimization-based approach that selects subsets of points and estimates the surface of each subset by superposing spatial frequencies. Then, new points are placed on this surface. By successively selecting points in the least dense regions of the point cloud, a uniform upsampling can be reached. With this method, we surpass the current best upsampling results in the commonly considered point-to-surface distance. Furthermore, we achieve the best Chamfer and Hausdorff distance among the optimization-based approaches. As an additional advantage, our method does not need any training data and is mathematically interpretable.
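
The Chamfer and Hausdorff distances named in this abstract can be computed by brute force for small point clouds, as in the sketch below. Conventions differ between papers (squared versus unsquared distances, sum versus mean of the two directions), so the definitions here are one common choice and not necessarily the one used by the authors. The function names and toy clouds are hypothetical.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def chamfer_distance(a, b):
    """Sum of the mean nearest-neighbor distances in both directions."""
    d = pairwise_distances(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def hausdorff_distance(a, b):
    """Largest nearest-neighbor distance in either direction."""
    d = pairwise_distances(a, b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Toy clouds: b contains a plus one outlier two units away from a.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])

chamfer = chamfer_distance(a, b)
hausdorff = hausdorff_distance(a, b)
```

The single outlier barely moves the Chamfer distance, because it is averaged, but it fully determines the Hausdorff distance, which is why the two are usually reported together.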

27. 【2606.14351】ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models

Link: https://arxiv.org/abs/2606.14351

Authors: Dong Han,Yong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advance of generative, concept erasing, unsafe, concept erasing methods, generate unsafe contents

Comments: Accepted to ICML 2026

Abstract:With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to excessively erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining model capability in safe semantic meaning interpretation by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter to project partial text embedding for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.

28. 【2606.14317】CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

Link: https://arxiv.org/abs/2606.14317

Authors: Sihan Zhuang,Xinyuan Chen,Tianfan Xue,Yaohui Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly improved visual, Recent advances, improved visual quality, advances in diffusion-based, significantly improved

Comments: Project Page: [this https URL](https://zhuangsh0713.github.io/CausalMotion/)

Abstract:Recent advances in diffusion-based video generation have significantly improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. This limitation arises from the fact that video diffusion models primarily learn physical consistency implicitly, while vision-language models can directly model physical laws. Based on this idea, in this work, we propose CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation by leveraging a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints to guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without requiring additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.

29. 【2606.14307】Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

Link: https://arxiv.org/abs/2606.14307

Authors: Victor Barberteguy, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Gül Varol, Cordelia Schmid

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, Recent advances, camera parameters, achieved remarkable, remarkable success

Comments: Project page: [this https URL](https://victorbbt.github.io/Pano3D/)

Abstract:Recent advances in 3D feedforward reconstruction neural networks have achieved remarkable success in dense reconstruction from images without any camera parameters. Yet, equipping these models with robust semantic understanding remains an open problem. Here we introduce an approach that performs 3D reconstruction and 3D panoptic segmentation in a unified framework. We build on existing 3D reconstruction models and augment them with a set-based mask decoder. The approach is jointly trained with a geometric and semantic loss, which are shown to be mutually beneficial. More precisely, the features are initialized from the geometric information and then finetuned to capture jointly geometry and semantics. We demonstrate the generality of our approach by successfully applying our framework both to online and all-to-all attention reconstruction backbones. Our method achieves state-of-the-art performance in 3D panoptic segmentation across ScanNet, ScanNet200, and ScanNet++ datasets. Ablation studies show that such joint training of a unified model equips 3D feedforward reconstruction neural networks with panoptic segmentation and yields mutually beneficial improvements.

30. 【2606.14299】What Drives Test-Time Adaptation for CLIP? A Controlled Empirical Study from an Update Perspective

Link: https://arxiv.org/abs/2606.14299

Authors: Jiazhen Huang, Xiao Chen, Zhiming Liu, Yaru Sun, Jingyan Jiang, Zhi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Vision-Language Models, predictions remain vulnerable, zero-shot predictions remain, distribution shifts encountered, open-vocabulary recognition

Comments:

Abstract:Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where their gains originate, and under which shifts they remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.

31. 【2606.14297】Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Link: https://arxiv.org/abs/2606.14297

Authors: Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: raises privacy concerns, Developing accurate crowd-counting, large gatherings raises, gatherings raises privacy, Developing accurate

Comments:

Abstract:Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

32. 【2606.14292】A Robust Point Cloud Analysis Framework Inspired By Primary Visual Cortex

Link: https://arxiv.org/abs/2606.14292

Authors: Jisheng Dang, Dengyue Pan, Delin Deng, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, reducing energy consumption, Convolutional Neural, Neural Network, Continuous-Coupled Neural Network

Comments: 12 pages, 2 figures, 7 tables

Abstract:Despite significant advancements in point cloud analysis, reducing energy consumption and improving robustness remain understudied, largely due to the inherent limitations of Convolutional Neural Networks (CNNs). To address this issue, we draw inspiration from the primary visual cortex and propose a Dendritic-Connected Continuous-Coupled Neural Network (DC-CCNN), a novel Brain-Inspired Neural Network (BINN) architecture for point cloud analysis. By combining discrete and continuous encoding, our design replaces traditional Multilayer Perceptrons (MLPs) with more efficient and robust BINNs. Building upon this framework, we further propose an extended model, DC-CCNN++, to improve robustness under complex corruption conditions. Specifically, we introduce a Neuro-Inspired Robust Modulation-and-Readout Module (NRMR) to enhance feature stability and decision robustness through global-context gain modulation and dual-code evidence integration. We also design a Cortically Inspired Progressive Variability Training (CPVT) strategy, which progressively exposes the model to structured environmental variability while preserving stable clean-sample anchors during training. Experimental results show that DC-CCNN++ improves the performance of brain-inspired networks on point cloud analysis while maintaining performance comparable to state-of-the-art methods. Compared with the original DC-CCNN, it achieves stronger results on both classification and part segmentation, and exhibits enhanced robustness against sparsity, occlusion, Gaussian noise, salt-and-pepper noise, and spatial transformations. With its efficiency, robustness, and biologically grounded design, DC-CCNN++ provides a promising alternative to traditional deep learning methods for point cloud analysis. Code is available at this https URL.

33. 【2606.14277】One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

Link: https://arxiv.org/abs/2606.14277

Authors: Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan, Ye Ren, Jilin Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, diverse multimodal tasks, achieved remarkable success, practical deployment remains, Large Vision-Language

Comments: Accepted by CVPR 2026 (highlight)

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
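The routing idea described in this abstract (a selector picks the important tokens for a layer, the remaining tokens skip that layer, and both streams are merged before the next layer) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `route_tokens`, the scalar token values, the externally supplied `scores`, and the toy `double` layer are all hypothetical, and the paper's low-rank approximation of attention used for selection is not modeled here.

```python
from typing import Callable, List, Sequence


def route_tokens(
    tokens: Sequence[float],
    scores: Sequence[float],
    keep: int,
    layer: Callable[[List[float]], List[float]],
) -> List[float]:
    """Apply `layer` only to the `keep` highest-scoring tokens.

    Tokens that are not selected bypass the layer unchanged, and the two
    streams are merged back in the original token order. Nothing is
    discarded, so a later layer is free to select a different subset.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(0, min(keep, len(tokens)))

    # Indices of the top-`keep` tokens by score, restored to sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:keep])

    processed = layer([tokens[i] for i in selected])
    if len(processed) != len(selected):
        raise ValueError("layer must return one output per input token")

    merged = list(tokens)  # skipped tokens are carried over as-is
    for idx, value in zip(selected, processed):
        merged[idx] = value
    return merged


def double(xs: List[float]) -> List[float]:
    """Toy stand-in for a transformer layer (hypothetical)."""
    return [2.0 * x for x in xs]


if __name__ == "__main__":
    print(route_tokens([1.0, 2.0, 3.0, 4.0], [0.9, 0.1, 0.8, 0.2], keep=2, layer=double))
```

The contrast with static pruning is that the skipped tokens stay in `merged`, so they remain available to every subsequent layer.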

34. 【2606.14251】HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

Link: https://arxiv.org/abs/2606.14251

Authors: Weiyi Wu, Xinwen Xu, Xingjian Diao, Siting Li, Zhi Wei, Alma Andersson, Jiang Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: links gene expression, Spatial transcriptomics, infer expression, expensive and low-throughput, motivating surrogates

Comments:

Abstract:Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and low-throughput, motivating surrogates that infer expression from routine histology. Whole-slide HE-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder-decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a slide calibration token that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.
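The scaling claim in this abstract (cost follows the number of observed locations, not the dense slide area) can be illustrated by partitioning sparse lattice coordinates into windows and counting attention pairs within each non-empty window. This is a rough sketch under assumptions: `group_into_windows`, `attention_pairs`, and the integer (row, col) coordinate convention are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[int, int]


def group_into_windows(coords: Sequence[Coord], window: int) -> Dict[Coord, List[int]]:
    """Map each non-empty window to the indices of the observed spots inside it."""
    if window <= 0:
        raise ValueError("window must be positive")
    windows: Dict[Coord, List[int]] = defaultdict(list)
    for idx, (row, col) in enumerate(coords):
        # Empty regions of the slide never create an entry, so they cost nothing.
        windows[(row // window, col // window)].append(idx)
    return dict(windows)


def attention_pairs(windows: Dict[Coord, List[int]]) -> int:
    """Count query-key pairs when attention is computed within each window only."""
    return sum(len(members) ** 2 for members in windows.values())


if __name__ == "__main__":
    spots = [(0, 0), (1, 1), (0, 5), (9, 9)]
    grouped = group_into_windows(spots, window=4)
    print(grouped, attention_pairs(grouped))
```

With a fixed window size, each window holds at most `window * window` lattice sites, so the pair count is bounded by `window * window` times the number of observed spots.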

35. 【2606.14230】A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

Link: https://arxiv.org/abs/2606.14230

Authors: Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Generative Adversarial Networks, artificially generated images, threaten privacy, information integrity, videos that threaten

Comments:

36. 【2606.14194】Hybrid Classical-Quantum (HCQ) Alzheimer's Classification via Supervised $β$-VAE and Quantum Kernels

Link: https://arxiv.org/abs/2606.14194

Authors: Tia Tiwari, Vamshi Krishna Kancharla, Neelam Sinha

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: two-stage Hybrid Classical-Quantum, structural MRI volumes, binary Alzheimer disease, Hybrid Classical-Quantum, two-stage Hybrid

Comments:

Abstract:This paper presents a two-stage Hybrid Classical-Quantum (HCQ) pipeline for binary Alzheimer's disease (AD) classification from 3D T1-weighted structural MRI volumes, where the classical and quantum components are designed to complement each other rather than operate independently. A supervised 3D $\beta$-variational autoencoder (VAE) is trained end-to-end under voxel-wise reconstruction, KL-divergence, and focal classification losses that compress each 3D MRI volume (resized from 152 x 184 x 152 to 96 x 96 x 96) into a 64-dimensional latent code. Partial Least Squares (PLS) regression selects the six components in the latent code that best separate Alzheimer's Disease (AD) from cognitively normal (CN) subjects and rescales them into rotation angles, which are encoded onto a six-qubit register using the ZZ quantum feature map to give us the respective quantum states. The input to a precomputed-kernel Support Vector Machine (SVM) is an N x N Gram matrix (N = 308), created by calculating the overlap between every pair of quantum states. The novelty of this work lies in the fact that the quantum kernel operates directly on disease-aware features that are learned end-to-end by a supervised autoencoder, rather than on pre-extracted inputs. On 308 ADNI-1 subjects, consisting of 137 AD and 171 CN subjects, the baseline achieved 67.2% accuracy and 0.759 AUC, while the stability-enhanced variant reached 72.1% accuracy and 0.799 AUC with cross-fold variance halved. 3D Grad-CAM further helped validate our model's focus on brain regions linked to Alzheimer's. The HCQ pipeline could serve as a general-purpose framework for diagnostic classification across biomedical imaging domains that present similar challenges for classical approaches.
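The kernel step in this abstract (an N x N Gram matrix built from the overlap between every pair of quantum states) can be sketched classically. Note the substitution: the code below does not implement the ZZ feature map used in the paper. It assumes a simpler product-state angle encoding, chosen because its overlap has a closed form, and the names `angle_encoding_fidelity` and `gram_matrix` are hypothetical.

```python
import math
from typing import List, Sequence


def angle_encoding_fidelity(x: Sequence[float], y: Sequence[float]) -> float:
    """Fidelity |<psi(x)|psi(y)>|^2 for a product-state angle encoding.

    Assumption: each feature a is encoded on its own qubit as
    cos(a/2)|0> + sin(a/2)|1>, so the overlap factorizes into a product
    of cos((x_k - y_k) / 2) terms. This is NOT the ZZ feature map.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    overlap = 1.0
    for a, b in zip(x, y):
        overlap *= math.cos((a - b) / 2.0)
    return overlap * overlap


def gram_matrix(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric N x N matrix of pairwise state fidelities."""
    n = len(samples)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = angle_encoding_fidelity(samples[i], samples[j])
            gram[i][j] = value
            gram[j][i] = value
    return gram


if __name__ == "__main__":
    # Three toy samples with two rotation angles each (illustrative values only).
    for row in gram_matrix([[0.0, 0.0], [math.pi / 2, 0.0], [math.pi, 0.0]]):
        print([round(v, 3) for v in row])
```

A matrix of this shape is what a precomputed-kernel SVM consumes; in scikit-learn, for example, `SVC(kernel="precomputed")` takes the training Gram matrix in `fit`.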

37. 【2606.14172】Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Link: https://arxiv.org/abs/2606.14172

Authors: Sirui Zhang, Xu Wang, Zhengyu Wu, Xunkai Li, Hongchao Qin

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: model real-world entities, Multimodal Attributed Graphs, Attributed Graphs, Multimodal Attributed, coupling graph topology

Comments:

Abstract:Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative representations, and modality-centric tasks requiring fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations, causing task-agnostic propagation and over-compressed fusion that hinder diverse task requirements and modality-specific evidence preservation. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, complementing raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and decoupling shared and private representations. Thus, CoMAG produces graph and modality representations from one forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. Results show that CoMAG achieves the best reported performance, demonstrating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.

38. 【2606.14168】MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Link: https://arxiv.org/abs/2606.14168

Authors: Ruijie Xu, Xinnan Zhu, Jiayu Ying, Daoguo Dong, Yuzhou Ji, Xin Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital content creation, preserving non-target content, correcting existing scenes, content creation, embodied AI simulation

Comments:

Abstract:Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

39. 【2606.14162】VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

Link: https://arxiv.org/abs/2606.14162

Authors: Xunzhi Xiang, Zixuan Duan, Yabo Chen, Zhengxuan Wei, Guiyu Zhang, Zixiao Gu, Zhe Gao, Haibin Huang, Chi Zhang, Qi Fan, Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale video diffusion, causing geometric drift, Large-scale video, fail to preserve, drift and implausible

Comments:

Abstract:Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency by using explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, to define conditions, supervision, or reward signals, making the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, providing a more flexible and non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. VideoWeave project page at this https URL

40. 【2606.14153】Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic

Link: https://arxiv.org/abs/2606.14153

Authors: Qingping Zeng, Fei She

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: upstream VLM releases, small VLA transfers, policies typically inherit, encoder choice validated, VLM releases

Comments: 23 pages, 5 figures, 8 tables

Abstract:Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on $\pi_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45-56% MSE on the SmolVLA native tower, -50-52% on $\pi_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.

41. 【2606.14129】BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

Link: https://arxiv.org/abs/2606.14129

Authors: Duy Hoang Khuong, Tri Nguyen Minh, Ngu Huynh Cong Viet

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstruction-based anomaly detection, setting is challenging, industrial inspection, Reconstruction-based anomaly, attractive for industrial

Comments:

Abstract:Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2% mAD on MVTec AD, 80.7% mAD on VisA and 73.1% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.
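The inference step mentioned in this abstract, a teacher-student feature discrepancy pass, can be sketched as a per-location distance between two feature maps. The abstract does not state the distance or the pooling rule, so the cosine distance and the max-pooled image score below are assumptions (both are common choices in this family of methods), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def anomaly_map(
    teacher: Sequence[Sequence[float]], student: Sequence[Sequence[float]]
) -> List[float]:
    """Per-location anomaly score: how far the student feature is from the teacher's."""
    if len(teacher) != len(student):
        raise ValueError("feature maps must cover the same locations")
    return [cosine_distance(t, s) for t, s in zip(teacher, student)]


def image_score(scores: Sequence[float]) -> float:
    """Image-level score taken as the maximum location score (assumed pooling rule)."""
    return max(scores)


if __name__ == "__main__":
    teacher_features = [[1.0, 0.0], [0.0, 1.0]]
    student_features = [[1.0, 0.0], [1.0, 0.0]]  # second location disagrees
    scores = anomaly_map(teacher_features, student_features)
    print(scores, image_score(scores))
```

No labels, memory bank, or prototype lookup appear in this path, which matches the abstract's description of what inference does not need.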

42. 【2606.14125】Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing

Link: https://arxiv.org/abs/2606.14125

Authors: Zheyuan Zhan, Hongchen Li, Can Wang, Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen, Siwei Lyu, Defang Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Inversion-based image editing, image editing offers, editing offers flexible, Inversion-based image, offers flexible

Comments: Accepted to ECML PKDD 2026 Research Track

Abstract:Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components: (a) conditioning refinement, which constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation, and (b) token-wise cross-branch attention control, which separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation. Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available at this https URL.

43. 【2606.14106】Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents

Link: https://arxiv.org/abs/2606.14106

Authors: Seoyoung Choi, Minseok Ko, Hyunseok Lee, Kunwoong Kim, Woomin Song, Chanseok Jeon, Jinwoo Shin

Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, Graphical User, User Interface, automate complex computer, visual memory

Comments: 9 pages, 5 figures, ICML 2026 WORKSHOP

Abstract:Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates, or which failures it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, and increases hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), an action-grounded memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3% over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.
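The storage idea in this abstract (keep a crop of the local GUI region around an action instead of the full screenshot) reduces, at its simplest, to computing a crop box around the action point. The sketch below assumes a fixed-size crop centered on a click position and shifted to stay on screen; the abstract does not specify how crops are chosen, so `action_crop_box`, the centering policy, and the crop size are all hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def action_crop_box(
    x: int, y: int, screen_w: int, screen_h: int, crop_w: int, crop_h: int
) -> Box:
    """Crop box centered on the action point (x, y), shifted to fit the screen."""
    if screen_w <= 0 or screen_h <= 0 or crop_w <= 0 or crop_h <= 0:
        raise ValueError("screen and crop sizes must be positive")
    # A crop can never be larger than the screenshot itself.
    crop_w = min(crop_w, screen_w)
    crop_h = min(crop_h, screen_h)
    # Center on the action, then clamp so the box stays inside the screen.
    left = min(max(x - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(y - crop_h // 2, 0), screen_h - crop_h)
    return (left, top, left + crop_w, top + crop_h)


if __name__ == "__main__":
    print(action_crop_box(960, 540, 1920, 1080, 256, 256))
    print(action_crop_box(10, 10, 1920, 1080, 256, 256))
```

The returned tuple uses the (left, top, right, bottom) convention, so it can be passed to an image library's crop routine to produce the stored memory entry.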

44. 【2606.14096】A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

Link: https://arxiv.org/abs/2606.14096

Authors: Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Gu, Meng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low-amplitude subtle body, reveal latent intentions, subtle body movements, involuntary reactions, low-amplitude subtle

Comments: 10 pages, 9 figures

Abstract:Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at this https URL.

45. 【2606.14094】FEMOT: Multi-Object Tracking using Frame and Event Cameras

Link: https://arxiv.org/abs/2606.14094

Authors: Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: capture rich appearance, Conventional RGB cameras, Conventional RGB, RGB-event multi-object tracking, semantic information

Comments:

Abstract:Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on this https URL.

46. 【2606.14081】Clay-CNN Hybrids: Leveraging Geo-Foundational Models as Auxiliary Context for Landslide Detection

Link: https://arxiv.org/abs/2606.14081

Authors: Huong Binh Vu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: Rapid post-event landslide, extreme class imbalance, post-event landslide mapping, Rapid post-event, class imbalance

Comments: 9 pages, 7 figures, 2 tables

Abstract:Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geo-Foundational Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
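
The abstract reports F1 rather than accuracy, which matters when only about 2% of pixels are positive. The following is a minimal sketch on a made-up 100-pixel mask (the data and the helper name `f1_and_accuracy` are illustrative assumptions, not from the paper) showing how a model that predicts only background can score high accuracy while its F1 is zero.

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (F1, accuracy) for binary labels given as lists of 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, accuracy


# Hypothetical toy mask: 2 landslide pixels out of 100 (about 2% positive).
y_true = [1] * 2 + [0] * 98

# Prediction A: everything is background.
all_background = [0] * 100
f1_bg, acc_bg = f1_and_accuracy(y_true, all_background)

# Prediction B: finds one landslide pixel, misses one, adds one false alarm.
partial = [1, 0, 1] + [0] * 97
f1_partial, acc_partial = f1_and_accuracy(y_true, partial)

print(f"all-background: F1={f1_bg:.2f}, accuracy={acc_bg:.2f}")
print(f"partial:        F1={f1_partial:.2f}, accuracy={acc_partial:.2f}")
```

Both toy predictions reach the same accuracy, but only F1 separates them, which is why F1 is the more informative metric under this kind of class imbalance.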

47. 【2606.14072】Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

Link: https://arxiv.org/abs/2606.14072

Authors: Wentao Ke, Jianche Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Accurate pediatric brain, heterogeneous imaging phenotypes, limited annotated data, remains challenging due, diffuse tumor boundaries

Comments:

48. 【2606.14071】ShearFuse-UNet: Hadamard, DCT, and Shearlet Transform Fusion for Next-Day Wildfire Spread Prediction

Link: https://arxiv.org/abs/2606.14071

Authors: Ene Meco, Yingyi Luo, Emadeldeen Hamdan, Adam Watts, Ahmet Enis Cetin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-modal satellite data, computationally efficient deep, efficient deep learning, deep learning model, Discrete Cosine Transform

Comments:

Abstract:We propose ShearFuse-UNet, a lightweight and computationally efficient deep learning model for next-day wildfire spread prediction from multi-modal satellite data. The model integrates three complementary transform-domain branches inside each encoder block of a U-Net backbone: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch. The WHT and DCT branches establish orthogonal latent spaces with learnable spectral scaling and fixed soft-thresholding, while the Shearlet branch provides anisotropic, multi-directional feature decomposition that explicitly encodes the elongated edge structures characteristic of fire fronts. A learned SpectralFusion gate adaptively combines the WHT and DCT responses, and the Shearlet reconstruction is added as a residual. This three-branch design bears a loose structural analogy to transformer self-attention: the WHT and DCT branches provide complementary spectral representations that are adaptively fused, while the Shearlet branch contributes directional content through a residual pathway. Unlike self-attention, the proposed design relies on fixed mathematical transforms rather than learned projection operators, reducing parameter count and computational cost. Evaluated on the WildfireSpreadTS dataset, ShearFuse-UNet achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589) and demonstrating a highly favorable accuracy-efficiency trade-off. Results on the Google Next-Day Wildfire Spread dataset further validate these findings across a different benchmark.

49. 【2606.14049】FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Link: https://arxiv.org/abs/2606.14049

Authors: Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

Keywords: versatile audio synthesis, framework integrating multi-modal, frame-level temporal alignment, integrating multi-modal control, enabling synchronized

Comments: Accepted by INTERSPEECH 2026

Abstract:We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at this https URL.

50. 【2606.14048】WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Link: https://arxiv.org/abs/2606.14048

Authors: Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: recently shown promise, jointly modeling future, modeling future observations, executable robot actions, recently shown

Comments: 15 pages, 7 figures, 9 tables

Abstract:World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

51. 【2606.14042】Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

Link: https://arxiv.org/abs/2606.14042

Authors: Minghan Li, Jeremy Moebel, Mengyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: One-step image editing, text-guided editing fast, making text-guided editing, One-step image, easy to deploy

Comments: 9 pages

Abstract:One-step image editing is important for making text-guided editing fast, practical, and easy to deploy, but its underlying mechanism is still not fully understood. We revisit ChordEdit through reproduction, ablation, and simplification. Our analysis shows that a) the chord window $\delta$ largely acts as an effective timestep shift from $t$ to $t - \delta$; b) chord transport acts on high-noise images and mainly performs low-frequency semantic editing; and c) proximal alignment acts on low-noise images and complements it by adding high-frequency target details. In this view, ChordEdit naturally decomposes editing into a coarse low-frequency transport stage and a fine high-frequency alignment stage. These findings suggest a path toward prompt-conditioned dynamic timestep selection for adaptive image editing. All code and results can be found at this https URL.

52. 【2606.14035】Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Link: https://arxiv.org/abs/2606.14035

Authors: Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unintentionally affect non-target, limited fine-grained supervision, significantly advanced image, meticulous prompt engineering, affect non-target regions

Comments: ICCCI 2026. Project page: [this https URL](https://vdkhoi20.github.io/FocusDiff)

Abstract:Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at this https URL.

53. 【2606.14025】GarmentSketch: Large-scale Sketch-to-Fashion Benchmark

Link: https://arxiv.org/abs/2606.14025

Authors: Duong-Duy-Khang Bui, Minh-Tan Pham, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: allowing rapid visualization, creative concepts prior, allowing rapid, physical prototyping, rapid visualization

Comments: ICCCI 2026. Project page: [this https URL](https://khangbdd.github.io/garmentsketch)

Abstract:Fashion sketching is a cornerstone of design workflows, allowing rapid visualization of creative concepts prior to physical prototyping. Yet, progress in sketch-based fashion image synthesis has been hindered by the absence of large-scale, high-quality paired resources. To bridge this gap, we present GarmentSketch, a novel dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with detailed textual descriptions. Captions were produced through a multi-stage pipeline that integrates multiple multimodal large language models (MLLMs) with human-in-the-loop refinement, ensuring both semantic accuracy and descriptive richness. We benchmark GarmentSketch on state-of-the-art generative models, providing baseline performance for sketch-guided text-to-image generation. Our experiments reveal both the promise and the current limitations of existing methods. By offering a comprehensive and richly annotated resource, GarmentSketch establishes a foundation for advancing sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration in design. The dataset will be available at: this https URL.

54. 【2606.14024】ViT-Up: Faithful Feature Upsampling for Vision Transformers

Link: https://arxiv.org/abs/2606.14024

Authors: Krispin Wandel, Jingchuan Wang, Hesheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, providing exceptionally strong, visual representation learning, broadly reusable backbone, providing exceptionally

Comments: Code is available at: [this https URL](https://github.com/krispinwandel/vit-up)

Abstract:Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

55. 【2606.14010】RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Link: https://arxiv.org/abs/2606.14010

Authors: Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: modeling visual perception, shown strong potential, jointly modeling visual, visual perception, explainability and action

Comments:

Abstract:Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.

56. 【2606.14006】HARBOR: Heading Analysis and Reconstruction from Behavioral Observation and Radar

Link: https://arxiv.org/abs/2606.14006

Authors: Joao P. A. Dantas, Paulo F. Silva Filho, Jelton A. Cunha, Gabriel Dietzsch

Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: Automatic Identification System, Identification System, Maritime situational awareness, track vessel movements, Automatic Identification

Comments:

Abstract:Maritime situational awareness often relies on Automatic Identification System (AIS) transmissions to track vessel movements. However, in operational or conflict scenarios, these data may be unavailable due to signal loss, deliberate deactivation, or intentional spoofing. In such conditions, synthetic aperture radar (SAR) imagery becomes a critical sensing alternative for wide-area maritime monitoring, despite providing only static scene snapshots. This work introduces HARBOR (Heading Analysis and Reconstruction from Behavioral Observation and Radar), a complete pipeline for transforming a single SAR image into predictive motion information without requiring any auxiliary data source at inference time. The method begins with SAR image preprocessing to enhance and segment vessel candidates, followed by automatic detection, size-based classification, and heading estimation using skeleton geometry and local intensity patterns. AIS data are used exclusively during an offline calibration phase to derive vessel-type-dependent motion parameters, which are then applied to generate probabilistic heatmaps of candidate future vessel positions. A case study using real COSMO-SkyMed SAR imagery demonstrates the pipeline on a maritime scene in southern Brazil, showing its ability to extract motion tendencies and generate probabilistic projections of vessel positions in data-denied environments.

57. 【2606.14005】Context-Guided Semantic Alignment for Feature Fusion Networks

Link: https://arxiv.org/abs/2606.14005

Authors: Hyungseop Lee, Jiho Lee, Woochul Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aggregating multi-scale features, modern object detectors, aggregating multi-scale, varying sizes, modern object

Comments: 26 pages, 12 figures, 8 tables

Abstract:Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.

58. 【2606.13971】Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Link: https://arxiv.org/abs/2606.13971

Authors: Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov, Avalon Vinella, Xuan Zhang, Yanzhi Wang, Sergey Tulyakov, Anil Kag

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: specific visual effects, high-end video generation, specific visual, increasingly demanded, demanded for high-end

Comments:

Abstract:Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.
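
The "factorization ambiguity" mentioned in the abstract can be stated in a few lines. The notation below is an assumption for illustration (the abstract does not give symbols), and the square-root split of the singular values is one possible convention, not necessarily the one the authors use.

```latex
% A rank-r LoRA update is a product of two thin matrices:
\Delta W = B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}.

% The factors are not unique: for any invertible G \in \mathbb{R}^{r \times r},
(B G)\,(G^{-1} A) = B A = \Delta W,

% so many different (B, A) pairs encode the same update, which makes
% regressing raw factors an ill-posed target.

% The SVD gives a canonical representative of the same update:
\Delta W = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_r \ge 0),

% for example with the symmetric split
B^{\star} = U \Sigma^{1/2}, \qquad A^{\star} = \Sigma^{1/2} V^{\top}.
```

When the singular values are distinct, this representative is unique up to paired sign flips of the singular vectors, which removes most of the freedom that the arbitrary matrix $G$ introduces.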

59. 【2606.13964】CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

Link: https://arxiv.org/abs/2606.13964

Authors: Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sketch-based caricature synthesis, fundamental failure mode, caricature synthesis suffers, create destructive interference, Sketch-based caricature

Comments:

Abstract:Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as condition signal contamination -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.

60. 【2606.13929】Self-Evolving Visual Questioner

Link: https://arxiv.org/abs/2606.13929

Authors: Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Vision-language models, actively ask diverse, typically trained, trained as passive, ability to actively

Comments: 21 pages, including references and appendix. Project Page is available at [this https URL](https://joliang17.github.io/SelfEvolvingVQG/)

Abstract:Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.

61. 【2606.13911】Overhead Wildlife Locator (OWL): Benchmarking Weakly Supervised Learning for Aerial Wildlife Surveys

Link: https://arxiv.org/abs/2606.13911

Authors: Isai Daniel Chacón, Zhongqi Miao, Bruno Demuro, Caleb Robinson, Rahul Dodhia, Lasha Otarashvili, Jason Holmberg, Kirk Larsen, Howard Frederick, Nathan J. Pamperin, Pablo Arbeláez, Juan M. Lavista Ferres

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated aerial wildlife, require bounding-box annotations, standard object detectors, object detectors require, detectors require bounding-box

Comments: 16 pages, 4 figures, 3 tables

Abstract:Automated aerial wildlife surveys increasingly rely on deep learning, yet standard object detectors require bounding-box annotations, reported to be up to seven times slower and three times more expensive to produce than point-level labels. To address this bottleneck, we introduce the Overhead Wildlife Locator (OWL), a weakly supervised density-estimation framework with three variants: OWL-C, a fully convolutional model for high-throughput screening; OWL-T, a Swin-augmented hybrid for heterogeneous, cluttered scenes; and OWL-D, built on a frozen DINOv3 ViT-H+/16 encoder with a DPT-style fusion decoder. We benchmark all three against POLO, YOLOv11n, and YOLOv11l across five public aerial datasets, from sparse fixed-wing savanna surveys to dense UAV paddock imagery, and against the published HerdNet baseline on its native Delplanque split. OWL-D sets a new state of the art on Delplanque (0.934 AP vs. HerdNet's 0.840) and records the highest AP on four of the five datasets. Performance is regime-dependent: on the extreme-density SheepCounter UAV dataset the hybrid OWL-T leads (0.978 AP) and the convolutional variants attain the lowest counting error, whereas the foundation-based OWL-D degrades, indicating which variant suits which survey type. We further validate operational readiness on the Alaska Department of Fish and Game's 2022 Central Arctic Caribou census: under cross-herd and cross-temporal transfer, OWL-C fine-tuned on the 2017 Porcupine Caribou Herd split attains F1 = 0.965 on a held-out patch test set, with a signed count error of +3.1% aggregated across the released test patches. We release the OWL code, model weights, and the annotated Porcupine Caribou Herd 2017 (PCH) and Central Arctic Herd 2022 (CAH) patches, the first open patch-level datasets for large-scale caribou aerial surveys, at this https URL.
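
The abstract quotes a signed count error aggregated across test patches alongside an F1 score. The sketch below shows one plausible way such quantities are computed; the exact definitions used in the paper are not given in the abstract, so the formulas, function names, and toy counts here are assumptions for illustration only.

```python
def signed_count_error_pct(predicted_counts, true_counts):
    """Signed relative count error, aggregated over all patches, in percent.

    Positive values mean over-counting, negative values under-counting.
    """
    total_true = sum(true_counts)
    if total_true == 0:
        raise ValueError("true counts must sum to a positive number")
    return 100.0 * (sum(predicted_counts) - total_true) / total_true


def f1_score(tp, fp, fn):
    """F1 from matched detections (tp), false alarms (fp), and misses (fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-patch animal counts, not data from the paper.
predicted = [10, 23, 0]
actual = [9, 22, 1]
error_pct = signed_count_error_pct(predicted, actual)
print(f"signed count error: {error_pct:+.3f}%")
print(f"F1: {f1_score(tp=8, fp=1, fn=1):.3f}")
```

Aggregating before dividing lets over- and under-counts on individual patches cancel, so a small signed error does not by itself imply accurate per-patch counts; that is one reason to report it together with a detection metric such as F1.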

62. 【2606.13910】PMOF: A Dataset and Benchmark for Passenger Monitoring Using Overhead Fisheye Cameras

Link: https://arxiv.org/abs/2606.13910

Authors: Stella Katharina Wermuth, Qazi Arbab Ahmed, Klaus Neumann, Thorsten Jungeblut

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous staff-free public, transport requires reliable, requires reliable in-vehicle, Autonomous staff-free, reliable in-vehicle passenger

Comments: 6 pages, 7 figures. Accepted to the 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems (AVSS 2026)

Abstract:Autonomous staff-free public transport requires reliable in-vehicle passenger monitoring. However, perception inside moving vehicles is challenged by confined spaces, variable illumination, motion-induced background variation, occlusion, and limited viewpoints. To mitigate these spatial constraints, ceiling-mounted fisheye cameras provide full-scene coverage from a single viewpoint. Yet existing public overhead fisheye datasets are recorded in static environments and do not capture the domain shift introduced by vehicle motion. To fill this gap, we introduce PMOF, Passenger Monitoring using Overhead Fisheye cameras, the first public dataset of top-view fisheye imagery captured inside a moving vehicle, comprising over 19k manually annotated frames. PMOF provides rotated bounding boxes, tracking identifiers, and action labels, supporting object detection, tracking, and action recognition. We benchmark PMOF using YOLO26m-obb models fine-tuned under multiple dataset configurations that combine PMOF with existing overhead fisheye datasets. Cross-domain fine-tuning with custom rotation-aware augmentation achieves 94.8% AP50 on PMOF and 96.5% AP50 on an unseen overhead fisheye dataset from a different domain. Our results highlight the domain gap between static and moving environments and show that incorporating PMOF improves detection performance and advances generalization beyond passenger monitoring to broader fisheye-based person detection tasks. The dataset and code are available at this https URL.

63. 【2606.13898】HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Link: https://arxiv.org/abs/2606.13898

Authors: Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generative Fill buttons, Photoshop Remove, Photoshop and Lightroom, Fill buttons, Creative image editing

Comments: 14 pages, 10 figures, Patent filed

Abstract:Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose $\textbf{HiLo-Token}$, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.

64. 【2606.13896】How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

Link: https://arxiv.org/abs/2606.13896

Authors: Julia Romero,Qin Lv,Morteza Karimzadeh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Self-supervised geospatial foundation, remote sensing data, learn transferable representations, geospatial foundation models, Self-supervised geospatial

Comments:

Abstract:Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
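
For readers unfamiliar with CKA, the standard linear variant compares two activation matrices (rows are samples, columns are features) after centering each column. The sketch below uses that standard formula in plain Python; whether the paper uses the linear or a kernel variant is not stated in the abstract, so treat the choice as an assumption.

```python
def _center(X):
    """Subtract the column mean from every column of an n x p matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def _cross_sq_fro(X, Y):
    """Squared Frobenius norm of X^T Y for row-major lists of lists."""
    total = 0.0
    for xc in zip(*X):
        for yc in zip(*Y):
            d = sum(a * b for a, b in zip(xc, yc))
            total += d * d
    return total


def linear_cka(X, Y):
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = _center(X), _center(Y)
    num = _cross_sq_fro(Xc, Yc)
    den = (_cross_sq_fro(Xc, Xc) * _cross_sq_fro(Yc, Yc)) ** 0.5
    return num / den


# Usage: compare a layer's activations before and after a (toy) rescaling.
before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
after = [[2.0 * v for v in row] for row in before]
print(linear_cka(before, after))
```

A value near 1 means the two representations agree up to rotation and isotropic scaling; lower values indicate that fine-tuning changed the layer more.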

65. 【2606.13894】Gefen: Optimized Stochastic Optimizer

Link: https://arxiv.org/abs/2606.13894

Authors: Nadav Benedek,Tomer Koren,Ohad Fried

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern deep learning, states add roughly, moment states add, deep learning, modern deep

Comments:

66. 【2606.13886】PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

Link: https://arxiv.org/abs/2606.13886

Authors: Namai Chandra,Shriram Damodaran,Lin Wang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: mapping visual inputs, natural language instructions, language instructions directly, robotic control policies, models excel

Comments: 9 pages, 5 figures, supplementary material included

Abstract:Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.

67. 【2606.13872】Avatar V: Scaling Video-Reference Avatar Video Generation

Link: https://arxiv.org/abs/2606.13872

Authors: Benjamin Liang,Ce Chen,Desmond Lin,Ivan Somov,Jiajun Zhao,Jiewei Yuan,Jingfeng Zhang,Junhao Huang,Nik Nolte,Pedram Haqiqi,Penghan Wang,Rong Yan,Rui Zhang,Sam Prokopchuk,Sivan Wang,Viktor Goriachko,Yi Ren,Yuanming Li,Yutao Chen,Zhenhui Ye,Zhibin Hong,Zilong Nie,Zujin Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating avatar videos, Generating avatar, gestural tendencies, behaviorally recognizable, faithfully reproducing

Comments: 31 pages, 15 figures. All contributors are listed in alphabetical order by first name

Abstract:Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.

68. 【2606.13870】Mirage Probes: How Vision Models Fake Visual Understanding

Link: https://arxiv.org/abs/2606.13870

Authors: Daniel Ben-Levi,Judah Goldfeder,Weiliang Zhao,Raz Lapid,Amit LeVi,Allen G. Roush,Ravid Shwartz-Ziv,Hod Lipson

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: image-based questions confidently, Vision-language models, Vision-language, answer image-based questions, questions confidently

Comments:

Abstract:Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.

69. 【2606.13861】Temporal Backtracking Search for Test-time Generative Video Reasoning

Link: https://arxiv.org/abs/2606.13861

Authors: Sejoon Jun,Zheng Ding,Huangyuan Su,Weirui Ye,Yilun Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, reasoning remains bottlenecked, large language, remains bottlenecked, TBS

Comments:

Abstract:While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
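
The generate-verify-restart loop described above can be sketched with toy stand-ins. Everything here is hypothetical scaffolding rather than the authors' code: `generate` stands in for prefix-conditioned video generation, `verify` for temporal process verification, and the "video" is just a list of integers whose correct value at step t is t.

```python
def temporal_backtracking_search(generate, verify, budget):
    """Iterate generate -> verify -> restart from the longest verified prefix.

    generate(prefix) returns a full rollout that starts with `prefix`.
    verify(rollout) returns the index of the first invalid step, or None.
    """
    prefix = []
    for _ in range(budget):
        rollout = generate(prefix)
        first_bad = verify(rollout)
        if first_bad is None:
            return rollout
        # Keep the verified frames as the restart anchor instead of
        # resampling from the root.
        prefix = rollout[:first_bad]
    return None


HORIZON = 4


def toy_generate(prefix):
    """Toy generator: gets exactly one new step right, then goes wrong."""
    nxt = len(prefix)
    return prefix + [nxt] + [-1] * (HORIZON - nxt - 1)


def toy_verify(rollout):
    """Toy verifier: step t is valid only if its value equals t."""
    for t, frame in enumerate(rollout):
        if frame != t:
            return t
    return None


print(temporal_backtracking_search(toy_generate, toy_verify, budget=4))
```

Because each restart preserves verified progress, the toy problem is solved in as many rounds as the horizon, whereas root-level resampling with this generator would never produce a fully valid rollout.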

70. 【2606.13840】Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

Link: https://arxiv.org/abs/2606.13840

Authors: Senkang Hu,Zhengru Fang,Yihang Tao,Zihan Fang,Sam Tak Wu Kwong,Yuguang Fang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: isolated vehicle intelligence, Shared World Models, shifting from isolated, systems that share, multi-agent embodied systems

Comments:

Abstract:Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

71. 【2606.13839】Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

Link: https://arxiv.org/abs/2606.13839

Authors: Louis Chen,Torbjörn E. M. Nordling

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: transformers achieve low, decisions remain opaque, heart rate estimation, achieve low heart-rate, low heart-rate error

Comments: 26 pages, 8 figures

Abstract:Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque--a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e. trustworthy rPPG XAI.
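
Two of the quantities named in this abstract are simple enough to sketch directly. The code below is a plausible reading, not the paper's definition: it assumes skin coverage is the fraction of absolute attribution mass that falls inside a binary skin mask, and that the perturbation impact is a plain mean absolute error between two waveforms of equal length.

```python
def skin_coverage(attribution, skin_mask):
    """Fraction of absolute attribution mass located on skin pixels."""
    total = 0.0
    on_skin = 0.0
    for attr_row, mask_row in zip(attribution, skin_mask):
        for a, m in zip(attr_row, mask_row):
            a = abs(a)
            total += a
            if m:
                on_skin += a
    return on_skin / total if total > 0 else 0.0


def perturbation_impact(original_wave, perturbed_wave):
    """MAE between the original and perturbed predicted rPPG waveforms."""
    n = len(original_wave)
    return sum(abs(o - p) for o, p in zip(original_wave, perturbed_wave)) / n


# Usage on a 2x2 attribution map where the left column is skin.
attr = [[1.0, 1.0], [2.0, 0.0]]
skin = [[1, 0], [1, 0]]
print(skin_coverage(attr, skin))
print(perturbation_impact([1.0, 2.0, 3.0], [1.0, 4.0, 3.0]))
```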

72. 【2606.13809】Compressing Image Style Training into a Single Model Forward

Link: https://arxiv.org/abs/2606.13809

Authors: Zhongjie Duan,Yingda Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion-based style transfer, balance inference efficiency, Diffusion-based style, transfer must balance, balance inference

Comments: 11 pages, 9 figures

Abstract:Diffusion-based style transfer must balance inference efficiency with stylization fidelity. Adapter-based methods are efficient, but they inject style as an external condition and can either weaken reference-specific appearance or copy reference semantics into the generated image. Optimization-based personalization methods such as LoRA internalize style more effectively, but require a separate training process for every new style. We introduce i2L (image-to-LoRA), a framework that amortizes style LoRA training into a single forward pass. Given one or more reference images, i2L predicts LoRA weights for a text-to-image model, enabling immediate style instantiation without per-style optimization. The architecture combines an image encoder, learnable LoRA queries, and compressed decoding heads that generate adapted matrices. Training on semantically diverse style pairs encourages the predictor to preserve appearance cues while suppressing reference-content copying. Experiments on Z-Image, FLUX.2, and Hidream-O1 show that i2L improves style fidelity, prompt alignment, and perceptual quality over existing baselines. Because i2L produces explicit LoRA weights, it also supports asymmetric classifier-free guidance, multi-reference style fusion, and composition with controllable-generation modules.
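
Because i2L outputs explicit LoRA weights, applying a predicted style amounts to the usual low-rank update W' = W + scale * B A. The sketch below shows only that final merge step with hand-written matrices; the predictor that would produce `B` and `A` from reference images is not modeled, and the `scale` parameter is an assumed convention.

```python
def matmul(A, B):
    """Multiply two matrices given as row-major lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def apply_lora(W, B, A, scale=1.0):
    """Merge a rank-r update into a base weight: W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Usage: a 2x2 base weight with a rank-1 update (B is 2x1, A is 1x2).
W = [[0.0, 0.0], [0.0, 0.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(apply_lora(W, B, A, scale=0.5))
```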

73. 【2606.13769】$μ_0$: A Scalable 3D Interaction-Trace World Model

Link: https://arxiv.org/abs/2606.13769

Authors: Seungjae Lee,Yoonkyo Jung,Jusuk Lee,Jonghun Shin,Amir Hossein Shahidzadeh,Yao-Chih Lee,H. Jin Kim,Jia-Bin Huang,Furong Huang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: induce physical change, physical change enable, actions induce physical, induce physical, physical change

Comments:

Abstract:World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.

74. 【2606.13768】CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Link: https://arxiv.org/abs/2606.13768

Authors: Sharath Girish,Tsai-Shien Chen,Zhikang Dong,Mukesh Singhal,Hao Chen,Sergey Tulyakov,Aliaksandr Siarohin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deliberate camera movement, video depicts multiple, depicts multiple subjects, captured with deliberate, Cinematic video depicts

Comments: Project page: [this https URL](https://snap-research.github.io/CineOrchestra)

Abstract:Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.

75. 【2606.13736】Connections Between Pairs of Filters Improve the Accuracy of Convolutional Neural Networks

Link: https://arxiv.org/abs/2606.13736

Authors: Kathleen Anderson,Philipp Grüning,Erhardt Barth

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: newly invented architectures, stacking convolutional blocks, improved network structures, researchers continue, continue to find

Comments: IJCNN 2023

Abstract:While researchers continue to find new and improved network structures for CNNs, most of the newly invented architectures still rely on the traditional pattern of stacking convolutional blocks and separating them with pointwise activation functions. However, there are drawbacks to a network purely building on pointwise nonlinearities. One alternative is to introduce a pairwise connection between two filters of a network. Typical connection functions use multiplications or the minimum operation to realize logical AND connections. In this paper, we go one step further by demonstrating that CNNs can benefit from more general connections, which include parameters that are learned. With such parameters, the network is able to implement different connections in different network layers and better adapt the connection function to the task at hand.
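
The fixed connection functions mentioned in the abstract (multiplication and minimum as soft AND operations on two filter responses) are easy to write down. The parameterized version below is purely a hypothetical illustration of what "a connection with a learned parameter" could look like: the blend weight `alpha` and its form are assumptions, not the function proposed in the paper.

```python
def and_product(a, b):
    """Multiplicative AND: large only when both responses are large."""
    return a * b


def and_min(a, b):
    """Minimum-based AND: bounded by the weaker of the two responses."""
    return min(a, b)


def learned_connection(a, b, alpha):
    """Hypothetical learnable blend of the two fixed connections."""
    return alpha * and_product(a, b) + (1.0 - alpha) * and_min(a, b)


def connect_maps(map_a, map_b, alpha):
    """Apply the pairwise connection elementwise to two 2D feature maps."""
    return [
        [learned_connection(a, b, alpha) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


# Usage: combine the outputs of two filters at each spatial position.
print(connect_maps([[2.0, 0.0]], [[3.0, 5.0]], alpha=0.5))
```

With a separate `alpha` per layer, different layers could settle on different connection behavior, which is the kind of flexibility the abstract argues for.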

76. 【2606.13723】Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

Link: https://arxiv.org/abs/2606.13723

Authors: Pengfei Liu,Yuhan Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: visual detection models, directly determines, detection models, positive sample, positive sample sets

Comments:

Abstract:Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEUDET and GC10-DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.
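
A small numeric example makes the IoU insensitivity concrete: two candidates can have the same IoU with a ground-truth box while differing clearly in shape. The similarity definitions below (min/max ratios for area, width and height, and aspect ratio, averaged) are assumptions chosen for illustration; the abstract does not give the paper's exact formulas.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def _ratio(p, q):
    """Symmetric similarity in (0, 1]: smaller value over larger value."""
    return min(p, q) / max(p, q)


def morphology_score(a, b):
    """Mean of assumed area, shape, and aspect-ratio similarities."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    area_sim = _ratio(wa * ha, wb * hb)
    shape_sim = 0.5 * (_ratio(wa, wb) + _ratio(ha, hb))
    aspect_sim = _ratio(wa / ha, wb / hb)
    return (area_sim + shape_sim + aspect_sim) / 3.0


gt = (0, 0, 30, 30)
squashed = (0, 0, 30, 15)   # half the height of the ground truth
shifted = (10, 0, 40, 30)   # same size as the ground truth, offset in x

for name, box in [("squashed", squashed), ("shifted", shifted)]:
    print(name, iou(gt, box), morphology_score(gt, box))
```

Both candidates score an IoU of 0.5, but only the shifted box matches the ground truth's size and proportions, so a supplementary morphology term separates them.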

77. 【2606.13714】TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Link: https://arxiv.org/abs/2606.13714

Authors: Duc Nguyen,Sieu Tran,Hao Vo,Khoa Vo,Duy Minh Ho Nguyen,Nghi D. Q. Bui,Anh Nguyen,Long Mai,Ngan Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unsupervised video object-centric, object-centric learning aims, decompose dynamic scenes, video object-centric learning, temporally persistent entity

Comments:

Abstract:Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
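
The two uses of the activation score can be sketched in a few lines. This is one plausible instantiation under stated assumptions, not the paper's formulation: the gated update is written as a convex combination of the previous and candidate slot states, and the additive attention bias is taken to be log(alpha), which drives an inactive slot's decoder weight toward zero after softmax.

```python
import math


def gated_update(prev_slot, cand_slot, alpha):
    """Blend the candidate state with the previous state using activation alpha."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_slot, cand_slot)]


def decoder_weights(logits, alphas, eps=1e-9):
    """Softmax over slots after adding an assumed log(alpha) bias to each logit."""
    biased = [l + math.log(a + eps) for l, a in zip(logits, alphas)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]


# Usage: slot 0 is active, slot 1 belongs to an occluded object.
prev, cand = [1.0, 1.0], [5.0, 5.0]
print(gated_update(prev, cand, alpha=0.5))
print(decoder_weights([0.0, 0.0], [1.0, 1e-6]))
```

When alpha is near zero the slot keeps its previous state and contributes almost nothing to reconstruction, which addresses both failure pathways named in the abstract.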

78. 【2606.13707】Orchestra-o1: Omnimodal Agent Orchestration

Link: https://arxiv.org/abs/2606.13707

Authors: Fan Zhang,Vireo Zhang,Shengju Qian,Haoxuan Li,Hao Wu,Jinyang Wu,Donghao Zhou,Zhihong Zhu,Zheng Lian,Xin Wang,Pheng-Ann Heng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language model, language model, highlighting the importance, recent success, swarms has shifted

Comments:

79. 【2606.14568】Trimodal Glioma Representation Alignment via Volumetric Contrastive Learning

Link: https://arxiv.org/abs/2606.14568

Authors: Denise Marini,Eleonora Grassucci,Danilo Comminiello

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: heterogeneous information collected, biological scales, require the integration, integration of heterogeneous, heterogeneous information

Comments:

Abstract:Glioma grading and survival prediction require the integration of heterogeneous information collected at different spatial and biological scales. Histopathology describes tissue morphology, mRNA expression captures molecular activity, and magnetic resonance imaging provides a non-invasive view of tumor extent and radiological heterogeneity. Existing glioma prognosis models often combine only two of these sources, while their alignment objectives remain mostly pairwise. This paper introduces GLORIA, a novel trimodal framework for GLioma Omics - Radiology - hIstopathology Alignment. GLORIA processes whole-slide image regions, gene-expression profiles, and 3D MRI volumes through modality-specific encoders, projects them into a shared latent space, and aligns them with a Gramian contrastive loss that measures the volume spanned by the three modality embeddings. The aligned representations are fused through a cross-modal gating module and optimized jointly for three-class glioma grading and overall survival prediction. We evaluate GLORIA on a matched TCGA-GBM/LGG and BraTS21 cohort, comprising 132 patients with all three modalities. On the shared trimodal test set, GLORIA improves over the bimodal WSI-mRNA baseline in all the metrics considered.
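
The "volume spanned by the three modality embeddings" has a standard closed form: the square root of the determinant of their Gram matrix. The sketch below computes only that quantity for unit-normalized vectors; how GLORIA turns it into a contrastive loss is not specified in the abstract, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = dot(v, v) ** 0.5
    return [x / n for x in v]


def gram_volume(e1, e2, e3):
    """Volume of the parallelotope spanned by three unit-normalized embeddings."""
    vs = [normalize(e) for e in (e1, e2, e3)]
    g = [[dot(a, b) for b in vs] for a in vs]
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    return max(det, 0.0) ** 0.5


# Usage: mutually orthogonal embeddings vs. perfectly aligned embeddings.
print(gram_volume([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]))
print(gram_volume([1.0, 2.0, 2.0], [1.0, 2.0, 2.0], [1.0, 2.0, 2.0]))
```

A volume near zero means the histopathology, omics, and MRI embeddings of a patient point in nearly the same direction, so a trimodal objective can treat small volume as high agreement without reducing to pairwise terms.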

80. 【2606.14248】Spectrum Aware Illumination Estimation Using Multispectral Image

Link: https://arxiv.org/abs/2606.14248

Authors: Hyejin Oh,Woo-Shik Kim,Sangyoon Lee,YungKyung Park,Je-Won Kang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: conventional RGB imaging, RGB imaging, conventional RGB, illuminant spectrum estimation, improving illuminant spectrum

Comments: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). DOI: [https://doi.org/10.1109/TCSVT.2026.3701975](https://doi.org/10.1109/TCSVT.2026.3701975)

Abstract:Multispectral (MS) imaging extends beyond conventional RGB imaging by capturing more spectral bands, thereby improving illuminant spectrum estimation (ISE). However, existing methods often fail to fully exploit spectral information, resulting in suboptimal performance under diverse lighting conditions and across different sensor domains. Hence, we propose a deep learning framework with a spatio-spectral feature extraction block, which incorporates spectral attention mechanisms to enhance spectral correlation and preserve illuminant-relevant spatial features. Through the inclusion of an illuminant prior (IP), our approach prioritizes specific channels that provide more meaningful information in an MS image. We also propose a spectral-domain transform across different MS sensor spaces. The results demonstrate that illuminant spectra learned in high-dimensional sensor spaces can be effectively transformed to various lower-dimensional camera sensor spaces without any additional training. To facilitate evaluation, we introduce a real-world MS dataset containing high-dimensional ground-truth illumination spectra captured under diverse lighting conditions. Through extensive experiments, we demonstrate that our method achieves superior accuracy compared to existing models, thus providing a practical solution for real-world ISE. The code and dataset are available at this https URL.

81. 【2606.13957】High-Fidelity Video Compression based on Invertible Neural Transform and Implicit Conditioning

Link: https://arxiv.org/abs/2606.13957

Authors: Siyue Teng,Ho Man Kwan,Yuxuan Jiang,Fan Zhang,David Bull

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: recently achieved competitive, achieved competitive rate-distortion, Learning-based video compression, competitive rate-distortion performance, rate-distortion performance compared

Comments:

Abstract:Learning-based video compression has recently achieved competitive rate-distortion performance compared to conventional video codecs. However, most existing methods rely on non-invertible analysis-synthesis transforms, with reconstruction quality subject to both quantization and transform approximation errors. This limitation becomes particularly restrictive at higher quality points, where quantization errors are small and transform-induced distortion dominates. To address this, we propose InnVC, an Invertible neural network based Video Codec for wide-range and high-fidelity compression. The core idea is to preserve an invertible main transform path prior to quantization, while injecting content-adaptive context through a compact implicit conditioning field. This decouples strongly correlated video content from harder-to-model fine details, allowing different components to specialize in complementary reconstruction tasks for more efficient compression. To further improve compressibility, we introduce a scheduled masking strategy that progressively concentrates informative content into fewer latent channels for more effective entropy coding. Experiments on the UVG and MCL-JCV benchmarks show that InnVC achieves strong compression performance over a broad quality range, being particularly effective in the high-quality regime, yielding BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM relative to x265 on UVG. To the best of our knowledge, InnVC is the first neural video codec that covers operating points from low bitrate to high fidelity within a single architecture scale, spanning more than 20 dB in PSNR.
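
Why an invertible transform removes transform-induced distortion can be seen with a generic additive coupling layer, a common building block of invertible networks. This is a textbook construction used here for illustration only; the abstract does not say which invertible layers InnVC uses, and `shift` is a hypothetical stand-in for a learned sub-network.

```python
def shift(x1):
    """Stand-in for a learned sub-network; it does not need to be invertible."""
    return [3 * v + 1 for v in x1]


def coupling_forward(x):
    """Additive coupling: y1 = x1, y2 = x2 + shift(x1). Expects even length."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + s for b, s in zip(x2, shift(x1))]


def coupling_inverse(y):
    """Exact inverse: x1 = y1, x2 = y2 - shift(y1)."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [b - s for b, s in zip(y2, shift(y1))]


# Usage: without quantization, the round trip reproduces the input exactly.
x = [1, 2, 3, 4]
y = coupling_forward(x)
print(y, coupling_inverse(y))
```

In such a design the only source of reconstruction error is whatever is applied to the latent between the forward and inverse passes (for a codec, quantization), which matches the abstract's motivation for the high-quality regime.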

82. 【2606.13919】GMN4AD: Graph Matching Network for Alzheimer's Disease Diagnosis with Test-Time Domain Adaptation using Multi-centered Structure Magnetic Resonance Imaging

Link: https://arxiv.org/abs/2606.13919

Authors: Chen Zhao, Huan Huang, Yixin Xie, Jiajing Huang, Weihua Zhou, Nandakumar Narayanan

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: progressive neurodegenerative disorder, Alzheimer Disease Diagnosis, Alzheimer Disease, Magnetic Resonance Imaging, older adults

Comments:

Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer's Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that incorporates contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.
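
The abstract does not say how GMN4AD computes cross-graph relationships. One common formulation in the graph matching network literature is cross-graph attention: each node in one graph attends over the nodes of the other, and the difference between its embedding and the attention-weighted match serves as a mismatch signal. The sketch below shows that generic idea only. The function names, the dot-product similarity, and the 2-D embeddings are assumptions, not details from the paper.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_graph_mismatch(nodes_a, nodes_b):
    # For each node in graph A, attend over graph B's nodes and return
    # the attention weights and (embedding - attention-weighted match).
    attention, mismatch = [], []
    for h_i in nodes_a:
        weights = softmax([dot(h_i, h_j) for h_j in nodes_b])
        matched = [sum(w * h_j[d] for w, h_j in zip(weights, nodes_b))
                   for d in range(len(h_i))]
        attention.append(weights)
        mismatch.append([a - m for a, m in zip(h_i, matched)])
    return attention, mismatch

# Hypothetical node embeddings for brain regions of two subjects.
graph_a = [[1.0, 0.0], [0.0, 1.0]]
graph_b = [[1.0, 0.0], [1.0, 0.0]]
attention, mismatch = cross_graph_mismatch(graph_a, graph_b)
```

In this example the first node of `graph_a` has an exact counterpart in `graph_b`, so its mismatch vector is zero, while the second node has none and keeps a non-zero mismatch.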

83. 【2606.13700】C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Link: https://arxiv.org/abs/2606.13700

Authors: Phuc Nguyen H

Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

Keywords: promising technology owing, utilizing wireless WiFi, Human pose estimation, wireless WiFi signals, privacy preservation

Comments:

Abstract: Human pose estimation (HPE) utilizing wireless WiFi signals has emerged as a promising technology owing to its device-free nature, privacy preservation, and robustness against occlusion and poor lighting. However, existing methods often overlook the physical complex phase information of WiFi signals and fail to generalize across diverse environments due to severe domain shifts. In this paper, we present C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework for robust cross-environment WiFi-based 3D HPE. Our framework first sanitizes raw WiFi Channel State Information (CSI) phase errors and constructs a phase-preserving complex-valued representation. We then employ a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field to capture fine-grained phase dynamics. A cross-attention joint-query mapper maps the unstructured sequence tokens to human joints, which are decoded by a Graph Convolutional Network (GCN) to predict anatomically coherent 3D coordinates. Extensive evaluations on the MM-Fi dataset show that C-MambaPose achieves competitive or superior performance to state-of-the-art baselines across all settings and sets a new state of the art on the challenging cross-environment split. It requires only 3.78 M parameters, an 83.1% reduction compared to GraphPose-Fi and an 85.7% reduction compared to MetaFi++, while remaining comparable in size to DT-Pose (which is only 18% smaller) and achieving significantly better performance without requiring any pretraining. Our code is publicly available at this https URL.
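
The abstract mentions sanitizing CSI phase errors without describing the procedure. A widely used calibration in WiFi sensing is to unwrap the phase across subcarriers and subtract a fitted line, which removes the offset and slope that timing and frequency errors add. The sketch below shows that generic technique on synthetic data. It is an assumption about what sanitization could look like, not the C-MambaPose pipeline, and the subcarrier count and error values are made up.

```python
import math

def unwrap(phases):
    # Remove 2*pi jumps between neighbouring subcarriers.
    out = [phases[0]]
    for p in phases[1:]:
        d = p - out[-1]
        d -= 2 * math.pi * round(d / (2 * math.pi))
        out.append(out[-1] + d)
    return out

def detrend(phases):
    # Subtract the least-squares line fitted over the subcarrier index.
    n = len(phases)
    mean_k = (n - 1) / 2
    mean_p = sum(phases) / n
    slope = (sum((k - mean_k) * (p - mean_p) for k, p in enumerate(phases))
             / sum((k - mean_k) ** 2 for k in range(n)))
    return [p - mean_p - slope * (k - mean_k) for k, p in enumerate(phases)]

def sanitize(wrapped):
    return detrend(unwrap(wrapped))

# Synthetic example: a smooth "true" phase plus a linear hardware error, then wrapped.
n = 30
true_phase = [0.4 * math.sin(2 * math.pi * k / n) for k in range(n)]
measured = [math.atan2(math.sin(t + 0.25 * k + 1.0), math.cos(t + 0.25 * k + 1.0))
            for k, t in enumerate(true_phase)]
residual_gap = max(abs(a - b)
                   for a, b in zip(sanitize(measured), detrend(true_phase)))
```

Here `residual_gap` compares the sanitized measurement with the detrended true phase. The two agree up to floating-point rounding because the injected error is exactly linear in the subcarrier index.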