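This post lists the latest papers collected daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
445 papers were updated today, including:
- Natural language processing: 58
- Information retrieval: 13
- Computer vision: 106
Natural Language Processing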
1. 【2606.28277】Towards Automating Scientific Review with Google's Paper Assistant Tool
Link: https://arxiv.org/abs/2606.28277
Authors: Rajesh Jayaram, Drew Tyler, David Woodruff, Corinna Cortes, Yossi Matias, Vahab Mirrokni, Vincent Cohen-Addad
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: mathematical theorem proving, theorem proving, driving a revolution, hypothesis generation, Artificial intelligence
Abstract:Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy consisting of four progressive levels of AI-human collaboration in scientific evaluation, and discuss various trade-offs involved with each. As a step toward this future, we introduce the Paper Assistant Tool (PAT), an agentic AI framework built for deep scientific review and verification. PAT ingests full scientific manuscripts and produces a comprehensive evaluation, checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws. By utilizing inference scaling techniques, PAT is able to identify deeper issues than a single model call alone, achieving a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark. Pilot deployments of PAT as a pre-submission tool for authors at two major Computer Science conferences -- STOC and ICML -- demonstrate its ability to identify critical errors and suggest substantive improvements to research papers. By catching errors early, PAT eases the cognitive burden placed on referees, while preserving their control over the outcomes of the review process.
2. 【2606.28273】Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
Link: https://arxiv.org/abs/2606.28273
Authors: Niclas Lietzow, Danielle Bitterman, Carsten Eickhoff, William Rudman, Michal Golovanevsky
Categories: Computation and Language (cs.CL)
Keywords: reconcile visual evidence, Vision-language models, evidence with memorized, Vision-language, memorized world knowledge
Comments: 14 pages, 11 figures, 8 tables
Abstract:Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally without a component-level causal account. We combine activation patching across three granularities (residual stream, attention heads, and MLP sublayers) with model-component ablation studies and mechanistic analysis. Across three VLM families, we find that visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge (e.g., "red" for a strawberry) despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales, revealing a sparse causal circuit underlying perception-knowledge conflict in VLMs.
3. 【2606.28186】Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction
Link: https://arxiv.org/abs/2606.28186
Authors: Chenguang Wang, Ming Li, Xinyue Zeng, Zhuochun Li, Hong Jiao, Tianyi Zhou, Dawei Zhou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Keywords: effective test construction, Predicting human item, reliable estimates support, estimates support fairness, Predicting human
Comments: 32 pages, 8 figures, 10 tables
Abstract:Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes that make items difficult. We argue that difficulty should be viewed not only as a property of item text, but also as an observable consequence of the problem-solving burden an item induces. Large Reasoning Models (LRMs) offer scalable process evidence through reasoning traces, but such evidence must be structured to support interpretable modeling. To this end, we introduce Epi2Diff (Episode to Difficulty), a framework that maps LRM reasoning traces into cognitively grounded episode sequences. These episodes group trace segments into functional problem-solving states, enabling difficulty to be modeled through reasoning scale, effort allocation, and state transitions. Epi2Diff extracts compact episode-dynamic features and combines them with semantic item representations for human difficulty prediction. Experiments on four real-world human difficulty datasets show that Epi2Diff consistently outperforms strong baselines, including fine-tuned small language models, LLM in-context learning, and supervised LLM adaptation. On SAT-derived classification benchmarks, Epi2Diff achieves an 8.1% average relative gain over supervised LLM fine-tuning baselines. Further analyses show that harder items induce more effortful, iterative, and implementation-centered episode dynamics, rather than merely longer responses. These results demonstrate that cognitive episodes in LRM reasoning traces provide a predictive and interpretable process representation for human item difficulty, offering a new lens for educational measurement with reasoning models.
4. 【2606.28127】From Tokens to States: LLMs as a Special Case of World Models and the Continuous Path Beyond
Link: https://arxiv.org/abs/2606.28127
Authors: Paul Dubois
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: models simulate reality, large language models, world models simulate, world models, simulate reality
Comments: 10 pages, 6 figures, 1 table
Abstract:The AI community has framed the relationship between large language models (LLMs) and world models as a dichotomy: LLMs predict tokens; world models simulate reality. Yann LeCun argues in 2022 that reaching general intelligence requires abandoning autoregressive token prediction in favour of latent-space architectures. This framing is unnecessarily binary. Two claims will be defended. First, LLMs are a degenerate special case of world models: the state space is the set of all token sequences, the only action is appending one token, and world models are therefore a strict generalisation of LLMs, not a replacement. Second, there is a natural continuous spectrum from NTP to JEPA, with multi-token prediction, future-summary prediction, and next-latent prediction as intermediate stations already populated by current research. Moving along this spectrum relaxes the LLM constraints one by one. It also progressively surrenders the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised data, and a transformer architecture co-designed for discrete token prediction. Both are examined as open research questions: the data question (the cliff from self-supervised text to instrumented action-labelled environments) and the architecture question (whether the transformer generalises to continuous-state prediction, or whether a new primitive is needed).
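The first claim in the abstract above (the state space is the set of all token sequences, and the only action is appending one token) can be pictured with a minimal sketch. The names `State`, `step`, `rollout`, and `toy_policy` are illustrative assumptions, not taken from the paper, and the toy policy merely stands in for a real next-token predictor.

```python
from typing import Callable, Tuple

# Assumption for illustration: a state is the full token sequence so far.
State = Tuple[str, ...]


def step(state: State, action: str) -> State:
    """Deterministic transition: the only action is appending one token."""
    return state + (action,)


def rollout(state: State, policy: Callable[[State], str], n_steps: int) -> State:
    """Apply the policy repeatedly, as autoregressive decoding would."""
    for _ in range(n_steps):
        state = step(state, policy(state))
    return state


def toy_policy(state: State) -> str:
    # Hypothetical stand-in for next-token prediction: repeat the last token.
    return state[-1]


final_state = rollout(("the", "cat"), toy_policy, 2)
print(final_state)
```

In this picture, a more general world model would replace the tuple of tokens with a richer state and allow actions other than appending a token.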
5. 【2606.28116】Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability
Link: https://arxiv.org/abs/2606.28116
Authors: Ruixuan Huang, Yipei Wang, Wenyi Fang, Hantao Huang, Yifan Huang, Ansheng You, Zhenxing Zhang, Shuai Wang, Fan Wu, Yang Zheng
Categories: Computation and Language (cs.CL)
Keywords: long wall-clock computation, Frontier large language, consumes massive accelerator, massive accelerator fleets, making stability failures
Abstract:Frontier large language model training consumes massive accelerator fleets and long wall-clock computation, making stability failures costly when they occur. After a numerical or a hyperparameter fault has already destabilized the training dynamics, it may continue for thousands of steps while loss and gradient norms still appear normal. We study mechanism-driven detection of training instability by deriving internal monitors from the functional role of each critical module and from the earliest computational sites where failures are expected to produce measurable signatures. For low-precision flash attention, we monitor the spectral entropy of a QK bilinear decomposition, whose first-order term becomes abnormal before the loss fully collapses. For MoE routers, we derive indicators from their role in expert selection. Our fault-injection experiments on low-precision attention, large learning-rate, and combined faults show that these signals provide distinct signatures for different failures, triggering thousands of steps before loss divergence.
6. 【2606.28062】Single and Multi Truth Data Fusion using Large Language Models
Link: https://arxiv.org/abs/2606.28062
Authors: Hira Beril Kucuk, Norman W Paton, Jiaoyan Chen, Zhenyu Wu
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: data integration problem, Data fusion tasks, Large Language Models, Data fusion, integration problem
Abstract:Data fusion, also known as truth discovery, is a data integration problem that aims to determine the correct value or set of values for each attribute of an object when presented with potentially conflicting values from multiple sources. Data fusion tasks belong to two main categories: single-truth scenarios, where each attribute has only one correct value, and multi-truth scenarios, where multiple values can be valid simultaneously. This paper investigates the use of Large Language Models (LLMs) in data fusion tasks for tabular data. Various prompting strategies, encompassing both single-truth and multi-truth scenarios, are investigated empirically. Domain-dependent, domain-independent, zero-shot and one-shot prompts are evaluated on three different benchmark datasets. Experimental results demonstrate that LLM-based approaches outperform traditional unsupervised truth discovery methods, such as DART and LTM, across all datasets. The codebase of this study has been made publicly available on GitHub.
7. 【2606.28057】MultiHashFormer: Hash-based Generative Language Models
Link: https://arxiv.org/abs/2606.28057
Authors: Huiyin Xue, Atsuki Yamaguchi, Nikolaos Aletras
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Language models, embedding matrices, Language, hash, represent tokens
Comments: Under review
Abstract:Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
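The idea of a hash signature described above (a short sequence of discrete hash IDs produced by several independent hash functions) can be sketched as follows. This is only an illustration under stated assumptions: the use of salted SHA-256, the values of `NUM_HASHES` and `NUM_BUCKETS`, and the lookup-table decoding are all hypothetical choices, not the paper's Hash Encoder or Hash Decoder.

```python
import hashlib
from typing import Dict, Optional, Tuple

NUM_HASHES = 4     # assumed signature length
NUM_BUCKETS = 256  # assumed number of IDs per hash function


def hash_signature(token: str) -> Tuple[int, ...]:
    """Map a token to NUM_HASHES bucket IDs using salted SHA-256 digests."""
    ids = []
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
        ids.append(int.from_bytes(digest[:8], "big") % NUM_BUCKETS)
    return tuple(ids)


# Toy vocabulary; a single hash ID may collide across tokens,
# but the full signature is what identifies a token.
vocab = ["the", "cat", "sat", "on", "mat"]
signature_to_token: Dict[Tuple[int, ...], str] = {
    hash_signature(t): t for t in vocab
}


def decode(signature: Tuple[int, ...]) -> Optional[str]:
    """Map a predicted signature back to text, or None if it is unknown."""
    return signature_to_token.get(signature)


print(hash_signature("cat"), decode(hash_signature("cat")))
```

In this sketch an embedding table would need only `NUM_HASHES * NUM_BUCKETS` rows regardless of vocabulary size, which is one way to picture the constant parameter footprint the abstract mentions.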
8. 【2606.28050】Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA
Link: https://arxiv.org/abs/2606.28050
Authors: Sambaran Bandyopadhyay
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: pipelines implicitly assume, implicitly assume, self-evaluation pipelines implicitly, generation, pipelines implicitly
Comments: 18 pages
Abstract:LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model judges the answer it generated, removing the parametric-knowledge confound of open-domain comparisons. Across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, evaluation is not uniformly easier: generation accuracy exceeds self-evaluation on three of four, with multi-hop MuSiQue the exception. Attention analysis reveals why: evaluation attends to context 3--5x less than generation does and barely reads the candidate answer. LoRA fine-tuning confirms the asymmetry is not a training artifact: generation fine-tuning induces over-acceptance and evaluation fine-tuning degrades generation. These findings challenge core assumptions in self-evaluation pipelines.
9. 【2606.28048】DG^VoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions
Link: https://arxiv.org/abs/2606.28048
Authors: Muhammad Shakeel Akram, Amal Htait, Abdul Hamid Sadka, Emma Meisingseth, Karishma Jaitly
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: Insurance fraud remains, begin at FNOL, customer interactions begin, Insurance fraud, fraud remains costly
Comments: 5 pages, 4 figures, 1 table
Abstract:Insurance fraud remains costly and operationally difficult, particularly in call-centre workflows where many customer interactions begin at FNOL. While recent fraud detection methods mainly rely on structured data, text, or images, repeated speaker identity across calls remains underused as an investigative signal. This paper presents DG^VoiC, a voice clustering framework for customer verification and cross-profile speaker linking on anonymised real call-centre audio. The approach combines sensitive information-aligned anonymisation, speech-focused preprocessing, sliding-window speaker embedding extraction, and cosine similarity based clustering to identify repeated speakers under real telephony conditions. The method was evaluated on 121 recordings, with a curated reference subset of 56 samples in 22 human-agreed speaker clusters used for validation. The best configuration achieved 96% AMI, 95% ARI, 98% completeness, 100% homogeneity, and 99% V-measure. These results show that speaker clustering can provide a strong additional signal for fraud investigation by helping analysts verify speaker consistency and surface repeated voices across customers.
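As a rough picture of cosine-similarity-based clustering of speaker embeddings, here is a minimal greedy sketch. The abstract does not specify the clustering algorithm, so the greedy assignment rule, the `threshold` value, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_cluster(embeddings: List[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Assign each embedding to the first cluster whose representative
    (its first member) is similar enough; otherwise open a new cluster."""
    representatives: List[Sequence[float]] = []
    labels: List[int] = []
    for emb in embeddings:
        for idx, rep in enumerate(representatives):
            if cosine(emb, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels


# Toy "speaker embeddings": two recordings per hypothetical speaker.
toy = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
labels = greedy_cluster(toy)
print(labels)
```

Recordings that share a label would be candidates for the same speaker; a real system would work on high-dimensional embeddings and likely use a more robust clustering method.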
10. 【2606.28044】A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs
Link: https://arxiv.org/abs/2606.28044
Authors: Aniket Deroy, Kripabandhu Ghosh, Saptarshi Ghosh
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, recent times, Large
Comments: Accepted at ICAIL 2026
Abstract:In recent times, Large Language Models (LLMs) are increasingly being used for legal case judgement summarization. Most prior works have tried traditional extractive and abstractive summarization of case judgements. However, hybrid or extractive-abstractive techniques have not been explored much. In this work, we propose a novel tree-of-thoughts inspired extractive-abstractive summarization approach for legal judgement summarization. We conduct experiments using two popular LLMs, DeepSeek and Llama, and compare extractive, abstractive and extractive-abstractive summarization. Our experiments show that the proposed extractive-abstractive prompt provides better summaries compared to other types of LLM prompts.
11. 【2606.28013】The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization
Link: https://arxiv.org/abs/2606.28013
Authors: Chengxiao Dai, Zhaokun Yan, Zhanhui Lin
Categories: Computation and Language (cs.CL)
Keywords: Headline type-correctness, LLM autoformalization, sim, Headline, method resolves
Abstract:Headline type-correctness (TC\%) of LLM autoformalization has climbed from $\sim$53\% to $\sim$76\% in two years, yet this scalar conceals which errors each method resolves. We propose a signal-coverage matrix that crosses the Lean elaborator (pass/fail) with a semantic-equivalence judgment (equivalent/not), sorting every output into one of four cells: true success (TS), type-only (TO), semantic-only (SO), or both fail (BF). On ProofNet\# and MiniF2F-test with DeepSeek V4-Pro across Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization (SAF): (1) the +34 to +36 TS gain across the three elab-feedback methods is $\sim$64\% type-stratum recovery, with SO flat on net (87.5\% of original semantic errors rescued, 8 newly created). (2) The TO-to-TS rate is 23/61 for each method (Wilson 95\% CI [26.6\%, 50.3\%]), and this stratum-level recovery rate predicts $\Delta$TS on held-out methods to within 2/186 and renders $\Delta$TC linear in the Vanilla elab-fail rate across six (model, dataset) cells ($R^2=0.96$). (3) The two judges disagree by 26 to 37 pp on elab-feedback outputs (vs. 7 pp on Vanilla), with 30 to 56\% of symbolic-judge false negatives traceable to elaborator-forced rewrites. The persistent residual reduces to two gold-formalization errors. TC\% gains should be credited by which cell moved, not by the scalar alone.
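The abstract above quotes a Wilson 95% confidence interval for the TO-to-TS rate of 23/61. A minimal sketch of the standard Wilson score interval is given below so the quoted bounds can be checked; the function name `wilson_interval` and the rounded critical value `z = 1.96` are choices made here, not details from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(23, 61)
print(f"{low:.1%} to {high:.1%}")
```

Unlike the simple normal approximation, the Wilson interval is not centred on the raw proportion, which makes it better behaved for small samples such as the 61 cases here.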
12. 【2606.28002】Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection
Link: https://arxiv.org/abs/2606.28002
Authors: Muhammad Shakeel Akram, Amal Htait, Abdul Hamid Sadka, Emma Meisingseth, Karishma Jaitly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Keywords: imposes substantial financial, substantial financial losses, Insurance fraud imposes, fraud imposes substantial, operational inefficiencies
Comments: 10 pages, 8 figures, 2 tables
Abstract:Insurance fraud imposes substantial financial losses and operational inefficiencies, raising premiums and impacting trust among legitimate policyholders. Early detection at FNOL remains a persistent challenge. Existing approaches rely largely on private, text-only datasets, limiting progress on multimodal methods that integrate linguistic, behavioural, and speaker-based indicators. We introduce a synthetic multimodal framework that replicates FNOL conditions. It generates agent-customer dialogue transcripts and two-speaker audios, performs ASR and diarisation. Downstream modules combine NER, regex-based feature extraction, LLM-RAG retrieval, and speaker embeddings in a rule-based risk score to flag narrative reuse, structural inconsistencies, and cross-case voice repetition while balancing sensitivity and false positives. Dataset validation and component-level evaluations show stability and transfer potential, offering a reproducible baseline beyond text-only fraud detection.
13. 【2606.27981】ToxiREX: A Dataset on Toxic REasoning in ConteXt
Link: https://arxiv.org/abs/2606.27981
Authors: Stefan F. Schouten, Ilia Markov, Piek Vossen
Categories: Computation and Language (cs.CL)
Keywords: toxic reasoning schema, multilingual dataset called, Toxic REasoning, systematic toxic reasoning, reasoning schema
Abstract:We introduce a new, contextual, multilingual dataset called ToxiREX: Toxic REasoning in ConteXt. The dataset consists of threads of Reddit comments and structured characterizations of what the comments imply, following a systematic toxic reasoning schema developed in a previous paper. Using the schema allows us to capture and explain implicit and context-dependent toxicity, while supporting mappings to existing toxicity taxonomies. The dataset includes comments in six languages (English, Arabic, Turkish, Spanish, German, and Dutch), collected from posts connected to specific major events (e.g. the 2023 Turkey earthquakes; the Russian invasion of Ukraine). We describe the context-preserving preprocessing of the threads. We create a training set of 125 thousand comments which is annotated by a commercially available LLM, and a test set of just under three thousand comments that is annotated by native speakers. We show that apparent disagreements in the test set annotations often reflect defensible alternative interpretations rather than noise. Finally, we provide baseline results by prompting and fine-tuning language models. To produce these results, we develop evaluation strategies for our hierarchical, schema-based predictions. While models perform better than random, there remains a lot of room for improvement, showing the task to be challenging. ToxiREX is the first dataset to simultaneously incorporate multiple languages, conversational context, and implicit toxicity, while using the toxic reasoning schema for rich, structured annotations. Dataset available at: this https URL
14. 【2606.27973】From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection
Link: https://arxiv.org/abs/2606.27973
Authors: Yasaman Haghbin, Sina Rashidi, Ali Zolnour, Fatemeh Taherinezhad, Ali Fartoot, Hossein Azadmaleki, James M Noble, Maryam Dadkhah, Maryam Zolnoori
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: costly biomarker assays, Speech-based cognitive impairment, impairment detection offers, remain clinically uninterpretable, transformer-based models remain
Comments: Accepted to Interspeech 2026
Abstract:Speech-based cognitive impairment detection offers a noninvasive, accessible alternative to costly biomarker assays, yet transformer-based models remain clinically uninterpretable. We propose a multi-stage explainability framework that translates black-box transformer predictions into clinically grounded narratives by integrating SHapley Additive exPlanations (SHAP)-based token attribution, theory-informed linguistic features, and a four-stage LLM reasoning pipeline using LLaMA-3.1-70B-Instruct. Built on the SpeechCARE-Adaptive Gating Network multimodal screening model (F1 = 72.11% on the NIA PREPARE benchmark), the framework maps model outputs to four cognitive-linguistic dimensions, including lexical richness, syntactic complexity, and semantic coherence. Physician evaluation on 70 stratified English samples demonstrated strong alignment with patient-level cognitive profiles, and a System Usability Scale score of 82/100 indicated high potential for clinical workflow integration.
15. 【2606.27959】An Empirical Analysis of Factual Errors in Human-Written Text and its Application
Link: https://arxiv.org/abs/2606.27959
Authors: Kazuma Iwamoto, Kazumasa Omura, Shotaro Ishihara
Categories: Computation and Language (cs.CL)
Keywords: identifying factually incorrect, factually incorrect spans, important research problem, factual errors, Factual Error Detection
Abstract:Factual Error Detection (FED), which is the task of identifying factually incorrect spans in a given text, has long been recognized as an important research problem. However, with the rapid rise of large language models (LLMs), research attention has shifted toward factual errors specific to LLM-generated text (hallucinations) and their detection. As a result, the detection of factual errors in human-written text has been relatively neglected. To address this gap, we first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles, a representative source of text that is guaranteed to be human-written and contains few grammatical errors. Our analysis revealed characteristic categories, such as kanji misconversions and numeral classifier errors, that are not a focus of existing hallucination benchmarks. Based on the taxonomy, we then evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and real corrections. Experimental results demonstrated that even high-performance LLMs such as GPT-5.4 achieved a word-level F1 score of only 52% on the synthetic evaluation data, highlighting the task difficulty. Furthermore, a detailed analysis by detection difficulty revealed the current state of FED.
16. 【2606.27951】AI Persuasive Framing in Collective Dilemmas
Link: https://arxiv.org/abs/2606.27951
Authors: Anders Giovanni Møller, Alessia Galdeman, Arianna Pera, Luca Maria Aiello
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)
Keywords: large-scale societal problems, addressing large-scale societal, enhance human cooperation, flexible behavioral nudges, societal problems
Comments: The first two authors contributed equally to this research. The article contains 20 pages, 10 figures, and 2 tables
Abstract:AI agents are promising tools that can act as flexible behavioral nudges to enhance human cooperation in addressing large-scale societal problems. However, evidence on whether AI agents can effectively boost cooperation remains mixed. We recruited 1,283 participants to play iterated Collective Risk Games in small groups, testing whether AI assistants could nudge participants toward cooperation. By using persuasive framing personalized to each player's Social Value Orientation profile, the AI interventions significantly increased contributions and group success rates. These cooperative effects were short-lived, however, fading after the first few rounds. Strikingly, when the AI treatments were reconfigured to promote selfish behavior through exculpatory framing, the negative effects on contributions and group success were larger and substantially more persistent, particularly for personalized interventions. This asymmetry between prosocial and antisocial persuasion highlights the dual-use risks of AI systems designed to influence group behavior in collective action settings.
17. 【2606.27941】VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring
Link: https://arxiv.org/abs/2606.27941
Authors: Kairui Zhang, Ziwen Yu, Zahraa S. Abdallah, Martha Lewis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Transformer token vocabulary, Transformer residual streams, Transformer residual, Vocabulary-Aligned Sparse Autoencoder, Sparse autoencoders
Comments: 14 pages, 7 figures. Accepted to the 2nd Workshop on Compositional Learning at ICML 2026
Abstract:Sparse autoencoders (SAEs) provide useful decompositions of Transformer residual streams, but their learned features are usually named post hoc rather than directly connected to the Transformer's token vocabulary. We introduce Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that trains SAE features under vocabulary-aligned anchoring and assigns each feature an intrinsic token name: the token string whose embedding is nearest to that feature. Without reducing reconstruction quality compared with a standard SAE, VASAE produces dictionaries with vocabulary-aligned features. Using a 0.8 cutoff on the nearest-token alignment score, dictionaries trained on GPT-2-small post-residual streams align about 90% of features in layers 0--10. In Llama-3.1-8B, representative shallow and middle-layer dictionaries contain strongly aligned features, including 92.8% in the shallow layer, while the representative final-layer dictionary shows limited alignment. After subtracting the sentence-level mean sparse code, case studies show that many remaining intrinsic token names are relevant to nearby input tokens. These results suggest that vocabulary-aligned anchoring can connect learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.
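The naming rule described above (each feature is named after the token whose embedding is nearest to it, with a 0.8 cutoff on the alignment score) can be sketched in a few lines. This is a simplified illustration that assumes cosine similarity as the alignment score; the function `name_feature` and the two-dimensional toy embeddings are hypothetical and do not reflect the paper's training procedure.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def name_feature(
    feature: Sequence[float],
    token_embeddings: Dict[str, Sequence[float]],
    cutoff: float = 0.8,
) -> Tuple[str, float, bool]:
    """Return the nearest token, its score, and whether the score meets the cutoff."""
    token, score = max(
        ((tok, cosine(feature, emb)) for tok, emb in token_embeddings.items()),
        key=lambda pair: pair[1],
    )
    return token, score, score >= cutoff


# Toy token embeddings (assumed values for illustration only).
toy_embeddings = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
name, score, aligned = name_feature([0.9, 0.1], toy_embeddings)
print(name, round(score, 3), aligned)
```

A feature direction lying between two token embeddings, such as `[1.0, 1.0]` here, would still receive a nearest-token name but would fall below the cutoff and count as unaligned.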
18. 【2606.27926】Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing
Link: https://arxiv.org/abs/2606.27926
Authors: Can Li, Ting Zhang, Junbo Zhao, Hua Huang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Geometry Problem Solving, Geometry Problem, Problem Solving, Solving have increasingly, combining neural intuition
Abstract:Geometry problem solving has increasingly adopted the neuro-symbolic paradigm, combining neural intuition with symbolic rigor. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries. To address these, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning into a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes, proving that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, offering profound insights into how neural agents can be grounded by formal systems to achieve verifiable problem-solving capabilities.
19. 【2606.27909】Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs
Link: https://arxiv.org/abs/2606.27909
Authors: Avni Mittal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Keywords: large language models, language models typically, single hidden side, strong language priors, observable cue points
Abstract:Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents' incentives. We extend the Werewolf game with a Jester, a third faction whose utility on peer suspicion is inverted because it wins by being voted out, so optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B with Jester self-learning on and off, the Jester wins 60-70% of games while Werewolves never exceed 20%, and GPT-4.1 wolves vote the Jester out on day 1 in 60-70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. Triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.
20. 【2606.27881】A Study of Temporal Fusion Strategies for Named Entity Recognition in Historical Texts
Link: https://arxiv.org/abs/2606.27881
Authors: Emanuela Boros
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: named entity recognition, Temporal variation poses, entity recognition, salience across time, variation poses
Comments:
Abstract: Temporal variation poses a unique challenge for named entity recognition (NER) in historical texts, where entities drift in surface form and salience across time. While language models (LMs) have made progress in various NLP tasks, their ability to reason about temporality, especially in diachronic contexts, remains limited or, at least, questionable. In this paper, we systematically study how temporal metadata can be structurally embedded into NER models using a range of lightweight fusion strategies. We experiment with both absolute and relative temporal representations, injected into Transformer-based architectures via early or late fusion mechanisms such as cross-attention, adapters, and concatenation. Our evaluations on French and German historical datasets reveal that late fusion strategies yield more robust and temporally generalisable performance, particularly in early and noisy periods.
21. 【2606.27808】Learning Complementary Action Modeling from Automotive Maintenance Instructions
Link: https://arxiv.org/abs/2606.27808
Authors: Jiaqi Wu, Bai Li, Jochen Hartmann, Martin Gaedke, Sander Stuijk
Categories: Computation and Language (cs.CL)
Keywords: minute lexical variation, action phrase, variation can reverse, sentence remains unchanged, Complementary Action Modeling
Comments: Preprint. 11 pages, 4 figures
Abstract:A minute lexical variation can reverse the procedural meaning of an instruction even when the rest of the sentence remains unchanged. In automotive maintenance instructions, this pattern often appears when an action phrase turns an instruction into its procedural counterpart. The entities, modifiers, and surrounding context remain largely invariant, while the action phrase determines the procedural relation. We define this task as Complementary Action Modeling (CAM). Given a maintenance instruction, the goal is to identify or generate its procedural counterpart by modifying the action phrase while preserving the remaining sentence context. This task focuses on three aspects: distinguishing complementarity from surface similarity, controlling generation at the action-phrase level, and evaluating relational correctness using retrieval, overlap-based, and human evaluation. Using a German automotive maintenance dataset, we examine these questions through candidate matching and controlled Seq2Seq generation. The results show that complementary maintenance instructions are best modeled as procedural associations grounded in subtle lexical cues. They should therefore not be treated as ordinary cases of sentence similarity or synonym-based paraphrasing.
22. 【2606.27793】Position Bias Correction is Insufficient for One-Pass Attention Sorting
Link: https://arxiv.org/abs/2606.27793
Authors: Qiong Tang, Xiangkun Hu, Xiangyang Liu, Yiran Chen, Yunfan Shao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Long-context language models, Long-context language, language models suffer, information in middle, Sorting
Comments:
Abstract: Long-context language models suffer from position bias, where information in middle positions is underutilized. Attention Sorting addresses this by iteratively reordering documents based on attention patterns, but its multiple sort-and-generate cycles increase deployment cost. We hypothesize that position bias is the primary bottleneck and propose Debiased One-Pass Attention Sorting, which estimates a per-prompt position-bias curve from the low-attention majority of documents and uses it to correct raw attention scores (via subtraction or division) to enable single-pass sorting. Our experiments on two models refute this hypothesis in the tested setting: on LLaMA-2-7B-32K-Instruct, debiasing produces identical results to uncalibrated single-pass sorting (94.83% containment accuracy), while on YaRN-Llama-2-7b-64k, debiasing improves accuracy by 8.67 percentage points but remains 14.84pp behind iterative sorting, closing only 37% of the gap. These results suggest that position-bias correction is insufficient to match iterative sorting, and that repeated reordering provides additional benefits beyond bias correction.
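As a quick arithmetic check on the YaRN-Llama-2-7b-64k figures, assuming the "gap" is the accuracy difference between uncalibrated single-pass sorting and iterative sorting (the abstract does not define it explicitly), the reported share follows from the two numbers given:

```latex
\[
\text{gap closed}
  = \frac{8.67}{8.67 + 14.84}
  = \frac{8.67}{23.51}
  \approx 0.369
  \approx 37\%
\]
```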
23. 【2606.27791】NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation
Link: https://arxiv.org/abs/2606.27791
Authors: Qiong Tang, Xiangkun Hu, Xiangyang Liu, Yiran Chen, Yunfan Shao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Hybrid attention models, attention remains unsolved, efficient long-context inference, Hybrid attention, remains unsolved
Comments:
Abstract: Hybrid attention models that mix full and sliding-window attention across layers offer a promising approach to efficient long-context inference, but the critical question of which layers should retain full attention remains unsolved. Existing methods use either fixed periodic patterns or attention-based heuristics that may not capture what matters for downstream accuracy. We propose NLL-guided layer selection, a training-free method that directly measures each layer's importance by computing the negative log-likelihood degradation on answer tokens when that layer uses sliding-window instead of full attention. On LongMemEval with Qwen3-4B, our method achieves 64.6% accuracy using only 1/4 full-attention layers, matching the 1/2-FA periodic baseline (65.0%) while halving the computational budget. NLL-guided selection outperforms the SWAA-reported periodic 1/4-FA baseline by 10.4 percentage points and a matched LightTransfer-style baseline by 26.4 percentage points. De-confounding analysis shows the signal is consistent with long-range attention needs rather than generic layer sensitivity. The method requires only ~15 minutes of one-time calibration, advancing the efficiency-accuracy Pareto frontier for long-context LLM deployment.
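The selection idea can be illustrated with a minimal Python sketch. It assumes the per-layer NLL values have already been measured, and that the layers with the largest degradation are the ones kept at full attention; the top-k rule, the function name, and the numbers are illustrative assumptions, not the paper's procedure.

```python
def select_full_attention_layers(nll_full, nll_when_windowed, budget):
    """Pick the layers whose switch to sliding-window attention hurts most.

    nll_full: answer-token NLL with full attention in every layer.
    nll_when_windowed: {layer index: NLL when only that layer is windowed}.
    budget: number of layers allowed to keep full attention.
    """
    # Degradation = how much the NLL rises when this one layer is windowed.
    degradation = {
        layer: nll - nll_full for layer, nll in nll_when_windowed.items()
    }
    # Assumption: keep the `budget` layers with the largest degradation.
    ranked = sorted(degradation, key=degradation.get, reverse=True)
    return sorted(ranked[:budget])


# Hypothetical calibration numbers for an 8-layer model, 1/4 budget.
measured = {0: 2.01, 1: 2.40, 2: 2.03, 3: 2.02, 4: 2.90, 5: 2.05, 6: 2.04, 7: 2.02}
print(select_full_attention_layers(2.00, measured, budget=2))
```

With these made-up values the sketch keeps layers 1 and 4, the two whose windowing raises the NLL most.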
24. 【2606.27786】SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2606.27786
Authors: Ruochang Li, Pengcheng Huang, Zhenghao Liu, Yukun Yan, Huiyuan Xie, Yu Gu, Ge Yu, Maosong Sun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: support response generation, Retrieval-augmented generation, incorporating external knowledge, incorporating external, support response
Comments: 19 pages, 13 Figures
Abstract:Retrieval-augmented generation (RAG) enhances LLMs by incorporating external knowledge to support response generation. However, conflicts between retrieved context and parametric knowledge have emerged as a critical challenge in RAG systems. To mitigate such conflicts, numerous studies have attempted to identify and edit knowledge-related internal neurons, aiming to improve the ability of LLMs to rely on contextual evidence during generation. However, these neuron-level approaches may introduce unintended cascading effects that compromise the general capabilities of LLMs, as the modified neurons are often entangled with broader model behaviors and functionalities. In this paper, we introduce SHIFT, a novel framework that reformulates neuron-level modification as learnable gate modulation, allowing LLMs to adaptively regulate internal activations for knowledge conflict resolution. Technically, our SHIFT equips LLMs with a lightweight gate module and optimizes fewer than 0.01% trainable parameters while keeping the backbone model frozen. During generation, the gate module adjusts the model's internal representations to adaptively leverage contextual and parametric knowledge. Extensive experiments on six datasets validate the effectiveness of our SHIFT in comparison with various competing baselines. All datasets and code are available at this https URL.
25. 【2606.27785】Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study
Link: https://arxiv.org/abs/2606.27785
Authors: Qiong Tang, Xiangkun Hu, Xiangyang Liu, Yiran Chen, Yunfan Shao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Training-free compression methods, large language models, guide compression decisions, Training-free compression, large language
Comments:
Abstract: Training-free compression methods for large language models (LLMs) often use calibration data to guide compression decisions. ROCKET, a recent method combining sparse-dictionary factorization with multi-choice knapsack problem (MCKP) allocation, derives its per-layer factorization from an output reconstruction objective but uses weight-space Frobenius error as the MCKP allocation cost. We investigate whether aligning the allocation cost with the output-space objective improves compressed model fidelity. On Qwen3-8B at 50% compression, our ROCKET-ActCost achieves +0.8 percentage points higher average accuracy across 8 zero-shot benchmarks (53.1% vs 52.3%), but increases WikiText perplexity by 16% (61.46 vs 52.98). This accuracy-perplexity tradeoff reveals that different allocation objectives favor different downstream metrics. The high correlation ($$0.99) between weight-space and output-space errors limits allocation divergence, explaining the modest effect size. On Llama-3.2-1B at 20% compression, the two methods produce near-identical results (53.3% vs 53.5% accuracy, 14.45 vs 14.66 PPL), suggesting that the effect of the cost function is minor at lower compression ratios.
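To make the allocation question concrete, here is a toy multi-choice knapsack in Python: each layer picks exactly one compression option, the total size must fit a budget, and the summed cost is minimized. It is a brute-force illustration only; the option table, field names, and numbers are invented, and it is not ROCKET's solver. Swapping the cost column shows how the chosen allocation can change.

```python
from itertools import product


def allocate(options, budget, cost_key):
    """Brute-force MCKP: one option per layer, total size <= budget."""
    best = None
    for choice in product(*options):
        size = sum(o["size"] for o in choice)
        if size > budget:
            continue  # violates the size budget
        cost = sum(o[cost_key] for o in choice)
        if best is None or cost < best[0]:
            best = (cost, [o["rank"] for o in choice])
    return best


# Hypothetical two-layer problem with two options per layer.
options = [
    [  # layer A
        {"rank": "low", "size": 1, "weight_err": 2.0, "output_err": 5.0},
        {"rank": "high", "size": 3, "weight_err": 1.0, "output_err": 1.0},
    ],
    [  # layer B
        {"rank": "low", "size": 1, "weight_err": 3.0, "output_err": 1.5},
        {"rank": "high", "size": 3, "weight_err": 0.5, "output_err": 1.0},
    ],
]

print(allocate(options, budget=4, cost_key="weight_err"))
print(allocate(options, budget=4, cost_key="output_err"))
```

In this toy case the two costs disagree about which layer deserves the larger budget. When the two error measures are highly correlated, as the abstract reports, such disagreements become rare.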
26. 【2606.27742】KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems
Link: https://arxiv.org/abs/2606.27742
Authors: Minjun Choi, Yerin Kim, Junghyuk Seo, Sujin Mo, Hyemin Lee, Youngjoong Ko
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Enterprise Knowledge Graphs, graphs remains costly, private enterprise graphs, enterprise graphs remains, Knowledge Graphs
Comments: 11 pages, 2 figures, 10 tables
Abstract:Enterprise Knowledge Graphs (KGs) are increasingly used for internal search, analytics, and question answering, but building natural-language interfaces for private enterprise graphs remains costly. We present KG2Cypher, a data-centric pipeline for building enterprise text-to-Cypher systems from existing KGs. KG2Cypher first constructs an executable Cypher query from observed graph facts and then uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference. We evaluate KG2Cypher in Korean enterprise settings, where short search-style queries and schema paraphrases make language grounding difficult. LoRA SFT improves execution-result F1 from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. In an 11-class setting, KG2Cypher achieves 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.
27. 【2606.27731】Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment
Link: https://arxiv.org/abs/2606.27731
Authors: Zhuo Zuo, Li Yue, Wenhao Zheng, Chenpeng Wang, Xianggen Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Keywords: large language models, strong general capabilities, general capabilities, large language, language models
Comments:
Abstract:Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as unstructured categories and ignores the metric structure of their values. We address this mismatch with Smooth Maximum Mean Discrepancy (SMMD), which builds on the classic MMD by incorporating value-distance kernels over numeric tokens and graph-based smoothness. With this kernel defined over a numeric sub-vocabulary, SMMD aligns the predicted numeric distribution to the target via kernel matching and smooths the prediction-target residual over the induced kernel graph to encourage local consistency. We evaluate SMMD on four numeric-target tasks: mathematical reasoning, arithmetic calculation, clock-time recognition, and chart question answering, across multiple open-weight LLM and VLM backbones. SMMD consistently improves accuracy over both cross-entropy and recent numeric-target losses; analyses show complementary effects between MMD and smoothness and underscore the importance of distance-based kernel design. Code is available at this https URL.
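The contrast with cross-entropy can be seen in a small Python sketch of the standard squared MMD between two discrete distributions under a value-distance kernel. The Gaussian kernel, the bandwidth, and the digit vocabulary are assumptions for illustration; the graph-based smoothness term of SMMD is not modeled here.

```python
import math


def gaussian_kernel(values, sigma=1.0):
    """Kernel matrix over numeric token values: closer values, larger entry."""
    return [
        [math.exp(-((a - b) ** 2) / (2 * sigma**2)) for b in values]
        for a in values
    ]


def mmd_squared(p, q, kernel):
    """Squared MMD between distributions p and q: (p - q)^T K (p - q)."""
    d = [pi - qi for pi, qi in zip(p, q)]
    n = len(d)
    return sum(d[i] * kernel[i][j] * d[j] for i in range(n) for j in range(n))


def one_hot(index, size=10):
    return [1.0 if i == index else 0.0 for i in range(size)]


digits = list(range(10))  # hypothetical numeric sub-vocabulary 0..9
K = gaussian_kernel(digits)
target = one_hot(7)

near = mmd_squared(one_hot(8), target, K)  # off by one
far = mmd_squared(one_hot(2), target, K)   # off by five
print(near < far)
```

Cross-entropy would penalize a confident "8" and a confident "2" identically when the target is "7", whereas the kernel-based distance penalizes the nearer miss less.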
28. 【2606.27717】Do Speech Emphasis Models Generalize across Languages and Emotions?
Link: https://arxiv.org/abs/2606.27717
Authors: Megan Wei, Deepali Aneja, Jiaqi Su, Yunyun Wang, Haonan Chen, Zeyu Jin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: neutral read speech, existing emphasis detection, Prosodic emphasis varies, monolingual neutral read, emphasis detection models
Comments: Interspeech 2026
Abstract:Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion Emphasis), a corpus of 10,000 professionally recorded expressive utterances (14.13 hours) across 7 languages and 34 emotion/style categories, with three-level perceptual labels (10 annotations per sample). We benchmark two state-of-the-art architectures under monolingual, cross-lingual, multilingual, cross-emotion, cross-dataset, and data-scale settings. Monolingual models show limited zero-shot transfer, degrading across typologically distant languages, while multilingual training substantially improves robustness. Models transfer robustly between high- and low-arousal emotions; bidirectional transfer between synthetic and perceptual benchmarks suggests shared prosodic structure; and performance stays robust even at smaller training scales.
29. 【2606.27709】Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning
Link: https://arxiv.org/abs/2606.27709
Authors: Austin MY Cheung, Yi Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: degrades factual reliability, Recent work, large language models, fine-tuning large language, social warmth degrades
Comments: 9 pages, 8 tables, 5 figures
Abstract:Recent work has shown that fine-tuning large language models (LLMs) for social warmth degrades factual reliability and increases sycophancy. We investigate a related but distinct failure mode: warmth fine-tuning also weakens adversarial safety, making models more susceptible to jailbreaks and harmful output generation. We examine whether this reflects an inherent consequence of empathetic adaptation or an artifact of data construction. To address this, we introduce a persona-driven rewriting pipeline that conditions user turns on low agreeableness and pairs this with warm, de-escalating assistant responses. Across three experiments on four models, our approach reduces jailbreak susceptibility and harmful output rates relative to generic warmth fine-tuning baselines, while preserving conversational warmth. Representational probing provides suggestive evidence that this conditioning reduces the geometric alignment between warmth and compliance directions in latent space. These results show that safer empathetic fine-tuning is achievable through data design alone, without safety labels, harm detectors, or changes to the training objective.
30. 【2606.27705】Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling
Link: https://arxiv.org/abs/2606.27705
Authors: Changze Lv, Zhenghua Wang, Yiran Ding, Yixin Wu, Tianlong Li, Zhibo Xu, Muling Wu, Tianyuan Shi, Shizheng Li, Qi Qian, Xuanjing Huang, Xiaoqing Zheng
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, critical information located, Language Models, underrepresented or lost
Comments:
Abstract: Large Language Models (LLMs) still struggle with the "lost-in-the-middle" problem, where critical information located in the middle of long-context inputs is often underrepresented or lost. While existing methods attempt to address this by combining multi-scale rotary position embeddings (RoPE), they typically suffer from high latency or rely on suboptimal hand-crafted scaling strategies. To overcome these limitations, we introduce a layer-specific positional embedding scaling (LPES) method that assigns distinct scaling factors to each layer. LPES achieves a more balanced attention distribution without fine-tuning model parameters or increasing inference delay. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bézier curves to significantly reduce the search space. Extensive experiments demonstrate that LPES effectively mitigates positional attention bias and delivers consistent improvements across multiple long-context benchmarks, yielding up to an 11.2% accuracy gain on the key-value retrieval dataset.
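One way a Bézier curve can shrink such a search space is sketched below in Python: a handful of control points define a smooth curve, and the per-layer scaling factors are read off that curve, so a search only has to tune the control points. This is an assumed reading of the abstract; the function names and control values are hypothetical, and the genetic algorithm itself is omitted.

```python
def bezier(control, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    points = list(control)
    while len(points) > 1:
        # Repeatedly interpolate between neighbouring points.
        points = [(1 - t) * a + t * b for a, b in zip(points, points[1:])]
    return points[0]


def layer_scales(control, num_layers):
    """Sample one scaling factor per layer from the curve."""
    return [bezier(control, i / (num_layers - 1)) for i in range(num_layers)]


# Three hypothetical control values describe all five layers.
print(layer_scales([1.0, 2.0, 1.0], num_layers=5))
```

Here three numbers parameterize five layers; for a model with dozens of layers the reduction in free parameters is correspondingly larger.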
31. 【2606.27700】Joint Transcription and Decryption of Images of Encrypted Handwritten Documents: A Comparison with the Traditional Pipeline
Link: https://arxiv.org/abs/2606.27700
Authors: Marino Oliveros-Blanco, Lei Kang, Alicia Fornés, Beáta Megyesi
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Historical encrypted manuscripts, Historical encrypted, intersection of cryptology, computer vision, challenging problem
Comments: Published at HistoCrypt 2026 (9th International Conference on Historical Cryptology). NEALT Proceedings Series Number 61. Tartu University Library. 10 pages
Abstract:Historical encrypted manuscripts present a challenging problem at the intersection of cryptology, linguistics, paleography, and computer vision. Current automatic decipherment approaches usually rely on a two-stage pipeline: transcription of cipher symbols from manuscript images, followed by decryption into plaintext. However, this design is sensitive to transcription errors, which propagate to the final output. We present Direct Image Decryption, an end-to-end approach that directly maps encrypted manuscript images to plaintext, bypassing the intermediate transcription stage. Using the Copiale cipher as a case study, we build a synthetic data generation pipeline to create large-scale cipher-like training data and compare the traditional pipeline with the proposed joint architecture. Results show that joint image-to-plaintext modeling is a promising alternative to traditional transcription-based pipelines.
32. 【2606.27687】Mitigating LLM-based p-Hacking by Preregistering for the Next LLM
Link: https://arxiv.org/abs/2606.27687
Authors: Maria Thomas, Kristina Gligoric, Nihar B. Shah
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Keywords: Large language models, feed downstream hypothesis, outputs feed downstream, Large language, downstream hypothesis tests
Comments:
Abstract:Large language models (LLMs) are increasingly used to generate, classify, and annotate data whose outputs feed downstream hypothesis tests. However, LLM-based research is easy to p-hack: a researcher can tune the prompts, decoding parameters, or output format until a desired result is reached. We propose a protocol to mitigate p-hacking in LLM-based research: preregistering the experiment and eligible models, and then running it on the first eligible LLM that is released after the preregistration. The researcher finalizes the procedure on current models, preregisters the analysis plan together with a set of eligible future models, and runs the confirmatory analysis on the first eligible model released afterward. Because this model does not exist at commitment time, it cannot be hacked against; furthermore, configurations that hack one model frequently do not transfer to the next. We evaluate the protocol on two tasks whose true values are known. Across 20 models from four providers and 11 LLM-analysis configurations, the protocol would have blocked successful transfer of the p-hack in 73.9% and 72.7% of cases in the two tasks. Additional analyses reveal that mitigation remains substantial under several stress tests. Finally, putting money where our mouth is, we followed our own protocol and preregistered our experiment. The preregistered experiment confirmed the protocol's effectiveness: out of the 7 configurations that hacked the prior model, the hacking failed to carry over in 6 configurations on the first eligible model released afterward.
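The model-selection step of the protocol can be sketched in a few lines of Python: given a preregistration date and a set of eligible model families, pick the earliest eligible release after that date. The record layout, model names, and dates are hypothetical.

```python
from datetime import date


def first_eligible_model(releases, eligible_families, preregistered_on):
    """Return the earliest eligible release strictly after preregistration."""
    later = [
        r
        for r in releases
        if r["family"] in eligible_families and r["released"] > preregistered_on
    ]
    # None means no eligible model has been released yet.
    return min(later, key=lambda r: r["released"], default=None)


# Hypothetical release log.
releases = [
    {"name": "model-a-v1", "family": "A", "released": date(2026, 1, 10)},
    {"name": "model-b-v2", "family": "B", "released": date(2026, 3, 5)},
    {"name": "model-a-v2", "family": "A", "released": date(2026, 4, 20)},
    {"name": "model-c-v1", "family": "C", "released": date(2026, 2, 1)},
]

chosen = first_eligible_model(releases, {"A", "B"}, date(2026, 1, 15))
print(chosen["name"])
```

Because the chosen model does not exist when the analysis plan is fixed, the prompts and decoding settings cannot have been tuned against it.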
33. 【2606.27681】Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation
Link: https://arxiv.org/abs/2606.27681
Authors: Xiang Gao, Kaiwen Dong, Yuguang Yao, Padmaja Jonnalagedda, Kamalika Das
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: summarize interaction history, partially observed environments, observed environments rely, latent state unidentifiable, World models
Comments:
Abstract: World models in partially observed environments rely on latent representations that summarize interaction history, but in many modern LLM-based architectures predictive performance fails to reflect representation quality due to history bypass, rendering the latent state unidentifiable. Strict latent state mediation, requiring predictions to depend only on the latent state and action, is a classical principle that resolves this, but enforcing it in text-based settings is an open challenge: textual latent states are discrete and non-differentiable, precluding variational training, and expressive LLM decoders readily ignore the bottleneck. We show how to make strict mediation work in the text domain. We formalize why it is necessary, showing that strict mediation makes representation quality empirically testable while history-leaky architectures break this connection. We then introduce textual latent states, which are discrete, interpretable, and variable-length, and factorized GRPO (fGRPO), a tree-structured reinforcement learning method that enforces strict mediation during training. Experiments on TextWorld and ScienceWorld show preserved one-step prediction accuracy alongside up to 57% gains in representation quality and 98% improvements in rollout performance, increasing with task complexity and horizon.
34. 【2606.27679】From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models
Link: https://arxiv.org/abs/2606.27679
Authors: Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Quang Minh Nguyen, Duc Anh Vu, Anh Tuan Luu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, internal model signals, Large Language, Language Models, hallucinations in Large
Comments:
Abstract:Probe-based uncertainty estimation (UE) has emerged as a prominent approach to detect hallucinations in Large Language Models (LLMs) by learning uncertainty from internal model signals. Yet, recent methods vary simultaneously across feature design, training data construction, and evaluation setting, obscuring what actually drives performance. To address this issue, we propose a factorised study of probe-based UE under matched conditions. Our results show that raw hidden states and attention features are difficult to outperform in-domain. However, under distribution shift, structured and compressed features are more robust, suggesting that in-domain performance alone is insufficient to measure progress. Furthermore, prompting and label construction significantly affect probe behaviour. Building on these best-practice findings, we train benchmark-based pretrained probes that transfer reasonably well to open-ended factual generation, providing a stable off-the-shelf baseline. Our work encourages more deployment-oriented evaluation of probe-based uncertainty estimators. The code repository is available at this https URL.
35. 【2606.27669】When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search
Link: https://arxiv.org/abs/2606.27669
Authors: Yiling Tao, Shihan Deng, Meiling Tao, Pengzhi Wei, Zhichao Hu, Zhihao Zhu
Categories: Computation and Language (cs.CL)
Keywords: solve complex information-seeking, fulfill user goals, large language models, complex information-seeking tasks, Search agents powered
Comments: 26 pages, 7 figures, 12 tables
Abstract:Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume that user queries are complete and explicit, overlooking the fact that real-world search requests are frequently vague, underspecified, or even factually incorrect. In deep search scenarios, such ambiguity can propagate along multi-step reasoning chains and lead agents toward incorrect search trajectories. To address this gap, we introduce DiscoBench, a benchmark for clarification-aware deep search, designed to evaluate whether search agents can proactively identify ambiguity, ask effective clarification questions, and recover correct reasoning paths through user interaction. DiscoBench contains 211 samples and 463 ambiguity instances across 11 real-world domains, covering four ambiguity types. We further design a user simulator for multi-turn interaction and evaluate model performance from four perspectives: task utility, ambiguity detection, interaction strategy, and cost efficiency. Experiments on representative LLMs show that ambiguity detection and effective clarification are distinct capabilities, and that repeatedly searching instead of asking for clarification often performs worse than direct guessing, highlighting a critical gap between retrieval ability and interactive problem-solving in current search agents.
36. 【2606.27632】Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety
Link: https://arxiv.org/abs/2606.27632
Authors: Ting Ma, Xiufeng Huang, Benlei Cui, Xiaowen Xu, Shikai Qiu, Ruijie Jian, Hongxing Li, Guanghui Wang, Longtao Huang, Haiwen Hong, Haolei Xu, Wenjing Jiang, Ziwen Xu, Zhaoyu Fan, Shaoxuan He, Chuxi Xiao, Yujian Li, Xinyue Chen, Chunyang Chai, Wenxuan Liu, Ziheng Wang, Dongjie Zhang, Yangfan Zhou, Libin Dong, Yupeng Cao, Xiaoqian Xia, Jing Wang, Zhe Jiang, Zhenan Ye, Guang Yang, Bin Liu, Wei Peng, Ziqiang Zhu, Meihui Lian, Kaiwen Lv Kacuila, Haidong Ding, Bingyu Zhu, Yan Wang, Hai Zhao, Xuan Jin, Wei Zhao, Pengfei Sun, Wei Wang, Huiming Zhang, Bin Li, Hui Xue
Categories: Computation and Language (cs.CL)
Keywords: Yuvion LLM, dangerous misuse, increasingly deployed, lead to harmful, harmful outputs
Comments:
Abstract: As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from natural inputs alone, but from strategic attempts to evade model policies and safeguards. However, existing general-purpose model development largely overlooks this adversarial nature and often remains insufficient for realistic safety scenarios involving planning, tool use, and multi-step reasoning, causing measured safety performance to overestimate real deployment robustness. To address this gap, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM treats adversarial robustness and agentic capability as first-class objectives. Its pipeline combines adversarially aware data construction, knowledge-enhanced continued pretraining, and policy-grounded multi-task safety post-training, including risk-aware supervised fine-tuning and reinforcement learning-based policy optimization, together with safety-aware agentic reinforcement learning for tool use and multi-step reasoning in complex safety scenarios. We further introduce the Yuvion LLM RiskEval (YLRE), a collection of 93 benchmarks across four evaluation categories, covering diverse open and internal evaluations with a focus on safety, adversarial robustness, and real-world capability requirements. Across these evaluations, Yuvion LLM demonstrates clear advantages on safety-focused benchmarks and particularly strong robustness under adversarial conditions, while maintaining solid overall capability. Notably, Yuvion-8B outperforms most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks.
37. 【2606.27629】Cross-Platform Chinese Offensive Comment Detection via Dual-Threshold Hard Example Mining
Link: https://arxiv.org/abs/2606.27629
Authors: Ruixing Ren, Junhui Zhao, Fangfang Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Keywords: Chinese social media, offensive comment detection, social media suffers, detection for Chinese, Chinese social
Comments: 10 pages, 7 figures
Abstract: Offensive comment detection for Chinese social media suffers performance degradation when deployed across platforms. This paper proposes a dual-threshold hard example mining method to address this. First, the clean-Chinese-base RoBERTa is fine-tuned on COLD to establish a binary baseline for fair comparison. Second, a three-class, finely labeled test set covering Weibo, Xiaohongshu, Tieba, and Zhihu is constructed; domain distances from the source are quantified using Jaccard and Proxy-A Distance, and the degradation bottleneck of the baseline under domain shift is systematically revealed. Third, a dual-threshold hard example mining strategy is proposed: high- and low-confidence error-prone samples are filtered from unlabeled corpora by prediction confidence. The model is then fine-tuned a second time under implicit contexts with only a small set of manually labeled hard examples, achieving low-cost cross-platform domain adaptation. Experiments show significant performance gains for the optimized model across the four platforms.
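The filtering step might look roughly like the Python sketch below, under one possible reading of the abstract: unlabeled comments whose prediction confidence falls below a low threshold or above a high threshold are collected into two candidate pools for manual labeling. The threshold values, field names, and this interpretation are all assumptions.

```python
def mine_hard_examples(predictions, low=0.55, high=0.95):
    """Split unlabeled predictions into two candidate pools by confidence.

    predictions: list of {"text": str, "confidence": float} records.
    Returns (low_confidence_pool, high_confidence_pool).
    """
    # Uncertain predictions: the model is close to guessing.
    low_pool = [p["text"] for p in predictions if p["confidence"] <= low]
    # Very confident predictions: candidates for confident mistakes
    # on a new platform, to be checked by annotators.
    high_pool = [p["text"] for p in predictions if p["confidence"] >= high]
    return low_pool, high_pool


# Hypothetical model outputs on unlabeled target-platform comments.
predictions = [
    {"text": "comment 1", "confidence": 0.51},
    {"text": "comment 2", "confidence": 0.80},
    {"text": "comment 3", "confidence": 0.97},
    {"text": "comment 4", "confidence": 0.55},
]

low_pool, high_pool = mine_hard_examples(predictions)
print(low_pool, high_pool)
```

Only the two pools would go to annotators, which is what keeps the labeling cost small relative to labeling the whole corpus.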
38. 【2606.27619】DysLexLens: A Low-Resource LLM Framework for Analysing Dyslexic Learners Insights from Online Forums
Link: https://arxiv.org/abs/2606.27619
Authors: Dana Rezazadegan, Atie Kia, Phongpadid Nandavong, Dominique Carlon, Jeremy Nguyen, Abhik Banerjee, James Marshall, Anthony McCosker, Yong-Bin Kang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Dyslexic learners increasingly, artificial intelligence, study-related tasks, dyslexic learners experience, increasingly use artificial
Comments:
Abstract: Dyslexic learners increasingly use artificial intelligence (AI) tools to support reading, writing, organisation, and study-related tasks. However, their lived experiences with these tools remain largely underexamined. This paper proposes DysLexLens, a low-resource LLM framework designed to analyse dyslexic learners' experiences with AI through online forum discussions. DysLexLens is designed as an end-to-end, evidence-traceable architecture which transforms noisy social media posts into a dictionary-driven corpus, provides knowledge-graph (KG)-based question reasoning, generates verifiable query responses, and enables response evaluation through quantitative and human-grounded assessment. DysLexLens has four key features. First, it employs a dictionary-driven filtering method to construct a more focused Reddit corpus on dyslexia and AI, filtering out noisy and weakly related posts to improve the relevance of data collected from low-resource forum contexts. Second, it integrates LLM-assisted semantic analysis with KG-based query reasoning to uncover meaningful patterns. Third, it uses quantitative evaluation metrics (RAGAS and Query Robustness) to measure the quality of LLM-generated responses. Fourth, it provides structured qualitative validation guidelines for assessing response quality, with a specific focus on hallucination and evidence alignment. We demonstrate the effectiveness of DysLexLens using dyslexia-related Reddit forum data and 30 questions. The results show its potential generalisability to other low-resource forum data contexts. DysLexLens, sample data, questions and evaluation results are available on GitHub to support reproducibility.
39. 【2606.27617】Masked Language Flow Models
Link:https://arxiv.org/abs/2606.27617
Authors:Iskander Azangulov,Kianoosh Ashouritaklimi,Leo Zhang,Simon Vary,Patrick Rebeschini
Subjects:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:Masked Diffusion Models, greatest efficiency gains, reverse transition factorises, few-step sampling regime, Diffusion Models
Comments: Preprint
Abstract:Masked Diffusion Models (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions -- an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. Flow Language Models (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.
40. 【2606.27598】Narrative-UFET: Narrative Generation for Ultra-Fine Entity Typing
Link:https://arxiv.org/abs/2606.27598
Authors:Mreedul Gupta,Advait Deshmukh,Ashwin Umadi,Matt Pauk,Maria Leonor Pacheco
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:Ultra-fine entity typing, current approaches struggle, assigns highly specific, Ultra-fine entity, assigns highly
Comments:
Abstract:Ultra-fine entity typing (UFET) assigns highly specific types to entity mentions, but current approaches struggle with types in the long tail. We hypothesize that a key limitation is the reliance on sentence-level context, since disambiguating evidence is often spread across multiple sentences. Testing this has been difficult because all existing UFET resources are sentence-level. We present Narrative-UFET, a controlled extension of UFET in which each entity mention is paired with an automatically generated short, coherent narrative. Synthesizing narratives lets us isolate the effect of specific discourse properties. We experiment with two paired variants: one in which the entity's type is held constant across the narrative (Maintain) and one in which it shifts (Change). We show that narrative context yields consistent improvements on long-tail types over sentence-level baselines, with the Change variant providing the stronger signal. A comparison against naturally occurring contexts shows that synthetic narratives yield stronger gains, indicating that controlled discourse construction can surface signals that real text leaves implicit. Substantial room for improvement remains, suggesting open directions in both discourse modeling and narrative construction.
41. 【2606.27595】Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents
Link:https://arxiv.org/abs/2606.27595
Authors:Minbyul Jeong
Subjects:Computation and Language (cs.CL)
Keywords:Web-agent benchmarks overwhelmingly, overwhelmingly measure depth, benchmarks overwhelmingly measure, Web-agent benchmarks, measure depth
Comments:
Abstract:Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete and every cell correct is far costlier than checking a single answer. I introduce Ko-WideSearch, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity -- a TV season, a dynasty, a league, an administrative region, an election -- and asks for its full membership plus a per-item attribute table, graded by Item-, Column-, and Row-F1. It spans 228 tables over 190 entities and sixteen categories across three difficulty tiers, set by two structural knobs I dial independently -- table width and a 2-D composite key -- so cross-product membership climbs from 0% to 100% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g., Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most, while cells with a standard answer such as a date or a name usually come out right.
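To make the gap between set-level and row-level scores concrete, here is a minimal sketch of an Item-F1 over enumerated item names and a Row-F1 that credits a row only when every attribute cell matches. The function names, the exact-match rule, and the sample table are assumptions for illustration; the benchmark's own comparator is normalization-aware and may define these metrics differently.

```python
def f1(num_correct, num_predicted, num_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if num_correct == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


def item_f1(pred, gold):
    """Score only whether the right items (row keys) were enumerated."""
    correct = len(set(pred) & set(gold))
    return f1(correct, len(pred), len(gold))


def row_f1(pred, gold):
    """Credit a row only if the item is in the gold set and all cells match exactly."""
    correct = sum(1 for key, row in pred.items() if gold.get(key) == row)
    return f1(correct, len(pred), len(gold))


# Hypothetical gold table and agent output: item name -> attribute cells.
gold = {
    "A": {"year": "2001", "host": "Kim"},
    "B": {"year": "2002", "host": "Lee"},
    "C": {"year": "2003", "host": "Park"},
}
pred = {
    "A": {"year": "2001", "host": "Kim"},
    "B": {"year": "2002", "host": "Choi"},  # right item, wrong cell
    "C": {"year": "2003", "host": "Park"},
}
```

In this toy case the agent enumerates the full set, so Item-F1 is perfect, while a single wrong cell already pulls Row-F1 down to two thirds.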
42. 【2606.27550】EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction
Link:https://arxiv.org/abs/2606.27550
Authors:Carrie Chen
Subjects:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:improve downstream text-generation, increase data density, downstream text-generation quality, density during training, improve downstream
Comments: 7 pages, 5 figures
Abstract:Multi-token prediction (MTP) has been shown to increase data density during training and improve downstream text-generation quality, and it serves as the de facto approach for self-speculative decoding. Existing foundation and open-source models that use MTP heads commit to a static tree-based attention topology throughout the entire generation sequence, meaning the speculation depth, and thus the compute required during verification, stays constant regardless of the context. This is fundamentally misaligned with the entropy patterns of natural language, where low-entropy regions often support reliable multi-step drafting, while high-entropy regions require more conservative speculation. To address this, we propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that toggles between tree-based attention topologies from a set of task-specific Pareto-optimal trees conditioned on a running estimate of local generation entropy. By matching speculation depth to context predictability, EntMTP maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing generation quality. When evaluated on the HumanEval, ShareGPT, GSM8K, and Litbench benchmarks, EntMTP consistently achieves a 1.15x speedup over the Hydra baseline and a peak speedup of 1.36x over the Medusa baseline.
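The scheduling idea can be sketched roughly as follows: keep a running estimate of next-token entropy and speculate deeper when it is low. This is a simplified illustration, not the paper's method: it reduces the choice among tree topologies to a single integer depth, and the thresholds, depths, decay factor, and class name are all made-up values.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a next-token distribution (zero entries skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class EntropyDepthScheduler:
    """Pick a speculation depth from a running entropy estimate (illustrative)."""

    def __init__(self, thresholds=(0.5, 1.5), depths=(4, 2, 1), decay=0.8):
        # thresholds, depths, and decay are hypothetical, not from the paper.
        self.thresholds = thresholds
        self.depths = depths
        self.decay = decay
        self.running = None

    def update(self, probs):
        """Fold one next-token distribution into the estimate and return a depth."""
        h = shannon_entropy(probs)
        # Exponential moving average as the running local-entropy estimate.
        if self.running is None:
            self.running = h
        else:
            self.running = self.decay * self.running + (1 - self.decay) * h
        for threshold, depth in zip(self.thresholds, self.depths):
            if self.running < threshold:
                return depth
        return self.depths[-1]
```

Because the estimate is smoothed, one uncertain token after a confident stretch does not immediately collapse the depth; a lower decay would make the scheduler react faster.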
43. 【2606.27538】The Context-Ready Transformer
Link:https://arxiv.org/abs/2606.27538
Authors:Mahesh Godavarti
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:D-layer transformer block, recurrent neural network, network architecture built, D-layer transformer, neural network architecture
Comments: NeurIPS, 22 pages
Abstract:We introduce the context-ready transformer, a new recurrent neural network architecture built from a D-layer transformer block that pre-contextualizes each token before it enters the block. During left-to-right generation, a correction network combines the previous position's block output -- a cached summary of past context -- with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding. At sequential inference, the correction chain makes the architecture a recurrent neural network. For training, we unroll the correction process K times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction FFN and fine-tuning. We evaluate across widths, depths, block sizes, and two datasets, with all comparisons against standard transformers, variants, and ablations. A D=5 model beats a 12-layer transformer while generating 1.7x faster on an A100. With K=10, a single-layer model (D=1) beats a 6-layer transformer with a 2.6x inference speedup, and sequential inference matches parallel K=10 to within 0.01 PPL. The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, D=1 trained with BPTT solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.
44. 【2606.27510】The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
Link:https://arxiv.org/abs/2606.27510
Authors:Sankaran Vaidyanathan,David Arbour,Aaron Mueller,Scott Niekum,David Jensen
Subjects:Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords:primary tool, tool in mechanistic, Activation patching, INT, causal
Comments:
Abstract:Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the activation patching estimand from causal mediation analysis, we find that the NIE does not solely capture the causal effect through the specific component. It also contains interaction effects (INT) that measure how much the component's causal effect itself depends on the state of other components in the model. A natural response may be to try to eliminate INT by adjusting the estimator or unit of analysis, but each of these potential remedies has predictable failure modes. We demonstrate these failure modes in the GPT-2 IOI circuit; components whose causal importance is conditional on the state of other components are either invisible or artificially inflated, and INT variance explains the previously documented instability of faithfulness scores. We prove that INT scales with the distance between clean and patched component activations, is negligible when the model is locally affine, and decomposes combinatorially into pairwise and higher-order group interactions. Despite its inevitability, INT is not a nuisance to be eliminated, but rather a diagnostic for interpretability studies. Its individual and group-level magnitude and sign signal when causal conclusions are prompt-dependent, and when greedy NIE-based component ranking will miss mechanisms only discoverable through combinatorial search.
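A toy example may help show what an interaction effect between two components looks like. The sketch below uses a generic second-order difference on a two-mediator function; it is not the paper's formal NIE or INT estimand, and the function names and the two toy models are assumptions for illustration.

```python
def patch_effect(f, clean, patched, other):
    """Change in output when one mediator is patched, holding the other fixed."""
    return f(patched, other) - f(clean, other)


def interaction(f, m1_clean, m1_patched, m2_clean, m2_patched):
    """How much the effect of patching m1 depends on the state of m2."""
    return (patch_effect(f, m1_clean, m1_patched, m2_patched)
            - patch_effect(f, m1_clean, m1_patched, m2_clean))


def affine(m1, m2):
    """Toy model with no cross term: each mediator contributes independently."""
    return 2.0 * m1 + 3.0 * m2 + 1.0


def multiplicative(m1, m2):
    """Toy model where m1 only matters when m2 is active."""
    return m1 * m2
```

For the affine toy model the interaction term vanishes, in line with the abstract's remark that INT is negligible when the model is locally affine; for the multiplicative one, the measured effect of m1 depends entirely on where m2 sits.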
45. 【2606.27500】Aloe-Vision: Robust Vision-Language Models for Healthcare
Link:https://arxiv.org/abs/2606.27500
Authors:Jaume Guasch-Martí,Enrique Lopez-Cuena,Martín Suárez-Fernández,Jordi Bayarri-Planas,Anna Arias-Duart,Dario Garcia-Gasulla
Subjects:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords:Large Vision-Language Models, promising research direction, research direction due, Large Vision-Language, specialized in healthcare
Comments: MIDL 2026
Abstract:Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. However, progress is constrained by the scarcity of high-quality medical multimodal data, concerns about robustness in safety-critical settings, and the narrow and potentially contaminated evaluation benchmarks that limit reliable assessment. To address these issues, the field requires state-of-the-art solutions to be fully open and reproducible systems in which all components can be inspected, evaluated, and improved. This work introduces Aloe-Vision-Data, a large-scale, quality-filtered mixture which integrates both medical and general domains across multimodal and text-only sources, designed for direct use in model fine-tuning. Building on this dataset, we train the Aloe-Vision family of medical LVLMs, openly released with full weights, training recipes and data, in two scales (7B and 72B). Through comprehensive benchmarking, we demonstrate that high quality training mixtures produce balanced LVLMs which yield significant gains over the baseline models without compromising general capabilities, achieving competitive performance with respect to state-of-the-art alternatives. To support reliable evaluation, we introduce CareQA-Vision, a carefully curated vision benchmark derived from MIR and EIR exams, the residency entrance exams for medical and nursing specialists in Spain, offering novel vision questions with low likelihood of contamination. Finally, we show that current LVLMs remain vulnerable to adversarial and misleading inputs, underscoring reliability challenges in clinical contexts.
46. 【2606.27499】DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection
Link:https://arxiv.org/abs/2606.27499
Authors:Yujin Tang,Chenming Shang,Ruize Xu,Nikhil Singh
Subjects:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:matured rapidly, text side, interactive environment, existing benchmarks, agent genuinely
Comments: 16 pages
Abstract:Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write down. We introduce DMV-Bench (Code: this https URL), the first interactive benchmark for multimodal-agent visual memory. DMV-Bench is built on a controlled home-furnishing e-commerce catalogue of 1,000 product variants in which a text-leakage contract keeps the discriminative signal of each task in the pixels alone. Across a chain of autonomous shopping sessions, every visited product image carries a unique, pre-rendered incidental cue, and the agent is later asked to recall a particular cued product and navigate to its URL. Inspired by dual-coding theory, we propose DualMem, a memory architecture that maintains a visual and a verbal code in parallel. On DMV-Bench, DualMem outperforms a caption baseline and three recent multimodal agent-memory systems at every chain length J in {5, 10, 15, 50} on both Gemini 2.5 Flash and Qwen2.5-VL-7B, with the lead surviving controls for memory-bank size and encoding-position bias, and an asymmetric dual-coding regime in which vision carries the cue end-to-end while the verbal channel plays a smaller query-grounding role.
47. 【2606.27472】Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents
Link:https://arxiv.org/abs/2606.27472
Authors:Vedant Patel
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:Large language model, Large language, operate over long, multi-session interactions, user moves
Comments: 11 pages, 4 figures, 3 tables. Code, environment, model, and dataset: [this https URL](https://github.com/Vrin-cloud/supersede)
Abstract:Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that it is a distinct, unsolved failure. On the knowledge-update subset of LongMemEval, replacing an agent's full context with a bounded, self-maintained memory drops accuracy from 92% to 77% even on a frontier model (gpt-5.4), a gap that is statistically significant (paired McNemar p < 0.005) and persists across model scale while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance, not comprehension, and is not closed by a stronger model. We then ask whether this is merely an undersized memory, and find it is not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not the compression ratio. We release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for stale ones. Finally, we close the loop and show the gap is trainable: GRPO fine-tuning a small open model (Qwen2.5-3B) on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run), along a monotonic checkpoint curve indicating the learned policy, not the harness, carries the gain. To our knowledge this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence the supersession gap can be trained down, not only measured.
48. 【2606.27460】Developmental approach reveals the statistical learning of Neural Language Models: Transformers generalize from the most abstract statistical patterns
Link:https://arxiv.org/abs/2606.27460
Authors:Wang Bojun,Holly Jenkins,Elizabeth Wonnacott
Subjects:Computation and Language (cs.CL)
Keywords:Generative Transformer models, approach to investigate, neural language models, Generative Transformer, learning
Comments: 10 pages, 7 figures, oral presentation at Interdisciplinary Advances in Statistical Learning
Abstract:In this study, we use a developmental approach to investigate the statistical learning and mental representation of neural language models (NLM). A series of Generative Transformer models are trained on a synthetic grammar. The model states are saved at multiple stages in the course of training. Through analyzing how the internal representations of these models change in the developmental path, we found that NLMs acquire the most abstract global statistical knowledge at the beginning of learning and later acquire the relatively local statistical dependencies. This learning path contains many over-generalizations from the very beginning and these over-generalizations are gradually constrained in the later stage of learning. Based on this observation, we propose a new framework to explain the statistical learning and language cognition of NLMs.
49. 【2606.27457】Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving
Link:https://arxiv.org/abs/2606.27457
Authors:Yasmin Moslem,Magdalena Kacmajor,Vasudevan Nedumpozhimana,Ammar Abbas,Solmaz Panahi,David Lynch,Zhuangzhuang Nie,Alexandros Agapitos,Aleksandar Milenovic,Hongmeng Song,Yucheng Shi,Yue Pan,Patricia Buffini,John D. Kelleher
Subjects:Performance (cs.PF); Computation and Language (cs.CL)
Keywords:Efficient deployment, large language models, deployment of large, large language, production forces
Comments:
Abstract:Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones. To address this challenge, we propose a two-stage cascaded solution. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model. The cost budget for this routing process is set by an interpretable hyperparameter, tuned offline. Stage 2 adds a quality estimation (QE) cascade; when an output from Stage 1 is judged low-quality, the query is escalated to a stronger model. This ensures only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
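The two stages described above can be sketched in a few lines. This is a minimal illustration under assumptions: the function signature, the 0.7 quality threshold, and the toy stand-ins for the clustering step, the models, and the quality estimator are all hypothetical and not taken from the paper.

```python
def route_and_escalate(query, cluster_of, model_for_cluster, models,
                       quality_of, strong_model, qe_threshold=0.7):
    """Stage 1: route by cluster. Stage 2: escalate if the QE score is low."""
    cluster = cluster_of(query)
    first_model = model_for_cluster[cluster]
    answer = models[first_model](query)
    if first_model != strong_model and quality_of(query, answer) < qe_threshold:
        # Low estimated quality: retry on the stronger, more expensive model.
        return strong_model, models[strong_model](query)
    return first_model, answer


# Toy stand-ins (hypothetical) for the real components.
def toy_cluster_of(query):
    return "hard" if "prove" in query else "easy"


def toy_quality_of(query, answer):
    # Pretend the small model only handles short queries well.
    return 0.9 if len(query) < 20 else 0.2


TOY_MODELS = {
    "small": lambda query: "short answer",
    "large": lambda query: "detailed answer",
}
TOY_ROUTING = {"easy": "small", "hard": "large"}


def toy_serve(query):
    return route_and_escalate(query, toy_cluster_of, TOY_ROUTING, TOY_MODELS,
                              toy_quality_of, strong_model="large")
```

Only queries that were routed to a cheaper model and then scored below the threshold pay for a second call, which is where the cost saving in such a cascade would come from.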
50. 【2606.27446】Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026
Link:https://arxiv.org/abs/2606.27446
Authors:Akash Kumar Gautam,Serhii Hamotskyi,Christian Hänig
Subjects:Computation and Language (cs.CL)
Keywords:describes team HSA, paper describes team, extracting cause-effect relations, extractive question answering, team HSA
Comments:
Abstract:This paper describes team HSA_CORAL's submission to the FinCausal 2026 shared task on extracting cause-effect relations from financial narratives via extractive question answering in English and Spanish. We compare three modeling families: (i) encoder-only token tagging with multilingual BERT, (ii) encoder-decoder generation with multilingual BART, and (iii) decoder-only LLMs (Llama 3.1 and GPT variants) using prompt refinement, few-shot demonstrations, and supervised fine-tuning. Across settings, prompting and few-shot examples yield competitive performance, while supervised fine-tuning provides the largest gains. Our best system, GPT-4.1 Mini fine-tuned on combined English and Spanish training data, achieves a tied highest score on the English subtask (score 4.8140) and ranks third on Spanish (score 4.7753) under the shared task's LLM-as-a-judge metric. Overall, the results highlight the value of task-specific adaptation and multilingual fine-tuning for cross-lingual transfer in financial causality QA.
51. 【2606.27409】Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement
Link:https://arxiv.org/abs/2606.27409
Authors:Igor Itkin
Subjects:Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
Keywords:Multi-agent large language, Multi-agent large, large language model, systems often rely, suppress hallucinations
Comments: 20 pages, 5 figures, 1 table. Code and data: [this https URL](https://github.com/YehudaItkin/delayed-verification-llm)
Abstract:Multi-agent large language model (LLM) systems often rely on verifier and critic agents to suppress hallucinations, but verification is delayed. During this delay, false claims can propagate through the agent network. We model this process as delayed consensus on a graph with grounded corrector nodes. Spectral decomposition by the grounded Laplacian yields a closed-form stability threshold for the verification dose: correction that is too strong or too delayed can turn consensus into oscillation. The most unstable regime occurs when the communication and verification delays coincide; for delay two, the threshold is the inverse golden ratio. The same framework gives a supermodular placement objective and a greedy (1-1/e)-approximation rule for assigning a limited corrector budget to influential nodes. Experiments across five open models confirm the predicted dose-delay oscillations. By contrast, grounded factual answering makes truth an absorbing boundary and eliminates the effect, suggesting that the instability is specific to signed-belief tasks while grounded verification remains stabilizing.
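For intuition about why a delayed correction can tip into oscillation, consider a scalar toy recursion in which the belief error is corrected using a reading that is two steps old: x[t+1] = x[t] - gain * x[t-2]. This recursion is an assumption chosen for illustration, not the paper's graph model; its stability boundary works out to 2 sin(pi/10) = (sqrt(5) - 1) / 2, the inverse golden ratio. The function names and step count below are likewise hypothetical.

```python
import math

INVERSE_GOLDEN_RATIO = (math.sqrt(5) - 1) / 2  # about 0.618


def simulate(gain, delay=2, steps=4000):
    """Iterate x[t+1] = x[t] - gain * x[t - delay] from a unit initial error."""
    history = [1.0] * (delay + 1)
    for _ in range(steps):
        history.append(history[-1] - gain * history[-1 - delay])
    return history


def is_stable(gain, delay=2, steps=4000):
    """Treat the run as stable if the late-time error is below the initial one."""
    tail = simulate(gain, delay, steps)[-20:]
    return max(abs(x) for x in tail) < 1.0
```

Calling `is_stable(0.60)` and `is_stable(0.64)` brackets the boundary: a gain slightly under the threshold lets the oscillation die out, while a gain slightly over it makes the same correction amplify the error.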
52. 【2606.27401】Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval
Link:https://arxiv.org/abs/2606.27401
Authors:Leonardo Venuta,Francesco Tosoni,Paolo Ferragina
Subjects:Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords:Semantic code search, software development, Semantic code, clone detection, detection are essential
Comments: 15 pages, 4 figures
Abstract:Semantic code search and clone detection are essential for software development, maintenance, and reuse. This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage recall in large-scale code-to-code search engines. Benchmarking across multiple programming languages and datasets reveals critical limits in the precision and scalability of these models on Terabyte-scale source-code collections. We present LLM-based code normalisation and query-rewriting schemes that yield significant gains in precision for lower-performing models. Our results question the sustainability of resource-constrained deployment and the assumed robustness of current code-specialised LLMs across datasets. We conclude with actionable insights for building scalable, efficient code-retrieval systems.
53. 【2606.27383】CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models
Link:https://arxiv.org/abs/2606.27383
Authors:Yu Fu,Yongqi Kang,Yong Zhao
Subjects:Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:Large language models, Large language, language models, calibrate research takeaways, remains unclear
Comments:
Abstract:Large language models (LLMs) are increasingly used as research assistants, yet it remains unclear whether they can calibrate research takeaways to the strength and scope of the supporting evidence. We study evidence-calibrated scientific briefing: given a bounded package of related papers, a system should generate package-level takeaways with evidence strength, scope boundaries, and missing-evidence caveats. We contribute a verified pilot benchmark of 16 heterogeneous scientific evidence packages and 96 human-verified takeaways, and we use CalBrief, an auditable role/gap/strength framework, as a diagnostic probe to locate where briefing breaks down. Under a fair-schema evaluation, structured organization improves role and gap reasoning, but an explicit strength-calibration policy is systematically over-conservative and falls below majority and direct-LLM baselines. To explain why, we run a controlled diagnostic across three closed-model backbones (GPT-4o, Claude Sonnet, Gemini Flash) that separates three potential causes of conservatism. Approximately 63% of the conservatism gap is attributable to expanding the label space from binary {moderate, weak} to four-way {moderate, weak, uncertain, insufficient_evidence} (p < 0.001 across all backbones); only 1% is attributable to gap/scope signal injection (not significant); the remaining 36% arises from the pipeline policy itself. We also find that 4-way predictions can be post-hoc collapsed back to binary and then match or exceed direct binary prompting, so the extra labels carry information that strict matching hides. Label-level strength judgment and auditable evidence organization are distinct abilities currently in tension, and should be evaluated separately for LLM research assistants.
54. 【2606.27380】A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges
Link:https://arxiv.org/abs/2606.27380
Authors:Wen Liang,Li Siyan,Zackary Rackauckas,Julia Hirschberg
Subjects:Computation and Language (cs.CL)
Keywords:computer-assisted pronunciation training, oral presentations sits, compared existing systems, automated presentation coaching, categorizes automated presentation
Comments: accepted into the BEA 2026 workshop at ACL
Abstract:Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing systems along these dimensions. This survey reviews and categorizes automated presentation coaching systems, spanning pronunciation tutors, fluency and prosody coaches, multimodal trainers, and conference QA practice tools. We introduce a five-dimensional task taxonomy - covering segmental pronunciation, lexical stress, suprasegmental prosody, pacing, and content faithfulness - and explicitly map surveyed systems onto it to reveal coverage gaps. We further review the core technical methods these systems employ: TTS-based exemplar generation and diagnostic methods for pronunciation, prosody, and fluency assessment. Key open challenges include the scarcity of annotated presentation corpora, achieving accent-fair feedback across diverse L1 backgrounds, and delivering low-latency diagnostics for real-time rehearsal.
55. 【2606.27379】Position: The Term "Machine Unlearning" Is Overused in LLMs
Link:https://arxiv.org/abs/2606.27379
Authors:Sangyeon Yoon,Yeachan Jun,Albert No
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:Large language models, increasingly face demands, regulatory deletion obligations, language models increasingly, models increasingly face
Comments: 13 pages; ICML 2026 Position Paper Track. Sangyeon Yoon and Yeachan Jun contributed equally
Abstract:Large language models increasingly face demands to "forget" training data, knowledge, or behaviors due to regulatory deletion obligations, copyright/licensing disputes, and safety or product-policy requirements. This position paper argues that machine unlearning is overused as a term in LLM research and should be reserved for dataset-defined deletion: removing the training influence of a precisely specified forget set such that the resulting model is approximately indistinguishable from retraining without that data. We contend that many tasks currently labeled "unlearning" (e.g., refusal for harmful requests, entity/knowledge removal, or targeted suppression) pursue different, often policy-dependent objectives and therefore require different terminology and baselines (e.g., alignment, suppression, editing, obfuscation). We further argue that this confusion is not cosmetic: because papers make different implicit guarantees under the same label, metrics and benchmarks are frequently reused outside their intended scope, rewarding surface-level non-disclosure (e.g., low ROUGE/forget accuracy) even when retraining-equivalence is not tested and derived capabilities remain. We conclude by calling for stricter terminology tied to explicit guarantees and reference models, and for evaluations that match the claimed objective.
56. 【2606.27378】Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
Link:https://arxiv.org/abs/2606.27378
Authors:Fahd Seddik,Fatemeh Fard
Subjects:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:axiomatic evaluation framework, benchmark accuracy masks, downstream benchmark scores, latent thought representations, reveal representational failures
Comments: 44 pages, 27 tables, 14 figures
Abstract:We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks. Existing evaluations conflate representation quality with model capacity. Therefore, failures cannot be attributed to the representation rather than to the model that processes it. We formalize four functional axioms (Causality, Minimality, Separability, and Stability) and define a quantitative measure for each, computed directly on the representation independently of downstream accuracy. We audit open-weight LLMs across 23 reasoning tasks (e.g., Spatial Reasoning, Factual QA). We find that no candidate satisfies all four axioms simultaneously, that the representations distinguish task type reliably but cannot distinguish between two questions within the same task, and that the representations encode little information beyond what is already present in the input embedding. The failure is consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a property of model size or training procedure.
57. 【2606.28249】HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech
Link:https://arxiv.org/abs/2606.28249
Authors:Sihang Nie,Xiaofen Xing,Rui Xing,Haoming Li,Ruitong Xiao,Jingyuan Xing,Baiji Liu,Xiangmin Xu
Subjects:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Keywords:Large Language Model, Large Language, achieved remarkable naturalness, Language Model, remarkable naturalness
Comments: 7 pages, 3 figures, 3 tables; Preprint
点击查看摘要
Abstract:Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimization offers a promising alternative, existing approaches suffer from two structural mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To overcome these challenges, we propose HPRO, a hierarchical progressive reward optimization framework. Within HPRO, we introduce the HD-Emo codec as a novel differentiable reward model to resolve the information conflict. It extracts speech into distinct content and style preference tokens, structurally isolating emotional optimization from semantic content. Building upon this structured preference space, HPRO bridges the scale gap by progressively aligning frame-, word- and sentence-level objectives. Experiments demonstrate that HPRO significantly enhances emotional expressiveness, while effectively preserving linguistic intelligibility. The code and audio samples are publicly available at this https URL.
58. 【2606.28105】Scaling limit of the Random Language Model
Link: https://arxiv.org/abs/2606.28105
Authors: Eric De Giuli
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL)
Keywords: Random Energy Models, Random Language Model, stochastic context-free grammars, hidden symbols, develop a quantitative
Comments: 17 pages + 14 pages SI
Abstract:We develop a quantitative theory of the Random Language Model (RLM), an ensemble of stochastic context-free grammars, in a scaling limit where the number of hidden symbols $N \to \infty$ while the grammar temperature $\tilde{\epsilon}_d \to 0$ at fixed $x = {\tilde\epsilon}_d \log N$. In this limit, the model admits a controlled description based on a large-deviation principle over rule-usage patterns. A semi-annealed approximation maps the problem to a class of Random Energy Models with nontrivial combinatorics. We show that the RLM exhibits a condensation transition at a critical value $x_c=1/8$, below which rule usage concentrates and language statistics acquire a nontrivial dependence on corpus length. A second characteristic scale at $x=1/2$ marks the onset of entropy reduction from its maximal value. Across these regimes, we derive explicit scaling laws for the number of distinct rules, entropy, and related observables, identifying distinct scaling, saturation, and critical regimes controlled by the interplay of grammar size, corpus length, and temperature. The theory resolves previous ambiguities regarding the existence of a thermodynamic transition and explains the slow approach to the large-$N$ limit as a consequence of the dependence on $\log N$. It further provides a unified framework in which universal statistical properties of language emerge from typical realizations of generative grammars, with implications for both natural language statistics and the behavior of large language models.
Information Retrieval
1. 【2606.28081】Context-Aware Explanations for Spatialized Document Layouts
Link: https://arxiv.org/abs/2606.28081
Authors: Wei Liu, John Wenskovitch, Chris North, Rebecca Faust
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: regions remains challenging, Spatialized document layouts, Spatialized document, text corpora, remains challenging
Comments: 10 pages, 4 figures, accepted to Graphics Interface 2026 (GI 2026)
Abstract:Spatialized document layouts are widely used for exploratory analysis of text corpora, but interpreting the spatial organization of documents and the relationships between regions remains challenging. Existing approaches primarily summarize document content or explain how layouts are generated, providing limited support for understanding spatial relationships within the layout itself. We present CAPE, a context-aware explanation framework that generates natural-language explanations grounded in both document semantics and layout-derived spatial context. CAPE identifies salient spatial patterns (e.g., clusters, subgroups, outliers, and bridging documents) and constructs multi-level contextual representations to guide LLM-based explanation generation. It supports both AI-guided overview and user-driven exploration, with explanations available at multiple levels of detail. We demonstrate CAPE on news and scholarly document layouts and evaluate it in a controlled user study against keyword-based and content-only LLM baselines. Our results suggest that spatially grounded explanations are perceived as more helpful than content-only baselines for interpreting the spatial organization of document layouts.
2. 【2606.28062】Single and Multi Truth Data Fusion using Large Language Models
Link: https://arxiv.org/abs/2606.28062
Authors: Hira Beril Kucuk, Norman W Paton, Jiaoyan Chen, Zhenyu Wu
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: data integration problem, Data fusion tasks, Large Language Models, Data fusion, integration problem
Comments:
Abstract:Data fusion, also known as truth discovery, is a data integration problem that aims to determine the correct value or set of values for each attribute of an object when presented with potentially conflicting values from multiple sources. Data fusion tasks belong to two main categories: single-truth scenarios, where each attribute has only one correct value, and multi-truth scenarios, where multiple values can be valid simultaneously. This paper investigates the use of Large Language Models (LLMs) in data fusion tasks for tabular data. Various prompting strategies, encompassing both single-truth and multi-truth scenarios, are investigated empirically. Domain-dependent, domain-independent, zero-shot and one-shot prompts are evaluated on three different benchmark datasets. Experimental results demonstrate that LLM-based approaches outperform traditional unsupervised truth discovery methods, such as DART and LTM, across all datasets. The codebase of this study has been made publicly available on GitHub.
3. 【2606.28059】Fast and Feasible: Permutation-based Constrained Reranking for Revenue Maximization
Link: https://arxiv.org/abs/2606.28059
Authors: Svetlana Shirokovskikh, Anastasiia Soboleva, Ekaterina Solodneva, Aleksandr Katrutsa, Roman Loginov, Egor Samosvat
Categories: Information Retrieval (cs.IR); Optimization and Control (math.OC)
Keywords: produced highly relevant, highly relevant search, relevant search results, produced highly, highly relevant
Comments:
Abstract:Search and recommender systems have produced highly relevant search results. A natural next step in the development of such systems in e-commerce is to rerank these results to increase the platform's revenue from paid promotion products. However, maximizing revenue alone may degrade the user experience by reducing relevance or increasing fraud risk. To avoid this, we state the reranking problem as an integer linear program (ILP) that maximizes revenue subject to per-query constraints on other metrics, e.g., relevance. Since solving the ILP exactly for every query is too slow for deployment in an online service, we propose PermR, a lightweight permutation-based approximation algorithm for reranking. At each step, the algorithm selects a pair of neighboring items and swaps them to either improve the objective or repair a violated constraint. We evaluate PermR across multiple categories of a large classified platform in offline and online settings. PermR achieves about 63% of the ILP revenue improvement within production latency limits while preserving all constraints. In a 14-day online A/B test over 56 million search queries, PermR increased revenue by 2%.
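The adjacent-swap step described in this abstract can be illustrated with a small sketch. This is not the authors' PermR implementation: the `Item` fields, the DCG-style position discount, the single relevance floor, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import log2


@dataclass(frozen=True)
class Item:
    name: str
    revenue: float    # hypothetical promotion revenue of the item
    relevance: float  # hypothetical relevance score of the item


def discounted_sum(values):
    """Position-weighted sum with an assumed DCG-style discount 1 / log2(rank + 2)."""
    return sum(v / log2(rank + 2) for rank, v in enumerate(values))


def rerank_by_adjacent_swaps(items, min_relevance, max_passes=10):
    """Greedy local search over swaps of neighbouring items.

    A swap is kept if it repairs a violated relevance constraint, or if it
    raises discounted revenue while the constraint stays satisfied.
    """
    order = list(items)
    for _ in range(max_passes):
        changed = False
        for i in range(len(order) - 1):
            candidate = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            cur_rel = discounted_sum(x.relevance for x in order)
            new_rel = discounted_sum(x.relevance for x in candidate)
            cur_rev = discounted_sum(x.revenue for x in order)
            new_rev = discounted_sum(x.revenue for x in candidate)
            if cur_rel < min_relevance:
                accept = new_rel > cur_rel  # repair step
            else:
                accept = new_rel >= min_relevance and new_rev > cur_rev
            if accept:
                order, changed = candidate, True
        if not changed:
            break
    return order


# Toy usage: start from a relevance-sorted list and allow a 5% relevance drop.
items = [Item("A", 0.0, 1.0), Item("B", 5.0, 0.9), Item("C", 1.0, 0.5)]
floor = 0.95 * discounted_sum(x.relevance for x in items)
reranked = rerank_by_adjacent_swaps(items, floor)
print([x.name for x in reranked])
```

Each accepted swap either strictly increases relevance (while infeasible) or strictly increases revenue (while feasible), so the loop terminates even without the `max_passes` cap.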
4. 【2606.27980】Listwise Explanation of Embedding-Based Rankings via Semantic Chunk Grouping
Link: https://arxiv.org/abs/2606.27980
Authors: Hyunkyu Kim, Yeeun Yoo, Youngjun Kwak
Categories: Information Retrieval (cs.IR)
Keywords: contextual sentence, embedding rankers score, Dense, Dense embedding, Abstract
Comments: 17 pages, 5 figures, 4 tables
Abstract:Dense embedding rankers score documents through contextual sentence- and passage-level representations. Yet many listwise explanation methods still attribute rankings to isolated words. This feature-unit mismatch leaves word-level features too fragmented for dense semantic ranking. We introduce ChunkGroupSHAP, a listwise Shapley method that clusters semantically related chunks into shared cross-document features. Masking a group perturbs all documents with related evidence, attributing rankings at a granularity closer to dense representations while preserving the listwise setup. Our findings across MS MARCO, FinanceBench, AILACaseDocs, and FinQA with E5 rankers and BM25 show that the best explanation unit is setting-dependent: word features for lexical BM25, corpus-level groups for dense rankers, and query-local grouping for heterogeneous web retrieval. Feature units should thus follow both the ranker's representational granularity and the structure of the retrieved corpus.
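To make the idea of attributing a ranking to feature groups more concrete, here is a minimal sketch of exact Shapley values over a handful of groups. It is not ChunkGroupSHAP itself: the group names and the toy `value` function (a stand-in for "ranking quality when only these groups are left unmasked") are hypothetical, and exact enumeration is only feasible for very few groups.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    `players` is a list of hashable feature groups; `value` maps a frozenset
    of groups to a real-valued score.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for p in ordering:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical value function: the ranking is only reproduced when both
# "pricing" and "terms" chunk groups are visible; "boilerplate" never matters.
def toy_value(groups):
    return 1.0 if {"pricing", "terms"} <= groups else 0.0


print(shapley_values(["pricing", "terms", "boilerplate"], toy_value))
```

In this toy game the two interacting groups split the credit equally and the irrelevant group receives none, which is the behaviour one would want from a group-level attribution.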
5. 【2606.27976】SHARD: cell-keyed residual splitting for alignment-resistant private dense retrieval
Link: https://arxiv.org/abs/2606.27976
Authors: Sergey Kurilenko
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Dense embeddings underpin, underpin semantic search, vector store hands, leaked vector store, search and RAG
Comments: arXiv admin note: text overlap with [arXiv:2606.26373](https://arxiv.org/abs/2606.26373)
Abstract:Dense embeddings underpin semantic search and RAG, yet a leaked vector store hands much of the underlying text back to whoever holds it. The attacks that make this possible (few-shot alignment, zero-shot inversion, unsupervised cross-space translation) share one weakness: the protected store is a single global geometry that can be aligned to a known one. A secret global rotation, the usual lightweight defence, is no exception: orthogonal Procrustes recovers it once the attacker has about the subspace dimension in known pairs. We introduce Shard, a retrieval-preserving embedding transform that removes this weak axis. The centred embedding is split into a short public prefix (for stage-1 retrieval) and a private residual sharded into C cells under separate secret keys; the residual is reranked under CKKS, where the keys cancel and leave the inner product exact. A single parameter C runs the design from the global-linear baseline it replaces (C=1) to per-document micro-keys (C=N). Because the rerank is full-dimensional, Shard returns the raw-space nDCG@10 that half-SVD truncation gives up; and because the residual is keyed cell-locally, mapping it back to a common frame under a diffuse known-plaintext leak costs roughly C times more anchors (median 200 to 102,400 at C=256), for a few encrypted queries. The short public prefix leaks far less neighbour structure, and a micro-key limit drives the residual graph to zero with an unlinkable, renewable template. The barrier holds against learned, non-linear and unsupervised aligners, and where a matched-utility noise defence de-anonymises almost every probe, Shard de-anonymises none. We are plain about the limits: within a cell the keys cancel, a targeted attacker needs only about d_priv anchors, and an overlapping reference corpus still leaks through the prefix. Shard is an attack-aware geometric defence, not a cryptographic guarantee.
6. 【2606.27930】An LLM-Powered Semantic Alignment Framework for Journal Recommendation
Link: https://arxiv.org/abs/2606.27930
Authors: Yanglin Yan, Zicheng Xie, Tianchen Gao, Rui Pan, Hansheng Wang
Categories: Information Retrieval (cs.IR); Applications (stat.AP)
Keywords: important task, scholarly information systems, Journal, Journal recommendation, information systems
Comments:
Abstract:Journal recommendation is an important task in scholarly information systems. Existing approaches typically rely on supervised learning models, manually engineered features, or historical interaction data, which may limit their generalizability and interpretability. We propose an LLM-powered semantic alignment framework that formulates journal recommendation as a semantic matching problem between manuscript content and journal scope descriptions. The framework enables large language models (LLMs) to infer journal suitability directly from article titles, abstracts, keywords, and candidate journal information without task-specific training. Experiments are conducted using DeepSeek-V3 on a dataset of 23,609 articles from 49 journals in statistics and related fields. The proposed framework achieves Top-3, Top-5, and Top-10 accuracies of 40.23%, 53.67%, and 70.05%, respectively. Additional analyses show that incorporating reference information generally improves recommendation performance and that recommendations remain highly stable across repeated runs, with an average Top-5 Jaccard similarity of 84%. The framework also generates interpretable reasoning outputs that provide insights into the recommendation process. These findings demonstrate the potential of LLMs as a training-free and scalable paradigm for journal recommendation and scholarly decision support.
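The two kinds of numbers reported here, Top-k accuracy and Top-k Jaccard stability across repeated runs, can be computed along the following lines. This is a sketch under assumptions: the abstract does not say how the Jaccard similarities are averaged, so the mean over all pairs of runs used below is a guess, and the function names and journal labels are invented for illustration.

```python
from itertools import combinations


def top_k_accuracy(ranked_lists, true_labels, k):
    """Fraction of articles whose true journal appears in the top-k recommendations."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, true_labels) if truth in ranked[:k])
    return hits / len(true_labels)


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs, k):
    """Average top-k Jaccard similarity over all pairs of repeated runs (assumed scheme)."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(x[:k], y[:k]) for x, y in pairs) / len(pairs)


# Three hypothetical repeated runs for one manuscript.
runs = [["JASA", "AoS", "JRSSB"], ["JASA", "AoS", "Biometrika"], ["JASA", "AoS", "JRSSB"]]
print(mean_pairwise_jaccard(runs, k=3))
```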
7. 【2606.27865】From Bootstrapping to Sequence Modeling: A Unified Generative Framework for Personalized Landing-Page Modeling
Link: https://arxiv.org/abs/2606.27865
Authors: Fan Li, Chang Meng, Jiaqi Fu, Shuchang Liu, Tianke Zhang, Xueliang Wang, Xiaoqiang Feng, Yongqi Liu, Kaiqiao Zhan
Categories: Information Retrieval (cs.IR)
Keywords: increasingly adopt multi-page, adopt multi-page architectures, platforms increasingly adopt, Modern online platforms, accommodate diverse user
Comments: arXiv admin note: text overlap with [arXiv:2507.23459](https://arxiv.org/abs/2507.23459)
Abstract:Modern online platforms increasingly adopt multi-page architectures to accommodate diverse user needs. On these platforms, page navigation (the process of directing users to specific functional pages upon app entry) serves as a critical gateway that shapes users' first impressions and significantly influences subsequent engagement. To optimize this process, Kuaishou formulated the task of Personalized Landing Page Modeling (PLPM) and proposed KLAN, a reinforcement learning framework built upon Conservative Q-Learning (CQL). However, CQL-based approaches suffer from two fundamental limitations: (1) the Markov assumption fails to capture the strong non-Markovian temporal dependencies inherent in real-world user behaviors, and (2) TD learning with bootstrapping incurs severe cumulative errors and credit assignment difficulties under delayed rewards, particularly in long-horizon settings where users enter the app multiple times daily. To address these limitations, we propose GLAN (Generative Landing-page Adaptive Navigator), a sequence modeling framework built on Decision Transformer to tackle PLPM from a unified global-local perspective. Specifically, GLAN incorporates two key modules. First, we design the L-RTG module that captures users' inter-day consumption dynamics to provide accurate global guidance for all page assignments within a day. Furthermore, we propose the HRM module that decomposes session-level feedback into fine-grained signals, enabling precise local supervision for each page assignment. Extensive online experiments conducted on the Kuaishou platform demonstrate the effectiveness of GLAN, achieving +0.158% and +0.108% improvements on Daily Active Users (DAU) and user Lifetime (LT) respectively.
8. 【2606.27743】End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference
Link: https://arxiv.org/abs/2606.27743
Authors: Yuhang Chen, Jinhao Duan, Ruichen Zhang, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Parish Aggarwal, Frank Shyu, Luke Simon, Sandeep Pandey, Tianlong Chen, Xi Liu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, fixed computational graph, Language Models, typically deployed
Comments:
Abstract:Large Language Models (LLMs) inference is typically deployed under a static resource assumption, where models execute a fixed computational graph regardless of the runtime environment. However, real-world cloud infrastructure is inherently dynamic, characterized by fluctuating availability (e.g., spot instance preemption) and tiered Quality-of-Service requirements. In such volatile settings, static models are inflexible: they either crash under resource constraints or waste compute on redundant operations. To bridge this gap, we propose Learning to Allocate (L2A), an end-to-end framework for resource-adaptive inference. Unlike prior methods that condition only on input difficulty, we formulate inference as a constrained allocation problem conditioned on both the input and the runtime resource budget itself. We introduce lightweight, budget-conditioned and input-aware gating networks integrated into the LLM. These gates are trained via a unified objective that jointly optimizes task performance, logical consistency, and resource costs along three axes matching how real-world dynamics manifest: layer skipping for memory and depth pressure, head pruning for throughput contention, and reasoning-token reduction for latency tightening. This lets the model learn a budget-aware policy beyond input difficulty alone: it adaptively configures its computational footprint with respect to real-time resource dynamics, maximizing reasoning depth when resources permit while enforcing strict frugality when budgets tighten. A single L2A model traces the entire compute-accuracy Pareto frontier on Llama-3-8B and Qwen-3-4B: at up to 34% realized layer sparsity, it stays within 0.6% of the dense baseline on GSM8K, with the same gap holding zero-shot on out-of-distribution tasks, while every static or heuristic baseline requires a separately tuned model and still drops by 5-10% at comparable inference time.
9. 【2606.27732】Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
Link: https://arxiv.org/abs/2606.27732
Authors: Yuhang Chen, Xianfeng Wu, Jinhao Duan, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Parish Aggarwal, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Tianlong Chen
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Discrete diffusion language, diffusion language models, Discrete diffusion, offering significant speedups, recover masked tokens
Comments:
Abstract:Discrete diffusion language models (dLLMs) recover masked tokens in parallel, offering significant speedups over autoregressive (AR) generation. However, such promising frameworks face a fundamental architectural design dilemma: (1) adopting bidirectional attention achieves strong generation quality by allowing each position to access the full context, but is inherently incompatible with KV caching, limiting inference throughput in batch-serving scenarios; (2) conversely, causal attention enables efficient cached inference but loses all right-side context, substantially degrading generation quality. This paper introduces Bifocal dLLMs, a new paradigm that resolves this dilemma through asymmetric bidirectional context. Analogous to bifocal lenses, we instantiate the paradigm as R2LM (Right-to-Left Mamba), which combines two complementary mechanisms: (a) standard causal attention providing precise left-context with full KV cache compatibility, and (b) a lightweight reverse Mamba SSM sidecar supplying compressed right-side context without breaking cacheability. Comprehensive experiments on continued pretraining of Qwen3-1.7B with 60B tokens demonstrate that R2LM achieves $2.4\times$ to $12.9\times$ higher throughput than bidirectional dLLMs and $1.9\times$ to $2.9\times$ speedup over AR baselines in batch serving through parallel decoding with KV caching, while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.
10. 【2606.27684】Intuition-Guided Latent Reasoning for LLM-Based Recommendation
Link: https://arxiv.org/abs/2606.27684
Authors: Chang Liu, Yimeng Bai, Xiaoyan Zhao, Yang Zhang, Qifan Wang, Fuli Feng, Wenge Rong
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Language Models, complex problem-solving tasks, demonstrated impressive reasoning
Comments:
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, motivating their use for preference reasoning in recommender systems. Latent reasoning, which operates in continuous hidden spaces rather than discrete tokens, has recently emerged as a promising paradigm for LLM-based recommendation. However, existing methods often start from unconstrained reasoning points, where hidden representations are misaligned with target item embeddings, leading to suboptimal reasoning trajectories. Inspired by cognitive neuroscience, which suggests that human multi-step reasoning is guided by intuition as a latent prior, we propose IntuRec, a two-stage framework that anchors latent reasoning with recommendation intuition. In the extraction stage, the LLM-based recommender generates a top-$K$ candidate set based on users' histories as the source of intuition. In the injection stage, the candidate set is transformed into a preference-aligned intuition embedding using self- and cross-attention mechanisms, which initializes the reasoning start point and guides subsequent latent reasoning. By providing a semantically grounded starting point, IntuRec efficiently explores the preference space along more accurate reasoning trajectories. Extensive experiments on multiple real-world datasets demonstrate that IntuRec consistently outperforms state-of-the-art baselines. We release our code at this https URL.
11. 【2606.27619】DysLexLens: A Low-Resource LLM Framework for Analysing Dyslexic Learners Insights from Online Forums
Link: https://arxiv.org/abs/2606.27619
Authors: Dana Rezazadegan, Atie Kia, Phongpadid Nandavong, Dominique Carlon, Jeremy Nguyen, Abhik Banerjee, James Marshall, Anthony McCosker, Yong-Bin Kang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Dyslexic learners increasingly, artificial intelligence, study-related tasks, dyslexic learners experience, increasingly use artificial
Comments:
Abstract:Dyslexic learners increasingly use artificial intelligence (AI) tools to support reading, writing, organisation, and study-related tasks. However, their lived experiences with these tools remain largely underexamined. This paper proposes DysLexLens, a low-resource LLM framework designed to analyse dyslexic learners' experiences with AI through online forum discussions. DysLexLens is designed as an end-to-end, evidence-traceable architecture which transforms noisy social media posts into a dictionary-driven corpus, provides knowledge-graph (KG)-based question reasoning, generates verifiable query responses, and enables response evaluation through quantitative and human-grounded assessment. DysLexLens has four key features. First, it employs a dictionary-driven filtering method to construct a more focused Reddit corpus on dyslexia and AI, filtering out noisy and weakly related posts to improve the relevance of data collected from low-resource forum contexts. Second, it integrates LLM-assisted semantic analysis with KG-based query reasoning to uncover meaningful patterns. Third, it has quantitative evaluation metrics (RAGAS and Query Robustness) to measure LLM-generated response performance. Fourth, it provides structured qualitative validation guidelines for assessing response quality, with a specific focus on hallucination and evidence alignment. We demonstrate the effectiveness of DysLexLens using dyslexia-related Reddit forum data and 30 questions. The results show its potential generalisability to other low-resource forum data contexts. DysLexLens, sample data, questions and evaluation results are available on GitHub to support reproducibility.
12. 【2606.27559】A Sensitivity-Aware Test Collection for Search Among Personal Information
Link: https://arxiv.org/abs/2606.27559
Authors: Jack McKechnie, Graham McDonald, Craig Macdonald
Categories: Information Retrieval (cs.IR)
Keywords: Traditional search tasks, search tasks aim, Traditional search, tasks aim, aim to satisfy
Comments: SIGIR 2026 Resource Paper
Abstract:Traditional search tasks aim to satisfy user information needs by returning a subset of a collection of documents, ranked by the documents' relevance to a user query. However, some collections that contain useful information also contain sensitive personal information. Recently, there has been increasing interest in the development of Sensitivity-Aware Search (SAS) retrieval models to provide users with effective retrieval results without revealing such sensitive information. To develop such systems, test collections containing both sensitive and non-sensitive information, a set of queries, and query-document relevance assessments are required. The Enron email corpus contains real business-related emails, where some emails also contain sensitive personal information. However, the original Enron collection does not contain queries or query-relevance assessments. To this end, we crowdsource 150 query formulations for 50 different topics and 11,471 query-relevance assessments for a subset of the Enron documents that have been manually labelled for sensitivity. We follow best practices for using large language models (LLMs) in Information Retrieval evaluation to extend the collection further with additional LLM judged query-relevance assessments and sensitivity labels. We present baseline performances for relevance, sensitivity classification, and sensitivity-aware search on the collection. We make the collection available, including through the popular ir_datasets package, and provide pre-built sparse and dense indices on Huggingface to facilitate easy experimentation.
13. 【2606.27401】Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval
Link: https://arxiv.org/abs/2606.27401
Authors: Leonardo Venuta, Francesco Tosoni, Paolo Ferragina
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Semantic code search, software development, Semantic code, clone detection, detection are essential
Comments: 15 pages, 4 figures
Abstract:Semantic code search and clone detection are essential for software development, maintenance, and reuse. This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage recall in large-scale code-to-code search engines. Benchmarking across multiple programming languages and datasets reveals critical limits in the precision and scalability of these models on Terabyte-scale source-code collections. We present LLM-based code normalisation and query-rewriting schemes that yield significant gains in precision for lower-performing models. Our results question the sustainability of resource-constrained deployment and the assumed robustness of current code-specialised LLMs across datasets. We conclude with actionable insights for building scalable, efficient code-retrieval systems.
Computer Vision
1. 【2606.28323】DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand
Link: https://arxiv.org/abs/2606.28323
Authors: Dihong Huang, Zhenyu Wei, Zhuxiu Xu, Yunchao Yao, Sikai Li, Mingyu Ding
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: hand remains challenging, single hand remains, solve individual skills, remains challenging, solve individual
Comments: Project page: [this https URL](https://devon018.github.io/DexCompose-Webpage/)
Abstract:Dexterous manipulation policies can solve individual skills, but composing them to perform multiple tasks with a single hand remains challenging. Adding a new task on top of an existing manipulation skill often imposes conflicting demands on overlapping fingers and contact modes, causing destructive interference between preserving an existing manipulation outcome and executing a new one. We propose DexCompose, a role-aware residual composition framework that reuses pretrained dexterous policies for multi-task manipulation through explicit finger-level action ownership. Given two pretrained full-hand policies, DexCompose first collects successful post-task states from the first skill and performs release tests over candidate finger masks to identify which fingers are necessary for maintaining the established skill state. It then trains two asymmetric residual modules: a bounded residual stabilizer for task preservation, and a context-aware residual that adapts the frozen downstream policy only within the action subspace assigned to the new task. We evaluate the framework on 16 composite dexterous manipulation tasks spanning four object-retention skills and four downstream interactions. DexCompose achieves a 77.4% average composite success rate, demonstrating that structural action ownership with dual residuals offers a promising direction for composing dexterous skills beyond conventional policy chaining.
2. 【2606.28322】PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception
Link: https://arxiv.org/abs/2606.28322
Authors: Yana Wei, Hongbo Peng, Yanlin Lai, Liang Zhao, Kangheng Lin, En Yu, Keyu Lv, Han Zhou, Yin Tang, Haodong Li, Mitt Huang, Hangyu Guo, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rubric-based evaluation framework, saturated benchmark scores, framework that addresses, scores and real-world, introduce PerceptionRubrics
Comments: ICML 2026. Project page: [this https URL](https://weiyana.github.io/PerceptionRubrics)
Abstract:We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
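The contrast between a linear average and a gated score can be sketched in a few lines. The abstract does not give the exact formula, so the rule below (any failed Must-Right check zeroes the score, otherwise the score is the Easy-Wrong pass rate) is only one plausible reading, and the function names are hypothetical.

```python
def linear_score(must_right, easy_wrong):
    """Plain average over all rubric checks, regardless of their type."""
    checks = list(must_right) + list(easy_wrong)
    return sum(checks) / len(checks)


def gated_score(must_right, easy_wrong):
    """Assumed gating rule: any failed mandatory check gives 0.0.

    If every Must-Right check passes, the score is the pass rate on the
    Easy-Wrong checks (or 1.0 when there are none).
    """
    if not all(must_right):
        return 0.0
    easy_wrong = list(easy_wrong)
    if not easy_wrong:
        return 1.0
    return sum(easy_wrong) / len(easy_wrong)


# A caption that gets most details right but misses one essential fact.
must_right = [True, True, False]
easy_wrong = [True, True]
print(linear_score(must_right, easy_wrong), gated_score(must_right, easy_wrong))
```

Under the linear average the caption still scores 0.8, while the gated variant gives 0.0, which is the kind of sharp penalty the abstract describes.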
3. 【2606.28321】StructSplat: Generalizable 3D Gaussian Splatting from Uncalibrated Sparse Views
Link: https://arxiv.org/abs/2606.28321
Authors: Jia-Chen Zhao, Beiqi Chen, Xinyang Chen, Guangcong Wang, Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian reconstruction framework, requiring camera parameters, Gaussian reconstruction, present StructSplat, feed-forward and generalizable
Comments: Project page: [this https URL](https://structsplat.github.io) Code: [this https URL](https://github.com/J-C-Zhao/StructSplat)
Abstract: We present StructSplat, a feed-forward and generalizable 3D Gaussian reconstruction framework that operates directly on uncalibrated images without requiring camera parameters. Existing methods either rely on per-scene optimization or assume known camera poses, and often entangle geometry and appearance within a unified backbone, limiting reconstruction fidelity and generalization. Our key idea is to adopt a structured representation that organizes geometry, semantic, and texture cues with explicit roles in the reconstruction process. Specifically, we introduce a pixel-aligned feature injection mechanism to enable accurate texture modeling from 2D observations, incorporate semantic-aware priors to improve global consistency, and design a camera alignment strategy to prevent information leakage and improve generalization. Experiments show that our method significantly outperforms prior approaches on challenging benchmarks. On DL3DV, our method achieves 28.045 PSNR, surpassing AnySplat (22.377) by +5.67 dB. In cross-dataset evaluation, our method achieves +1.94 dB over AnySplat on ACID and +1.72 dB on RealEstate10K.
4. 【2606.28268】Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation
Link: https://arxiv.org/abs/2606.28268
Authors: Ali Zia, Usman Ali, Abdul Rehman, Umer Ramzan, Kang Han, Muhammad Faheem, Shahnawaz Qureshi, Wei Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: mitigating distribution shifts, promising paradigm, paradigm for mitigating, mitigating distribution, distribution shifts
Abstract:Test-time adaptation (TTA) has emerged as a promising paradigm for mitigating distribution shifts in deep models. However, existing TTA approaches for anomaly segmentation remain limited by their reliance on pixel-level heuristics, such as confidence thresholding or entropy minimisation, which fail to preserve structural consistency under noise and texture variation. Moreover, they typically treat anomaly maps as flat intensity fields, ignoring the higher-order spatial relationships that characterise complex defect geometries. We introduce TopoTTA (Topological Test-Time Adaptation), a novel framework that integrates persistent homology, a tool from topological data analysis, into the TTA pipeline to enforce geometric and structural coherence during adaptation. By applying multi-level cubical complex filtration to anomaly score maps, TopoTTA derives robust topological pseudo-labels that guide a lightweight test-time classifier, enhancing segmentation quality without retraining the backbone model. The approach avoids reliance on method-specific raw-score thresholding for mask binarisation, preserves connectivity, and generalises across both 2D and 3D modalities. Extensive experiments across six standard benchmarks (MVTec AD, VisA, Real-IAD, MVTec 3D-AD, AnomalyShapeNet, and MVTec LOCO) demonstrate an average 15% F1 improvement over state-of-the-art unsupervised anomaly detection and segmentation methods, with the largest gains on anomalies exhibiting complex geometric or structural variations. These findings suggest that integrating topological reasoning into test-time adaptation provides a principled route to structure-aware generalisation, bridging the gap between geometric learning and robust adaptation.
5. 【2606.28266】RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning
Link: https://arxiv.org/abs/2606.28266
Authors: Yelin Wang, Zijia Song, Shuo Ye, Chuanguang Yang, Miaoyu Wang, Yong Xu, Zhulin An, Yongjun Xu, Zitong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Remote Sensing Image, remote sensing images, bi-temporal remote sensing, Image Change Captioning, Sensing Image Change
Comments: Accepted by ECCV 2026
Abstract:Remote Sensing Image Change Captioning (RSICC) aims to describe changes between bi-temporal remote sensing images and holds significant research and application value. However, most existing methods rely on conventional deep learning architectures, and the limited model capacity constrains performance. Although large-model post-training techniques have achieved great success in general domains, their direct transfer to RSICC remains challenging due to data scarcity and the need for fine-grained change understanding. To address this, we propose RSICCLLM, the first post-training framework for large vision-language models in RSICC. Specifically, we design a data generation paradigm, release the instruction dataset RSICI, and establish a task-specific RSICC benchmark. We further introduce Difference-aware Supervised Fine-tuning to explicitly extract change representations and guide the model in perceiving and understanding temporal differences. In addition, we propose Dual-Negative Preference Optimization (DNPO), which employs two complementary negative-sample construction strategies to construct the preference dataset RSICP and further refine model performance. Extensive experiments validate the superior capability of RSICCLLM, which achieves outstanding results with only 7B parameters, surpassing models of substantially larger scales. The code and dataset will be made publicly available at this https URL.
6. 【2606.28226】Exposure Bias Can Alleviate Itself via Directional and Frequency Rectification in Flow Matching
Link: https://arxiv.org/abs/2606.28226
Authors: Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo, Yaoyuan Liang, Jiasheng Lu, Xiaoe Wang, Pei Liu, Ruiliu Fu, Ruqi Huang, Shao-Lun Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Flow Matching, remarkable generative performance, achieved remarkable generative, generative performance, achieved remarkable
Comments: arXiv admin note: text overlap with [arXiv:2512.04904](https://arxiv.org/abs/2512.04904)
Abstract:Flow Matching (FM) has achieved remarkable generative performance, yet it suffers from exposure bias due to discrepancies between training and inference. Existing mitigation strategies typically rely on static constraints or external heuristics. In this work, we propose that exposure bias itself inherently contains dynamic signals that can guide its own rectification. To leverage this, we introduce DEFAR (DirEctional-Frequency Adaptive Rectification). This framework simulates the single-step inference process during training to identify exposure bias. It utilizes directional and frequency-adaptive feedback signals from the bias itself to enhance the model's bias tolerance. It consists of two key components: (1) Anti-Drift Rectification (ADR). ADR treats inference-time drift as a signal to learn the direction to steer deviated states back toward the target. ADR endows the model with intrinsic active self-rectification capabilities; (2) Frequency Compensation (FC). Empirically, we observe that accumulated bias often stems from a lack of low-frequency components in high-noise stages, and exposure bias carries the missing frequency. FC leverages the bias itself as a self-feedback weighting factor to reinforce the missing frequency components. Experiments on CIFAR-10, CelebA-64, and ImageNet-256/512 show that DEFAR outperforms prior baselines and further demonstrates favorable scalability, compatibility, and inference robustness.
7. 【2606.28215】HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration
Link: https://arxiv.org/abs/2606.28215
Authors: Jiaxin Li, Yuxiang Wu, Zhenkai Zhang, Xinrui Shi, Haoyuan Wang, Yichen Zhao, Su Linxiang, Chenyang Yu, Mingyu Zhang, Yifan Ding, Boran Wen, Li Zhang, Ruiyang Liu, Yong-Lu Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Keywords: scaling Embodied, Extracting dynamic, highly efficient data, efficient data collection, data collection pathway
Comments: Accepted to ECCV 2026. 15 pages of main text and 39 pages of appendices. Project page: [this https URL](https://lijiaxin0111.github.io/HAT4D/)
Abstract:Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating VLMs with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multicamera rigs. As a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments demonstrate that HAT-4D achieves SOTA performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback improves interaction reconstruction. Moreover, the data produced by HAT-4D effectively improves baseline performance when used for fine-tuning. Our data and code are available at this https URL
8. 【2606.28182】LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior
Link: https://arxiv.org/abs/2606.28182
Authors: Qinhong Zhou, Chuang Gan, Anoop Cherian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: attracted growing attention, partially observable environments, Embodied agents operating, recent years, operating in decentralized
Comments: Accepted to ICML 2026
Abstract:Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)-based agents often exhibit behaviors that are misaligned with their partners or inconsistent with the environment state, leading to inefficient cooperation and poor task success. To address this challenge, we propose a novel framework, Learning Laws of Cooperation (LLawCo), that enables embodied agents to autonomously align with both their partners and task objectives. Our framework allows agents to reflect on past failures to extract misaligned behavioral patterns, which are used to derive high-level behavioral laws, such as "Talk when necessary" and "Wait for partner." These laws are explicitly incorporated into the agents' chains of thought via supervised fine-tuning, aligning their reasoning with task requirements and the behavior of other agents. To evaluate our approach, we introduce PARTNR-Dialog, a large-scale multi-agent communicative and cooperative planning benchmark built on the PARTNR environment. Experiments on existing tasks and our new benchmark demonstrate significant improvements in cooperative efficiency and task success rates. Across four backbone LLMs, our method achieves average success rate improvements of 4.5% on the PARTNR-Dialog benchmark and 6.8% on the TDW-MAT benchmark over state-of-the-art open-source communicative agent frameworks. See the LLawCo project page for details: this https URL
9. 【2606.28164】EchoSonar-R: A Multi-View Reasoning-Enabled Model for Disease Classification and Report Generation in Echocardiography
Link: https://arxiv.org/abs/2606.28164
Authors: Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: providing essential information, cardiac imaging modality, non-invasive cardiac imaging, imaging modality, providing essential
Abstract:Echocardiography is the most widely used non-invasive cardiac imaging modality, providing essential information for cardiovascular diagnosis. Interpreting an echocardiogram requires synthesizing complementary evidence across multiple heart views to identify abnormalities and produce structured clinical reports. While recent efforts focus on improving classification performance, most models lack explicit diagnostic reasoning and spatially grounded anatomical evidence, limiting clinician trust. We present EchoSonar-R, a multi-view reasoning-enabled vision-language model that jointly performs multi-label disease classification and report generation from echocardiography studies. EchoSonar-R combines a spatiotemporal video encoder with a structure-aware cardiac detector that provides spatially grounded anatomical cues to improve interpretability and clinician trust during cross-view reasoning. EchoSonar-R is trained in two stages: supervised fine-tuning (SFT) on reasoning-annotated targets, followed by Group Relative Policy Optimization (GRPO) with task-specific rewards that jointly align classification and report generation within a unified reinforcement-learning framework. Across a private multi-view dataset and two public benchmarks, EchoSonar-R improves macro balanced accuracy by 17.1% on the private set and 6.1% on MIMICEchoQA over the strongest baseline, achieves a GREEN clinical faithfulness score of 0.800, and produces interpretable reasoning traces grounded in multi-view visual evidence.
10. 【2606.28149】Toward Robust In-Context Segmentation via Concept Guidance
Link: https://arxiv.org/abs/2606.28149
Authors: Zhigang Chen, Xiawu Zheng, Rongrong Ji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: segment target regions, updating any parameters, In-context segmentation, segment target, target regions
Comments: ECCV 2026
Abstract: In-context segmentation (ICS) requires a model to segment target regions in a query image using only a few reference images and their corresponding masks, without updating any parameters. Despite recent progress, prior ICS studies have largely overlooked a critical aspect: system robustness, i.e., whether the model can produce stable segmentation results for the same query under different references. In this work, we revisit ICS from the robustness perspective and introduce a novel paradigm, Concept-Guided In-Context Segmentation (CG-ICS), which performs segmentation by extracting high-level semantic concepts from references rather than relying solely on low-level visual matching. Specifically, CG-ICS introduces a concept reasoning module that uses an MLLM to propose candidates and a SAM3-driven scoring function with tree-search refinement to select reliable textual concepts, together with a parallel visual exemplar route that provides query-side spatial grounding via a simple context construction. Both the textual concept and the visual exemplar are then used to activate the segmentation capability of a frozen SAM3 backbone. Extensive experiments on standard ICS benchmarks demonstrate that CG-ICS not only achieves state-of-the-art accuracy but also substantially improves robustness, yielding a more reliable ICS system with significantly reduced variance across diverse reference choices.
11. 【2606.28144】Monocular Avatar Reconstruction via Cascaded Diffusion Priors and UV-Space Differentiable Shading
Link: https://arxiv.org/abs/2606.28144
Authors: Hong Li, Minqi Meng, Yanjun Liang, Chongjie Ye, Houyuan Chen, Weiqing Xiao, Xianda Guo, Guojun Lei, Xuhui Liu, Chaojie Yang, Yanlun Peng, Hao Zhao, Baochang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: challenging ill-posed problem, high-quality PBR data, Reconstructing high-fidelity, ill-posed problem, primarily hindered
Comments: Accepted by ECCV 2026. Project page: [this https URL](https://luh1124.github.io/MARCUS-Avatar-Projectpage/)
Abstract:Reconstructing high-fidelity, relightable 3D avatars from a single in-the-wild image is a challenging ill-posed problem, primarily hindered by the scarcity of high-quality PBR data and the complexity of disentangling illumination from intrinsic materials. In this paper, we present a data-efficient framework that leverages the robust priors of a unified pre-trained diffusion backbone to sequentially address texture completion, delighting, and material decomposition. Unlike existing methods that rely on fragmented pipelines or extensive proprietary datasets, we utilize cascaded Low-Rank Adaptations (LoRAs) to adapt the strong generative prior of the diffusion model for each sub-task in UV space. Specifically, we first employ an Inpainting LoRA to complete missing UV textures caused by occlusion, leveraging the model's semantic understanding to generate semantically and photometrically coherent details. Subsequently, a Light-Homogenization LoRA and a novel Cross-Intrinsic Attention mechanism are introduced to remove baked-in lighting and collaboratively synthesize pixel-aligned PBR maps (Albedo, Normal, Roughness, Specular, and Displacement). To ensure physical plausibility, we impose a UV-space differentiable BRDF shading loss during the decomposition stage, forcing the generative process to adhere to the rendering equation without the artifacts typical of rasterization-based supervision. Extensive experiments demonstrate that our method, trained on fewer than 100 real 3D scans, generates comprehensive, 4K-resolution PBR assets with superior realism and generalization compared to state-of-the-art methods, and all training code and model weights will be released upon acceptance.
12. 【2606.28133】Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots
Link: https://arxiv.org/abs/2606.28133
Authors: Sijin Chen, Kaixuan Jiang, Haixin Shi, Yanhui Wang, Weiheng Zhong, Haosheng Li, Bo Jiang, Yuxiao Liu, Xihui Liu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: human, action, human data, parallel grippers, parallel gripper
Comments: Project Page: [this https URL](https://translation-as-a-bridging-action.github.io/)
Abstract:We study whether we can learn novel manipulation skills from human actions to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard: most prior work treats humans as just another bi-manual 6DoF embodiment, where hand-pose estimates are noisy and the contact patterns of human fingers differ fundamentally from those of a parallel gripper. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle the potential absence of certain action components in different embodiments, we build a $\pi_0$-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bi-manual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data.
13. 【2606.28128】PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
Link: https://arxiv.org/abs/2606.28128
Authors: Peiwen Zhang, Yufan Deng, Shangkun Sun, Juncheng Ma, Duomin Wang, Jonas Du, Zilin Pan, Ye Huang, Hao Liang, Songyan Huang, Ruihua Zhang, Enze Xie, Ming-Yu Liu, Daquan Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: embodied world simulation, promising paradigm, world simulation, Video, world
Comments: Github: [this https URL](https://github.com/DAGroup-PKU/PhysisForcing) Project website: [this https URL](https://dagroup-pku.github.io/PhysisForcing.github.io/#)
Abstract: Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3% and 9.2% (7.1% and 3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.
14. 【2606.28122】Higher-Order Fourier Neural Operator: Explicit Mode Mixer for Nonlinear PDEs
Link: https://arxiv.org/abs/2606.28122
Authors: Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide deep neural, deep neural networks, operators provide deep, Neural operators provide, Fourier Neural Operator
Comments: 46 pages
Abstract:Neural operators provide deep neural networks for learning mappings between function spaces. Among them, the Fourier Neural Operator (FNO) is particularly effective: its spectral convolution relies on low-dimensional Fourier-domain representations and can handle inputs at different resolutions. This design aligns well with settings where the Fourier basis diagonalizes the underlying operator, such as linear, constant-coefficient PDEs on periodic domains, in which Fourier modes evolve independently. However, nonlinear PDEs may benefit from an additional inductive bias, as they exhibit structured interactions between modes, governed by polynomial nonlinearities. To capture this inductive bias, we introduce the Higher-Order Spectral Convolution, a spectral mixer that extends FNO from diagonal modulation to explicit n-linear mode mixing, aligned with the dynamics of nonlinear PDEs. Our experiments on standard benchmarks show that the proposed Higher-Order FNO (HO-FNO) retains the efficiency of FNO-based architectures and consistently improves over other spectral neural operators. HO-FNO also performs on par with or better than state-of-the-art transformers and state-space models on several datasets, with stronger gains in highly nonlinear regimes, such as the Poisson equation with polynomial forcing, where a single HO-FNO layer outperforms FNO models with up to 16 layers. We open-source our code for reproducibility at: this https URL.
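The abstract's point that polynomial nonlinearities create structured interactions between Fourier modes can be checked numerically with the standard convolution theorem: the DFT of the pointwise product u·u equals the circular self-convolution of û divided by N, so mode k of u² collects every pair of modes (m, k-m). The sketch below is not the HO-FNO layer; the signal values and helper names are made up for illustration.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Unnormalized forward discrete Fourier transform (naive O(N^2) version)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
        for k in range(n)
    ]


def circular_convolve(a: Sequence[complex], b: Sequence[complex]) -> List[complex]:
    """Circular convolution: output mode k sums a[m] * b[(k - m) mod N] over m."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


# Arbitrary real signal on 8 grid points (illustrative values only).
u = [0.5, 1.0, -0.25, 2.0, 0.0, -1.5, 0.75, 1.25]
u_hat = dft(u)

# Spectrum of the quadratic term u^2, computed two ways.
lhs = dft([v * v for v in u])
rhs = [c / len(u) for c in circular_convolve(u_hat, u_hat)]

max_err = max(abs(p - q) for p, q in zip(lhs, rhs))
print(max_err < 1e-9)  # True
```

A diagonal spectral multiplier acts on each mode independently, so on its own it cannot produce this pairwise coupling; that is the gap an explicit mode mixer is meant to address.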
15. 【2606.28112】BiDeMem: Bidirectional Degradation Memory for Explainable Image Restoration
Link: https://arxiv.org/abs/2606.28112
Authors: Xinrui Wu, Lichen Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: obtains higher PSNR, Degradation-aware prompts, restored image obtains, image obtains higher, higher PSNR
Abstract:Degradation-aware prompts, conditions, and latent priors are increasingly used in image restoration, yet they are usually judged by a single endpoint: whether the restored image obtains higher PSNR. This is a weak test of semantics. A condition can help by adding capacity, acting as a global correction bias, or exploiting dataset shortcuts, without becoming an interpretable degradation prior. We propose BiDeMem, a bidirectional degradation memory for explainable image restoration. A query built from restoration features and input statistics retrieves a compact top-k subset of memory slots. The same selected slot identity supports the restoration path at inference time and a training-only forward-degradation explanation path. The study centers on verifiability in a controlled multi-degradation NAFNet setting. New controls separate the gain from a correction head alone, a dense query prior, and a static global prior: these variants are 0.2588, 0.2586, and 0.2839 dB below BiRank, respectively. Strong residual supervision and a wider degradation head also remain below the full bidirectional memory model. Intervention probes show that BiRank preserves restoration quality while increasing wrong-prior and native-prior sensitivity, framing degradation memory as both a restoration module and a falsifiable explanation mechanism.
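The retrieval step in this abstract (a query selecting a compact top-k subset of memory slots) can be sketched as follows. This is a minimal illustration under assumptions: dot-product scoring, softmax weights over the selected slots, and the names `retrieve_top_k` and `read_prior` are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple


def retrieve_top_k(
    query: List[float], memory: List[List[float]], k: int
) -> Tuple[List[int], List[float]]:
    """Return indices of the k highest-scoring slots and softmax weights over them."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    top = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def read_prior(
    memory: List[List[float]], indices: List[int], weights: List[float]
) -> List[float]:
    """Weighted sum of the selected slots, used here as a stand-in for the prior."""
    dim = len(memory[0])
    return [sum(w * memory[i][d] for i, w in zip(indices, weights)) for d in range(dim)]


# Four toy memory slots and one query (illustrative values only).
memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.2]

indices, weights = retrieve_top_k(query, memory, k=2)
prior = read_prior(memory, indices, weights)
print(indices)  # [0, 2]
```

In a setup like the one the abstract describes, the same `indices` would be passed to both the restoration path and the training-only explanation path, so that the two paths share one slot identity.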
16. 【2606.28104】Cross-view Multimodal Vision-Based Assessment Framework for Traditional Chinese Medicine Rehabilitation Training
Link: https://arxiv.org/abs/2606.28104
Authors: Francis Xiatian Zhang, Hao Yao, Shengxuan Chen, Hong Zhu, Hongxiao Jia, Sisi Zheng, Hubert P. H. Shum
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Traditional Chinese Medicine, Chinese Medicine, Traditional Chinese, computer vision offers, action quality assessment
Comments: Published in IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2026
Abstract:Vision-based assessment can provide convenient and cost-effective evaluation in Traditional Chinese Medicine (TCM) rehabilitation training, where action quality assessment (AQA) from computer vision offers a promising solution. Existing automatic AQA frameworks for physical therapy typically rely on skeletal data captured from a single viewpoint, which is inefficient for TCM techniques such as acupuncture or Tuina that involve dense hand self-occlusion and complex hand-object interactions. To address these challenges, we propose CME-AQA, a cross-view, multimodal vision-based assessment framework that integrates visual-pose fusion to enhance understanding of environmental context and leverages both first-person and third-person videos during training to improve inference robustness. We collected two dual-view datasets, TCM-AQA61-A (Acupuncture) and TCM-AQA61-T (Tuina), each containing synchronized first-person and third-person recordings of 61 subjects with expert annotations. Experimental results show that our approach achieves superior or comparable mean performance against competitive baselines, achieving over 10% relative improvement in weighted F1 over the best competing method on key rating tasks such as Needle Depth and Quick Needle Insertion, while also reducing mean absolute error in quantitative measures such as insertion time and manipulation frequency. Testing on a CPR dataset further demonstrates comparable performance on several posture-based criteria, suggesting applicability to related structured simulated clinical skill assessments where participant motion is central to evaluation. Overall, CME-AQA enhances assessment accuracy for structured TCM rehabilitation training and facilitates more convenient and effective training-oriented skill evaluation.
17. 【2606.28094】OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal
Link: https://arxiv.org/abs/2606.28094
Authors: Qinming Zhou, Chenxi Sun, Deyang Kong, Junhao He, Xiangheng Tang, Peike Yu, Haotian Wu, Leilei Cao, Linfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Real-world object removal, object non-local effects, target object non-local, Real-world object, key difficulties
Comments: Code and resources are available at [this https URL](https://github.com/Zhouqm-Git/osor)
Abstract:Real-world object removal is challenging due to two key difficulties: the target object's non-local effects, such as shadows and reflections, which are difficult to model, and the fact that user-provided masks are often inaccurate or incomplete. With billions of parameters and tens of denoising steps, diffusion-based models achieve strong removal performance at the expense of substantial computational cost, limiting their use in interactive applications and on edge devices. To address these challenges, we present OSOR (One-Step Object Removal), which simultaneously achieves efficient, effect-aware, and mask-robust object removal. Concretely, OSOR introduces: (1) an occupancy-guided discriminator for precise boundary supervision, enabling stable single-step diffusion training; (2) an alpha head that leverages knowledge from pretrained diffusion models to predict appropriate removal regions with minimal overhead, thereby handling imperfect masks; and (3) a semantic-anchored verification pipeline (SAVP) that filters noisy instruction-based triplets to produce effect-aware supervision at scale. Using SAVP, we curate CORNE, which contains 280K verified removal pairs, and further annotate AnimeEraseBench and TextEraseBench to evaluate performance on more complex removal tasks. Experiments show that OSOR surpasses strong multi-step diffusion baselines in perceptual quality while achieving $4\times$ to $30\times$ faster inference.
18. 【2606.28092】Diffusion Model Attribution via Spectral Coupling of Denoiser Responses
Link: https://arxiv.org/abs/2606.28092
Authors: Pragati Shuddhodhan Meshram, Varun Chandrasekaran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: intellectual property protection, property protection, Attributing a generated, fundamental challenge, challenge in provenance
Abstract:Attributing a generated image to its source diffusion model is a fundamental challenge in provenance verification and intellectual property protection. This problem is particularly difficult because diffusion models trained on different datasets can converge to similar score functions and thus similar output distributions, making the generated images themselves unreliable as attribution evidence. Existing non-invasive methods either fail on architecturally similar variants or rely on signals that vanish when models share the same autoencoder. We propose Spectral Denoising Signatures (SDS), a non-invasive attribution method that identifies the source model by fingerprinting each candidate model's denoising behavior. Our key insight is that a model's denoising score function exhibits a distinctive spectral geometry, reflected in how it redistributes energy across spatial frequency bands during denoising. By probing this behavior with frequency-controlled perturbations, SDS extracts a stable signature that is intrinsic to the model, requiring only standard forward passes with no inversion, optimization, or generation-time enrollment. Our results demonstrate that SDS achieves approximately 99.9% accuracy across eight diverse diffusion models and 96.2% under cross-domain prompt shift, outperforming non-invasive baselines across variations in training data, architecture, and training procedure, establishing spectral geometry as a principled and practical basis for diffusion model attribution. Code is available at: this https URL
19. 【2606.28089】RPM-Distill: Physiology-guided Adaptive Cross-modal Distillation for Robust Remote Physiological Measurement
Link:https://arxiv.org/abs/2606.28089
Authors:Jiyao Wang,Qingyong Hu,Duoxun Tang,Xiao Yang,Kaishun Wu,Jiangbo Yu
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Video-based remote physiological, remote physiological measurement, Video-based remote, skin tones, physiological measurement
Comments: Accepted by ECCV 2026
Abstract:Video-based remote physiological measurement (RPM) is highly accessible but remains fragile under varying illumination, skin tones, and motion. Radio frequency (RF) radar is largely invariant to illumination and appearance, providing complementary cardio-respiratory micro-motion cues; however, requiring radar at inference is often impractical due to its limited ubiquity and deployment overhead. We propose RPM-Distill, a physiology-guided cross-modal distillation framework that leverages synchronized radar only during training while retaining video-only inference. Our key observation is that although RGB and RF waveforms differ in sensing physics and time-domain morphology, they share a similar latent periodic rhythm in the frequency domain. We thus distill physiology-structured spectral evidence to improve robustness, via losses that (i) anchor the fundamental peak, (ii) match the off-peak background distribution, and (iii) preserve spectral morphology and sharpness. To avoid negative transfer under sample-level teacher quality and alignment uncertainty, a spectral policy network predicts sample-level distillation gates and component weights from the student-teacher spectral relation map, learned with a meta bilevel objective on a small labeled validation split. Through extensive experiments in challenging conditions and cross-dataset settings, RPM-Distill brings 81% MAE and 21% correlation improvement over unimodal baselines. Code is at this https URL.
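As a rough illustration of what "anchoring the fundamental peak" could mean, the sketch below finds the strongest spectral bin of a waveform inside a heart-rate band and measures how far a student's peak sits from a teacher's. It is not the RPM-Distill loss, which the abstract does not specify: the function names, the 0.7-3.0 Hz band, and the absolute-difference penalty are all assumptions made for this example.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Naive DFT magnitudes for bins 0..N//2 (adequate for short toy signals)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


def fundamental_peak_hz(signal, fs, lo_hz=0.7, hi_hz=3.0):
    """Frequency (Hz) of the strongest bin inside the assumed heart-rate band."""
    n = len(signal)
    mags = magnitude_spectrum(signal)
    band = [k for k in range(len(mags)) if lo_hz <= k * fs / n <= hi_hz]
    best = max(band, key=lambda k: mags[k])
    return best * fs / n


def peak_anchor_loss(student, teacher, fs):
    """Hypothetical penalty: absolute gap between the two spectral peaks, in Hz."""
    return abs(fundamental_peak_hz(student, fs) - fundamental_peak_hz(teacher, fs))
```

Two waveforms with different time-domain shapes but the same rhythm give a zero penalty under this definition, which matches the observation that the modalities agree in the frequency domain. A hard arg-max like this is not differentiable, so a trainable version would need a soft surrogate.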
20. 【2606.28083】STAG: Spatio-temporal Evolving Structural Representation of Action Units for Micro-expression Recognition
Link:https://arxiv.org/abs/2606.28083
Authors:Nandani Sharma,Varun Sharma,Dinesh Singh
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Keywords:challenging due, facial muscle movements, short-lived facial muscle, Micro-expression recognition, short-lived facial
Comments:
Abstract:Micro-expression recognition is challenging due to subtle and short-lived facial muscle movements. Existing methods rely heavily on apex-onset frames, overlook fine-grained inter-frame dynamics, and separately model spatial and temporal information, limiting generalization across datasets. To address these challenges, we propose STAG, a dynamic ROI-AU-coupled spatial-temporal network that jointly models motion flow and adaptive facial connectivity. The framework extracts optical flow from discriminative frames using magnitude-based selection and temporal attention. A dual-branch architecture combines an enhanced graph attention network for structured spatial reasoning with a transformer encoder for temporal modeling. A bidirectional cross-attention module enables mutual refinement of spatial and temporal features, while AU-guided dynamic connectivity adapts facial region interactions according to muscle activation patterns. The transformer captures subtle temporal dynamics beyond apex-based approaches, improving semantic consistency and interpretability for explainable micro-expression recognition. The fused representation is optimized using focal loss and evaluated on CASME II, 4DME, DFME, NaME, SAMM, and SMIC-HS. Extensive experiments demonstrate improved robustness, generalization, interpretability, and computational efficiency, confirming the effectiveness of adaptive relational reasoning, AU-guided dynamic connectivity, and deep spatial-temporal feature fusion for accurate cross-dataset micro-expression recognition.
21. 【2606.28077】TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts
Link:https://arxiv.org/abs/2606.28077
Authors:Boyuan Chen,Zichen Dang,Chuang Yang,Lap-pui Chau,Yi Wang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:detectors inevitably face, inevitably face distribution, text detectors inevitably, large-scale scene-text pretraining, face distribution shifts
Comments: Accepted by ECCV 2026. Project page: [this https URL](https://github.com/ZChenDang/TextDS)
Abstract:In real-world deployments, scene text detectors inevitably face distribution shifts beyond the training distribution. Prior work often depends on large-scale scene-text pretraining, yet evaluation under cross-domain changes and real-world imaging degradations remains limited. We propose TextDS, an efficient framework for scene text detection under distribution shifts. First, we propose a data-efficient dual-encoder design with visual foundation models, eliminating the reliance on large-scale scene-text pretraining. Second, we introduce Step-wise LoRA adaptation (SWLoRA), which performs progressive low-rank refinement with a dynamic early-exit mechanism for effective feature adaptation. Third, we propose Common Subspace Fusion (CSF) to align and fuse the two branches in a shared subspace while retaining complementary, shift-robust information. Finally, we construct adverse-condition scene text detection datasets to address the gap in evaluating under imaging degradation. Experiments show that TextDS achieves competitive performance in scene text detection, demonstrating robustness across domains and adverse imaging conditions with only 4.9M trainable parameters.
22. 【2606.28060】ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures
Link:https://arxiv.org/abs/2606.28060
Authors:Haoran Xu,Lechao Zhang,Daoguo Dong,Yan Gao,Xin Tan
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Embodied Artificial Intelligence, Artificial Intelligence, require object-level structure, Constructing simulation-ready, explicit inter-object relations
Comments:
Abstract:Constructing simulation-ready 3D scenes from multi-view captures is a key bottleneck for Embodied Artificial Intelligence, as downstream tasks require object-level structure, explicit inter-object relations, and physical plausibility. Existing approaches either rely on specialized capture hardware, suffer from single-view bias in object reconstruction, or yield layouts that are geometrically reasonable but physically inconsistent. We identify that the problem is not single-object reconstruction but cross-view relation fusion and physically plausible scene assembly. To address this challenge, we present ReScene, a framework that threads multi-view geometry throughout the pipeline as a unifying prior. Our method consists of two main components: HierView prioritizes reconstruction views based on semantic consistency and 3D coverage completeness, replacing the largest-mask heuristic that conflates image occupancy with object coverage; and Relation-Aware Assembly fuses multi-frame relation predictions from a vision-language model with geometric and room-shell priors into a confidence-weighted scene graph, enabling physically consistent scene assembly. ReScene sets a new state of the art across geometry, rendering, and perceptual quality on a set of ScanNet scenes, achieving a 17% reduction in Chamfer Distance and 26% in LPIPS over the strongest prior baseline, while running up to 10x faster than prior multi-view methods. Based on the reconstructed scenes, we also generate an embodied visual question answering dataset, on which fine-tuned Qwen-VL approaches the performance of strong closed-source models on several spatial reasoning tasks.
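Chamfer Distance, the geometry metric quoted above, compares two point sets by averaging nearest-neighbour distances in both directions. Conventions differ (squared or unsquared distances, sum or mean), and the abstract does not say which one ReScene reports, so the version below is only one common variant, with a hypothetical function name.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways.

    Brute force, O(len(a) * len(b)); uses unsquared Euclidean distance.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Under this definition a reconstruction that coincides with the reference scores 0, and a "17% reduction" means the score drops to 0.83 times that of the baseline.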
23. 【2606.28049】AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration
Link:https://arxiv.org/abs/2606.28049
Authors:Haotian Li,Yida Wang,Leyuan Wang,Jinshan Lai,Keyang Wang,Zonghao Guo,Qiang Ma,Liuyu Xiang,Jianwei Hu,Zhaofeng He
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:multimodal large language, shown strong potential, views remains under-evaluated, large language models, maintain geometrically consistent
Comments:
Abstract:In recent years, multimodal large language models (MLLMs) have shown strong potential for embodied intelligence, yet their ability to maintain geometrically consistent spatial understanding across heterogeneous views remains under-evaluated. Existing benchmarks largely focus on single-agent, single-view perception, leaving a gap in the systematic assessment of collaborative air-ground settings, where multi-scale observations are complementary but introduce scale mismatch, asymmetric occlusion, and reference-frame inconsistencies. We present AirGroundBench, a diagnostic benchmark for evaluating multi-view spatial intelligence in heterogeneous UAV-UGV collaboration. AirGroundBench is built from 11 high-fidelity simulated environments with 1,021 synchronized air-ground observation pairs, yielding approximately 62,000 dual-view, four-option single-choice visual question answering instances and 115 closed-loop vision-language navigation episodes. It covers 10 task types organized into four progressively demanding capability dimensions: spatial perception, cross-view alignment, spatial transformation and reasoning, and embodied decision-making. To support geometry-grounded evaluation and analysis, we provide structured spatial annotations, including cross-view object identities and metric 2D and 3D bounding boxes. Evaluations of 13 representative MLLMs under UAV-only, UGV-only, and dual-view input settings reveal consistent bottlenecks: models perform relatively well on spatial perception but struggle with cross-view alignment and transformation-intensive reasoning, and these deficits propagate to sequential decision-making in vision-language navigation. Although dual-view inputs provide measurable gains over single-view variants, a persistent gap from human performance remains, highlighting geometric consistency as a key limitation of current embodied MLLMs.
24. 【2606.28039】Mind the Gap: Quantifying the Domain Gap in Cross-Sensor Diffusion Super-Resolution
Link:https://arxiv.org/abs/2606.28039
Authors:Dawid Kopeć,Katarzyna Jabłońska,Wojciech Kozłowski,Maciej Zięba
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:spatial resolution gap, high-resolution satellite imagery, increased interest, bridge the spatial, spatial resolution
Comments: 26th International Conference on Computational Science
Abstract:Demand for high-resolution satellite imagery has increased interest in super-resolution (SR) to bridge the spatial resolution gap between freely available missions such as Sentinel-2 and commercial systems like PlanetScope. Because no sensor provides true paired low- and high-resolution observations, SR models are usually trained on synthetically degraded data, creating a domain gap on real cross-sensor imagery. In this work, we provide the first systematic study of how this synthetic-to-real mismatch affects the performance of modern diffusion-based SR models. Using a large, geometrically and temporally aligned dataset of Sentinel-2 and PlanetScope imagery, we evaluate five state-of-the-art diffusion architectures under controlled experimental settings. We also introduce LPIPS-Sat, a domain-adapted perceptual metric based on Sentinel-2 self-supervised features. Our results show two persistent challenges: synthetically trained models degrade sharply on real pairs, while models trained on real cross-sensor data exhibit optimisation difficulties and struggle to adapt to the physical and radiometric diversity. These findings highlight a key limitation of current SR and motivate methods that disentangle super-resolution from domain adaptation.
25. 【2606.28026】EMOSH: Expressive Motion and Shape Disentanglement for Human Animation
Link:https://arxiv.org/abs/2606.28026
Authors:Dongbin Zhang,Hao Liu,Binquan Dai,Kangjie Chen,Chuming Wang,Chen Li,Jing Lyu,Haoqian Wang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:digital avatar applications, avatar applications, essential for content, content creation, creation and digital
Comments: Accepted to ECCV 2026, Project Page: [this https URL](https://eastbeanzhang.github.io/EMOSH/)
Abstract:High-fidelity and expressive controllable human animation is essential for content creation and digital avatar applications. However, existing methods face a dilemma between expressiveness and disentanglement. Mainstream 2D pose-conditioned approaches suffer from "motion-shape entanglement", leading to the leakage of the driving subject's body shape. Conversely, methods relying on 3D priors (e.g., SMPL) achieve geometric disentanglement but struggle to capture facial expressions and complex gestures, resulting in rigid animations. To this end, we propose EMOSH, a novel framework for high-fidelity controllable human video generation. First, an Expressive Human Model (EHM) is introduced as the core control representation. By explicitly disentangling shape and pose parameters, we fundamentally resolve the body shape leakage issue. Alongside this, a robust motion tracker is designed to accurately estimate EHM parameters from video. Second, we propose a Coarse-to-Fine Hybrid Motion Injection strategy, enabling more fine-grained control over expressions and gestures. Furthermore, we introduce a Spatially-Aligned Conditioning mechanism to bridge the domain gap between training and inference, improving identity consistency. Extensive experiments demonstrate that EMOSH outperforms previous methods in both self-driven and cross-driven scenarios, producing high-fidelity videos with vivid expressions while maintaining shape disentanglement.
26. 【2606.28016】TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL
Link:https://arxiv.org/abs/2606.28016
Authors:Jing Wang,Xiangxin Zhou,Jiajun Liang,Kaiqi Liu,Wanyun Pang,Zhenyu Xie,Tianyu Pang,Xiaodan Liang
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:enable low-latency streaming, chunk-wise formulation makes, low-latency streaming generation, models enable low-latency, formulation makes temporal
Comments:
Abstract:Autoregressive (AR) video diffusion models enable low-latency streaming generation by synthesizing videos chunk by chunk with cached visual context, but this chunk-wise formulation makes temporal instruction following ambiguous. A single global prompt does not specify which sub-event should be realized in each chunk, while naively switching to step-wise prompts often leads to delayed reactions, blended step semantics, and error propagation across prompt transitions. These failures are difficult to address with supervised fine-tuning or distillation alone: SFT suffers from exposure bias, while rollout-based distillation still optimizes low-level denoising or teacher-distribution matching rather than directly enforcing action ordering and prompt-transition correctness. We address these challenges with TempAct, a planner-executor reinforcement learning framework that jointly optimizes temporal decomposition and step-conditioned execution for temporally plausible AR video generation. TempAct uses an LLM planner to explore span-aware step prompts that are executable by the video model, and trains an AR diffusion executor to follow these prompts under its own generated histories. Its key mechanism is hierarchical group exploration: candidate plans form planning groups, and each plan induces an execution group of multiple continuations from a shared visual context, enabling plan-level credit assignment for long-horizon temporal outcomes and executor-level credit assignment for prompt-switch behavior. We further design hierarchical rewards that combine plan-quality and full-video temporal feedback for the planner with local transition-level step-following rewards, aesthetic regularization, and KL constraints for the executor. Experiments on Self-Forcing and LongLive show that TempAct improves temporal consistency while preserving overall visual quality.
27. 【2606.28012】Curriculum-guided Change Detection Training: Toward Accurate Serac Fall Monitoring
Link:https://arxiv.org/abs/2606.28012
Authors:Arthur Dérédel,Carlos Crispim-Junior,Pierre Lemaire,Johan Berthet,Laure Tougne Rodet
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:registered multi-temporal images, aims to identify, Change Detection, identify semantic, registered multi-temporal
Comments: Preprint, 11 pages, 5 figures
Abstract:Change Detection (CD) aims to identify semantic or structural changes from nearly registered multi-temporal images. While recent advances in training methodologies have largely focused on semi-supervised learning and consistency regularization, alternative training paradigms remain underexplored. In particular, most deep CD methods rely on uniform sampling during training, implicitly assuming that all training samples contribute equally to the optimization process. However, such naive sampling can introduce noisy gradients and hinder robust representation learning. To address this limitation, we propose a curriculum learning framework tailored for change detection. Our approach investigates two complementary difficulty measures: the Solar Angular Gap (SAG), a physically grounded proxy for acquisition-condition variability, and the Structural Similarity Index Measure (SSIM), which evaluates appearance similarity between image pairs. Based on these criteria, the framework progressively introduces challenging samples during training, enabling models to learn robust representations in a coarse-to-fine manner. We evaluate our method on the challenging SeracFallDet benchmark, where results demonstrate consistent improvements of the proposed approach over standard uniform-sampling strategies for both pixel-based and object-based approaches. These results highlight the potential of curriculum learning to improve robustness in deep change detection. Importantly, our training framework is orthogonal to existing CD architectures, making it readily applicable to a broad range of methods.
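To make the curriculum idea concrete, here is a minimal sketch of SSIM-ranked sampling: image pairs are ordered by a similarity score and the training pool grows from the easiest fraction to the full set. Several things are assumptions rather than details from the paper: the single-window (global) SSIM on flattened pixel lists, the choice that higher SSIM means an easier pair, the linear pacing schedule, and every function and parameter name.

```python
import math
import statistics


def global_ssim(x, y, data_range=1.0):
    """SSIM computed once over two whole images given as flat pixel lists."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = statistics.pvariance(x, mx)
    vy = statistics.pvariance(y, my)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def curriculum_subset(pairs, epoch, total_epochs, start_frac=0.25):
    """Indices of the pairs usable at `epoch`, easiest (highest SSIM) first."""
    order = sorted(range(len(pairs)), key=lambda i: -global_ssim(*pairs[i]))
    # Linear pacing: start with `start_frac` of the data, reach 100% at the end.
    frac = min(1.0, start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1))
    keep = max(1, math.ceil(frac * len(pairs)))
    return order[:keep]
```

A training loop would call `curriculum_subset` once per epoch and draw batches only from the returned indices. Swapping in the Solar Angular Gap would only change the sort key.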
28. 【2606.27999】HumanMoveVQA: Can Video MLLMs reason about human movement in videos?
Link:https://arxiv.org/abs/2606.27999
Authors:Pulkit Gera,Faegheh Sardari,Asmar Nadeem,Valentina Bono,Padraig Boulton,Adrian Hilton,Armin Mustafa
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Multimodal Large Language, Large Language Models, fundamental bottleneck remains, coarse semantic labels, Multimodal Large
Comments:
Abstract:Despite the rapid advance of Multimodal Large Language Models (MLLMs) in high-level video understanding, a fundamental bottleneck remains: these models collapse complex human motion into coarse semantic labels. Existing benchmarks mostly focus on scene-centric events or local joint articulations, failing to probe global human motion in space over time (trajectory and orientation changes). We introduce HumanMoveVQA, the first comprehensive benchmark designed to evaluate global trajectory and orientation reasoning from an exocentric perspective. Our benchmark utilizes a first-frame anchored world coordinate system, preserving translation and rotation relative to a fixed starting point. We propose a scalable, multi-stage pipeline that lifts 2D video observations into world-consistent 3D motion tracks to generate over 10K structured question-answer pairs across seven reasoning categories, including motion aggregation, sequential ordering, and trajectory-level inference. Our extensive evaluation reveals a critical capability gap in state-of-the-art proprietary models on deep human motion understanding. However, we demonstrate that this is a learnable problem; by fine-tuning an open-source baseline with our targeted, world-consistent supervision, we achieve a significant improvement. This establishes a rigorous geometric foundation for developing next-generation, movement-aware video understanding models.
29. 【2606.27988】Latent Visual Diffusion Reasoning with Monte Carlo Tree Search
Link:https://arxiv.org/abs/2606.27988
Authors:Xirui Teng,Nan Xi,Junsong Yuan
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Analyzing fine-grained skill, Analyzing fine-grained, recognizing visual patterns, fine-grained skill activities, Carlo Tree Search
Comments: Accepted to ECCV 2026. Project page: [this https URL](https://github.com/XiruiTeng/LVDR_Official.git)
Abstract:Analyzing fine-grained skill activities (e.g., sports, surgery) requires not only recognizing visual patterns but also performing step-by-step visual reasoning that leads to the final judgment. While recent advances in action quality assessment have achieved remarkable progress in evaluating performance, existing models remain black boxes, where they lack the ability to explicitly reveal the reasoning processes underlying their judgments. To address this limitation, we propose Latent Visual Diffusion Reasoning (LVDR), a novel framework that integrates keypoint-guided Monte Carlo Tree Search (MCTS) to model and visualize the latent visual reasoning process. LVDR not only produces more accurate skill assessments but also uncovers the critical visual reasoning sequences that contribute to the final evaluation. Extensive experiments across four datasets spanning diverse sports and surgical domains demonstrate that LVDR achieves competitive quantitative performance while providing interpretable visual reasoning trajectories leading to the final predictions. Source codes and models can be found through the following link: this https URL.
30. 【2606.27978】Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation
Link:https://arxiv.org/abs/2606.27978
Authors:Jiayi Xu,Di He,Guolin Ke
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:avoiding discrete tokenization, separately pretrained tokenizer, Pixel-space continuous-token autoregressive, continuous-token autoregressive, avoiding discrete
Comments:
Abstract:Pixel-space continuous-token autoregressive (AR) generation directly models images as sequences of raw pixel patches, avoiding discrete tokenization or a separately pretrained tokenizer. However, it faces coupled challenges: high-dimensional patch generation causes large single-step errors, and teacher-forced training creates a train-inference gap that makes these errors accumulate across AR steps. Existing fixes such as $x$-prediction and input noise injection only partially mitigate these issues. Exact rollout training better matches inference-time conditions, but is impractical due to prohibitively slow sequential sampling. We propose Parallel Rollout Approximation (PRA), a scalable framework that addresses both challenges jointly. PRA generates low-dimensional intermediate states instead of high-dimensional pixel patches, then maps them back to pixel-space tokens with a pixel decoder, preserving a pixel-in, pixel-out AR interface. It also constructs inference-like pixel inputs through the same intermediate-state-to-pixel path used at inference, independently across positions, approximating the pixel-feedback interface encountered during inference-time rollout while retaining parallel teacher-forced training. On class-conditional ImageNet-1K generation at $256\times256$ resolution, PRA-S with 135M parameters achieves an FID of 2.58, surpassing the previous billion-scale pixel-space AR result of 3.60. Scaling to PRA-L with 511M parameters further improves FID to 1.94, establishing a new state of the art among pixel-space AR models. Beyond generation, PRA achieves higher ImageNet classification probing accuracy than other AR and diffusion baselines, suggesting its potential for unified pixel-space image generation and understanding.
31. 【2606.27974】ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering
Link:https://arxiv.org/abs/2606.27974
Authors:ZhengXian Wu,Hangrui Xu,Kai Shi,Zhuohong Chen,Yunyao Yu,Chuanrui Zhang,Zirui Liao,Jun Yang,Zhenyu Yang,Haonan Lu,Haoqian Wang
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:Knowledge-based Visual Question, Visual Question Answering, Knowledge-based Visual, Question Answering, Visual Question
Comments:
Abstract:Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at this https URL.
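The iterative choose-a-tool loop described above can be sketched in a few lines. This is only an illustration of the control flow (pick image search, text search, or stop; respect per-tool budgets; skip already-seen results). The `policy` callable, the tool names, the per-tool budget dictionary, and deduplication by document ID are all assumptions; the abstract does not give ProMSA's actual interfaces.

```python
def run_search_agent(policy, tools, budgets):
    """Run a budgeted, deduplicating retrieval loop.

    policy(evidence) -> (action, query), where action is a key of `tools`
    or the string "stop". Returns the collected evidence and per-tool usage.
    """
    evidence = []
    seen = set()
    used = {name: 0 for name in tools}
    while True:
        action, query = policy(evidence)
        # Stop on request, or when the chosen tool has no budget left.
        if action == "stop" or used[action] >= budgets[action]:
            break
        used[action] += 1
        for doc_id in tools[action](query):
            if doc_id not in seen:  # deduplicate across all tool calls
                seen.add(doc_id)
                evidence.append(doc_id)
    return evidence, used
```

In a real system the policy would be the trained multimodal model and the tools would call actual retrievers. Stub functions are enough to exercise the loop.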
32. 【2606.27964】Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control
Link:https://arxiv.org/abs/2606.27964
Authors:Haoyuan Wang,Yabo Chen,Haibin Huang,Chi Zhang,Xuelong Li
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Building interactive world, models requires generating, requires generating realistic, Building interactive, generating realistic videos
Comments:
Abstract:Building interactive world models requires generating realistic videos while maintaining controllable dynamics over long horizons. Autoregressive video generation offers a scalable foundation, but suffers from error accumulation and temporal degradation during extended rollouts. This issue is further amplified under heterogeneous controls such as human motion and camera trajectories, which may interfere and destabilize a pretrained video prior, while existing methods often trade off controllability and visual quality. We propose "Directing the World", a fast autoregressive framework for controllable world-model video generation with compositional human-motion and camera-trajectory control. Our key idea is to decouple control learning while preserving a unified autoregressive video prior. We introduce a Fast-Slow Memory training strategy to stabilize long-horizon rollout learning and improve convergence. For human motion control, we design a t-guided Dynamic Projection mechanism and a refined Motion-CFG strategy, enabling temporally smooth and accurate motion alignment without degrading visual fidelity, and supporting multi-person scenarios. After learning a robust motion prior, we introduce a second-stage camera-trajectory control module to compose human dynamics with viewpoint changes for coherent world exploration. We further construct a large-scale dataset with synchronized video, text, human-motion, and camera-trajectory annotations, organized into motion-centric and camera-centric subsets for decoupled training. Extensive experiments show stable long-horizon generation with precise controllability and high visual quality. See more at this https URL.
33. 【2606.27947】Understanding How MLLMs Describe Artworks Using Token Activation Maps
Link:https://arxiv.org/abs/2606.27947
Authors:Nicola Fanelli,Pasquale De Marinis,Raffaele Scaringi,Eva Cetinic,Gennaro Vessio,Giovanna Castellano
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Comments: Accepted at PRESTIGE workshop at ICPR 2026
Abstract:Multimodal Large Language Models (MLLMs) describe artworks with remarkable fluency, yet the visual reasoning behind their outputs remains opaque. When an MLLM names a style, identifies a subject, or recognizes an iconographic symbol, does it ground each claim in the relevant region of the canvas, draw on an undifferentiated visual signal, or rely primarily on textual priors? We study this using the Token Activation Map (TAM), which produces, for each generated token, a heatmap isolating the visual evidence specific to that token from prior-context interference. Applying TAM to a curated set of paintings spanning multiple periods and genres, we analyze grounding patterns across five semantically distinct token categories: common visual objects, style descriptors, metadata, iconographic tokens, and affective expressions. We find that visual grounding varies substantially with token semantics. We further show that MLLMs attempt to identify artworks and artists, achieving higher accuracy in artist attribution than in title prediction, where hallucinations are more frequent. Finally, we compare TAM with SAM 3 open-vocabulary segmentation. To ensure reproducibility, we release our code, experimental configurations, prompts, and qualitative results on the project page at this https URL.
34. 【2606.27935】Controllable Histopathology Image Synthesis with Training-free Structural Initialization and Textural Modulation
Link:https://arxiv.org/abs/2606.27935
Authors:Yuheng Qiu,Jingyi Luo,Chenfei Ye,Ting Ma,Jianfeng Cao
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:demonstrated remarkable success, Deep learning, histopathology image analysis, learning has demonstrated, demonstrated remarkable
Comments:
Abstract:Deep learning has demonstrated remarkable success in high-throughput histopathology image analysis. However, the performance of learning-based models critically depends on the quality and size of annotations by expert pathologists, which is a resource-intensive and time-consuming process. To address the limitations of data scarcity and annotation burden, several methods have been proposed to synthesize paired histopathology data. Nevertheless, these frameworks typically still require annotation data, albeit in reduced quantities, to impose structural constraints during training. In this work, we present CHIS, a plug-in framework that guides the sampling trajectory of a pretrained diffusion model through two key stages: structural initialization at the start and textural modulation during generation. The initial noise state is refined by fusing the phase information from a prior mask with the amplitude of Gaussian noise in the frequency domain, yielding a structurally informed starting point. During the reverse diffusion process, we adaptively modulate both coarse-grained and fine-grained textures at different wavelet decomposition levels. This enables a diffusion model pretrained solely on unlabeled images to generate outputs that align with prior structural masks while preserving the reference tissue style. We conducted extensive experiments demonstrating the superiority of CHIS in generation fidelity and its substantial benefits for downstream segmentation tasks. Code is available at this https URL.
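The structural-initialization step (phase from a prior mask, amplitude from Gaussian noise) can be illustrated on a 1-D toy signal. The sketch below is not the CHIS implementation: it uses a naive DFT from the standard library instead of a 2-D FFT on diffusion latents, and the function names are made up for this example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for tiny inputs)."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
            for k in range(n)]


def idft(spec):
    """Inverse of `dft`."""
    n = len(spec)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spec)) / n
            for t in range(n)]


def fuse_phase_amplitude(mask, noise):
    """Combine the noise's spectral amplitude with the mask's spectral phase."""
    mask_spec, noise_spec = dft(mask), dft(noise)
    fused = [abs(z) * cmath.exp(1j * cmath.phase(m))
             for m, z in zip(mask_spec, noise_spec)]
    # Both inputs are real, so the fused spectrum stays conjugate-symmetric
    # and the imaginary parts of the inverse are only rounding error.
    return [c.real for c in idft(fused)]
```

Because spatial layout is carried mostly by phase, the returned signal keeps the noise's energy in each frequency while inheriting where the mask places its structure, which is the intuition behind a "structurally informed starting point".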
35. 【2606.27926】Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing
Link:https://arxiv.org/abs/2606.27926
Authors:Can Li,Ting Zhang,Junbo Zhao,Hua Huang
Categories:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords:Geometry Problem Solving, Geometry Problem, Problem Solving, Solving have increasingly, combining neural intuition
Comments:
Abstract:Geometry Problem Solving has increasingly adopted the neuro-symbolic paradigm, combining neural intuition with symbolic rigor. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries. To address these, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning into a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes, proving that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, offering profound insights into how neural agents can be grounded by formal systems to achieve verifiable problem-solving capabilities.
36. 【2606.27923】Home3D 1.0: A High-Fidelity Image-to-3D Asset Generation System for Interior Design
Link: https://arxiv.org/abs/2606.27923
Authors: Yiyun Fei, Guoqiu Li, Jin Song, Chuqiao Wu, Delong Wu, Hong Wu, Ziru Zeng, Haohui Chen, YinDong Kong, Jing Li, Qi Wu, Feng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: single reference image, targeting interior design, produces high-quality, reference image, targeting interior
Comments: 18 pages, 10 figures, 2 tables; technical report
Abstract:We present Home3D 1.0, a modular image-to-3D generation system that produces high-quality 3D assets from a single reference image, targeting interior design and e-commerce applications. Given a photograph of a furniture or decor item, the system outputs a mesh with physically-based rendering (PBR) materials, and the mesh can be decomposed into material-specific components. The pipeline is organized into four tightly coupled modules: Geometry reconstructs a watertight mesh through latent SDF modelling with a geometry VAE and a coarse-to-fine flow-matching DiT; Texture predicts multiview albedo observations, reprojects them onto the mesh, and completes unseen surface regions with a 3D texture field; Material uses MatWeaver to obtain component masks through video-based segmentation and UV-space voting, then retrieves and bakes PBR maps from a curated material library through hierarchical multi-modal matching; and Parts generates material-editable semantic part meshes with a PartVAE and PartDiT, decoding multi-head part-specific SDF fields in one pass. Each module is evaluated independently with dedicated metrics, highlighting both the current system capability and the remaining gaps toward broader deployment.
37. 【2606.27922】Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding
Link: https://arxiv.org/abs/2606.27922
Authors: Shuimu Chen, Yuteng Chen, Yuanshen Guan, Zebang Cheng, Zeyu Zhang, Shengqian Qin, Bin Xia, Jiaran Li, Wenming Yang, Fei Ma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Current multimodal reflection, Current multimodal, multimodal reflection mechanisms, understanding predominantly rely, long video understanding
Comments: 18 pages, 6 figures, ECCV
Abstract:Current multimodal reflection mechanisms for long video understanding predominantly rely on closed-loop self-reflection within internal parameters. Lacking objective external evidence, models are frequently trapped in blind confidence and often fail to correct errors. Furthermore, applying reinforcement learning to multi-stage reflection pipelines introduces severe policy coupling, which is exacerbated by a critical scarcity of dedicated training data. To address these limitations, this work proposes Reflect-R1, the first Evidence-Driven self-correction framework for long video understanding. The framework constructs a three-stage pipeline consisting of intuition, verification, and arbitration. By dynamically retrieving objective visual evidence to verify initial intuitions and autonomously executing multiple temporal searches to resolve conflicts, it completely breaks the hallucination loop. To overcome policy coupling, we design a stage-decoupled reinforcement learning algorithm named SD-GRPO that independently computes advantage functions across different reasoning stages. Concurrently, we construct a dataset of 120K samples to bridge the training data gap. Extensive experiments on benchmarks such as VideoMME and LongVideoBench demonstrate that Reflect-R1 achieves state-of-the-art performance. Our method significantly improves the genuine rectification rate and enables authentic self-correction strictly grounded in objective evidence.
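To make the stage-decoupled idea a little more concrete, here is a minimal sketch that computes group-normalized advantages (in the style of GRPO) separately for each reasoning stage. The reward layout (one reward per rollout per stage), the function name `stagewise_group_advantages`, and the `eps` constant are assumptions made for illustration; the abstract does not specify how SD-GRPO defines its rewards or advantages.

```python
import numpy as np

def stagewise_group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a rollout group, independently per stage.

    rewards: array of shape (group_size, num_stages); this layout is a
    hypothetical choice, not taken from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)  # per-stage group mean
    std = rewards.std(axis=0, keepdims=True)    # per-stage group std
    return (rewards - mean) / (std + eps)

# Four rollouts, three stages (e.g. intuition, verification, arbitration).
rewards = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])
advantages = stagewise_group_advantages(rewards)
```

Because each column is normalized on its own, a rollout that does well in one stage and poorly in another receives opposite-signed advantages for those stages rather than a single blended score.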
38. 【2606.27918】Every Step of the Way: Video-based Parkinsonian Turning Step Counting
Link: https://arxiv.org/abs/2606.27918
Authors: Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: reflects motor dysfunction, directly reflects motor, Parkinson disease, symptom of Parkinson, complete a turn
Comments:
Abstract:As a prominent symptom of Parkinson's disease (PD), turning impairment is evaluated through parameters such as turning angle, duration, and particularly, the number of steps required to complete a turn, which directly reflects motor dysfunction. Accurate step counting is challenging due to variability in real-world turning movements and atypical shuffling patterns in parkinsonian gait. Existing methods are predominantly wearable-based, requiring users to wear and manage dedicated devices, which can be inconvenient for continuous daily use. To address this, we propose a passive, video-based framework that estimates step count in a coarse-to-fine manner using diverse motion representations. Specifically, an initial step count is estimated from foot movement signals derived from 3D human mesh recovery, providing high-level motion structures. To incorporate fine-grained motion details, a motion encoder learns complementary gait dynamics from mesh and optical flow to refine the initial estimate. In this process, coarse foot movement signals query the pixel-level motion cues via cross attention to capture subtle parkinsonian gait dynamics. To handle varying video lengths, we partition each video into clips and integrate clip-wise motion embeddings via multiple instance learning (MIL) for step count residual prediction. Extensive experiments show our method consistently outperforms existing step counting methods on real-world PD turning datasets.
39. 【2606.27905】There and Back Again: A Flexible-Frame Transformer for Multi-Exposure Fusion
Link: https://arxiv.org/abs/2606.27905
Authors: Lishen Qu, Yao Liu, Shihao Zhou, Jie Liang, Hui Zeng, Lei Zhang, Jufeng Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rich scene content, Multi-exposure fusion, conventional cameras closer, brings the dynamic, human vision
Comments: Accepted by ECCV 2026
Abstract:Multi-exposure fusion (MEF) brings the dynamic range of conventional cameras closer to that of human vision, producing images with rich scene content. Given the large variability in scene luminance, exposure strategies often require different numbers of frames to capture the full radiance range faithfully. However, conventional MEF techniques are typically designed for a fixed number of inputs, forcing deployment systems to maintain separate models for different frame-count requirements, which undermines deployment efficiency. To address this limitation, we propose FreeMEF, the first flexible-frame transformer for MEF that seamlessly accommodates varying numbers of input exposures without retraining or architectural changes. The proposed approach consists of two key modules. First, we introduce a recurrent state space module (RSSM) that sequentially fuses features from arbitrary sequences via adaptive alignment and state-space recurrent modeling, thereby providing global information guidance for the subsequent restoration. Second, we devise a global feature guided block (GFGB) incorporating an extremity-aware hybrid attention (EAHA) and an affine-injection feed-forward network (AFFN), which effectively resolves the similarity paradox while simultaneously optimizing contrast and brightness regulation. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, which performs favorably against state-of-the-art methods both quantitatively and qualitatively.
40. 【2606.27900】Long-Term Prediction of Local and Global Human Motion with Occlusion Recovery
Link: https://arxiv.org/abs/2606.27900
Authors: Qiaoyue Yang, Sven Heutger, Christopher Niemann, Magnus Jung, Ayoub Al-Hamadi, Sven Wachsmuth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: three-dimensional full-body movement, Human motion describes, describes the three-dimensional, three-dimensional full-body, full-body movement
Comments: Advances in Visual Computing (ISVC 2025)
Abstract:Human motion describes the three-dimensional full-body movement of a person. Anticipating such motion holds significant relevance across a wide range of application domains such as human-robot interaction, autonomous driving, animation, and healthcare. In recent research, spatial and temporal dependencies are modeled by bidirectional attention mechanisms. These typically anticipate human motion in an autoregressive manner which could cause an accumulation of errors over time. As a consequence, they solely focus on local pose forecasting. To address these limitations, we propose a non-autoregressive transformer based on spatio-temporal attention, and train it not only for local pose anticipation, but also for global motion prediction in space. Furthermore, to enhance its applicability in real-world scenarios, our model is also trained to recover missing joints due to occlusions, and is capable of processing varying lengths of history observations. Our code is publicly available at this https URL.
41. 【2606.27897】A Multi-Attribute Latent Space for Visual Analysis of Watches
Link: https://arxiv.org/abs/2606.27897
Authors: Kai Lawonn, Tobias Günther, Monique Meuschke
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: exploring large wristwatch, large wristwatch collections, interactive visual-analysis system, embedding model, exploring large
Comments:
Abstract:We present a design rationale, embedding model, and interactive visual-analysis system for exploring large wristwatch collections through heterogeneous visual and semantic attributes. The system addresses a common limitation of catalog and e-commerce interfaces: users can filter by metadata, but they receive little support for open-ended exploration of visual similarity, stylistic alternatives, and mixed aesthetic-functional criteria. We therefore represent watches with separate attribute graphs for dial color and dial design, while using watch type as an explicit semantic organizer. Dials are segmented with a U-Net, watch types are predicted with a Vision Transformer, colors are represented through a shared CIELAB reference palette, and dial structure is described with a gradient-based image descriptor. We extend UMAP by combining attribute-specific neighborhood graphs in a unified probabilistic objective and by adding a class-aware layout term that separates global type structure from local visual neighborhoods. The resulting map is exposed in an interactive interface with spatial navigation, metadata filtering, detail inspection, and search-by-example insertion. We evaluate the approach through parameter analysis, runtime measurements, and a qualitative pilot study with watch experts and novices. The results suggest that the system supports discovery and comparison, while also revealing limitations in scalability assessment, search-by-example validation, and the need for broader domain studies. We explicitly discuss these limitations and derive design implications for multi-attribute latent-space visualization across heterogeneous visual collections.
42. 【2606.27880】OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation
Link: https://arxiv.org/abs/2606.27880
Authors: Zhaotong Yang, Ying Tai, Jiahui Zhan, Yu Zheng, Jianjun Qian, Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fashion generation integrates, task-specific adaptation costs, virtual try-on, try-on and garment, garment reconstruction
Comments: Accepted by ECCV 2026
Abstract:Unified fashion generation integrates tasks like virtual try-on and garment reconstruction into a single model to reduce task-specific adaptation costs. However, naive parameter sharing across semantically distinct tasks induces negative transfer through severe inter-task gradient conflict. We propose OrthoTryOn, a unified framework mitigating this interference within a shared Low-Rank Adaptation (LoRA) module. Its Orthogonal Subspace Projection (OSP) applies task-specific orthogonal rotations to bottleneck features, mapping them into decorrelated coordinate frames. To address residual semantic coupling at inference time, we further propose Fisher-guided Negative Guidance (FNG), a parameter-free strategy that utilizes diagonal Fisher information to quantify inter-task sensitivity overlap and explicitly repels generation trajectories from the most confusable task via Classifier-Free Guidance. Extensive experiments demonstrate that OrthoTryOn avoids the severe performance degradation typical of naive unified training and even surpasses independently trained task-specific models, achieving state-of-the-art results across multiple benchmarks while generalizing robustly across diverse diffusion backbones. Code is available at this https URL.
43. 【2606.27876】SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion
Link: https://arxiv.org/abs/2606.27876
Authors: Haoyu Zhang, Meng Liu, Qianlong Xiang, Kun Wang, Yaowei Wang, Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: unmanned aerial vehicle, low-altitude UAV spatial, UAV spatial intelligence, low-altitude unmanned aerial, low-altitude UAV benchmark
Comments: 10 pages, 7 figures
Abstract:Spatial intelligence is essential for low-altitude unmanned aerial vehicle (UAV) perception, collaboration, and navigation. However, existing UAV benchmarks often emphasize image-level recognition, single-view understanding, or narrow answer formats, leaving 3D spatial inference, multi-view collaboration, scene dynamics, and diverse task formulations insufficiently evaluated. To address these gaps, we introduce SpatialUAV, a real low-altitude UAV benchmark comprising 4,331 curated instances across 14 fine-grained task types, covering semantic discrimination, spatial relation, aerial–aerial collaboration, aerial–ground collaboration, and motion understanding. SpatialUAV organizes all samples into a unified visual-input–question–answer schema, while supporting seven input configurations and nine answer formats, including option labels, region identifiers, geometric values, cross-view correspondences, and free-form motion descriptions. To ensure reliable and grounded evaluation, our data construction pipeline integrates detector-assisted regions, depth supervision, metadata-derived rules, extensive manual annotation, blind filtering, and multi-turn human validation, together with task-specific metrics for heterogeneous outputs. Evaluating representative vision-language models across three categories, we show that current models remain far from human-level performance, with pronounced bottlenecks in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding. These results offer empirical guidance for advancing low-altitude UAV spatial intelligence. Code and data are available at this https URL.
44. 【2606.27864】A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
Link: https://arxiv.org/abs/2606.27864
Authors: Tīkun Ông, Georg Bökman
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Vision transformers, equivariant, vision transformers equivariant, Vision, dominant architecture
Comments:
Abstract:Vision transformers have become a dominant architecture for visual recognition. However, standard models do not explicitly encode the planar symmetries that arise in many vision domains. We introduce a family of vision transformers equivariant to arbitrary discrete subgroups of $\mathrm{O}(2)$, providing a unified framework that generalizes prior flipping- and $D_4$-equivariant transformer architectures. Our construction yields equivariant analogues of the core transformer components, together with expressivity guarantees for the resulting layers. In particular, we show that whenever $H \le G$, the class of $G$-equivariant ViTs embeds naturally into the class of $H$-equivariant ViTs. We also prove that, in the single-head setting, the corresponding equivariant self-attention layer realizes every $G$-equivariant self-attention map representable by ordinary self-attention. We further construct a $D_6$-equivariant model based on hexagonal patches, making the architecture compatible with six-fold rotational symmetries. We evaluate the resulting models on the PatternNet aerial image dataset in artificially data-scarce regimes across subgroups of $D_4$ and $D_6$. Our experiments compare two equivariant attention mechanisms and analyze how the choice of homogeneous-space configurations used in the nonlinearities affects performance. Preliminary results under matched parameter budgets indicate that equivariance can improve recognition accuracy, motivating further study of how discrete symmetry groups shape transformer-based visual recognition models.
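For readers new to equivariance, the following toy check shows what the property means for the dihedral group $D_4$ (four rotations, each with and without a flip): a map $f$ is equivariant if transforming the input and then applying $f$ gives the same result as applying $f$ and then transforming the output. The example uses a 3x3 circular convolution with a $D_4$-symmetrized kernel, which is a standard textbook construction and not the transformer architecture from the paper; all function names are hypothetical.

```python
import numpy as np

def d4_transforms(a):
    """All eight D4 images of a square array: four rotations of a and of its mirror."""
    return [np.rot90(b, k) for b in (a, np.fliplr(a)) for k in range(4)]

def symmetrize_kernel(kernel):
    """Average a kernel over D4 so that it is unchanged by rotations and flips."""
    return np.mean(d4_transforms(kernel), axis=0)

def circular_conv3x3(image, kernel):
    """3x3 cross-correlation with wrap-around (circular) boundary handling."""
    out = np.zeros_like(image, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * np.roll(image, shift=(i - 1, j - 1), axis=(0, 1))
    return out

def is_d4_equivariant(image, kernel):
    """Check f(g.x) == g.f(x) for every g in D4."""
    transformed_outputs = d4_transforms(circular_conv3x3(image, kernel))
    return all(
        np.allclose(circular_conv3x3(g_image, kernel), g_output)
        for g_image, g_output in zip(d4_transforms(image), transformed_outputs)
    )

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
raw_kernel = rng.normal(size=(3, 3))

symmetric_ok = is_d4_equivariant(image, symmetrize_kernel(raw_kernel))
raw_ok = is_d4_equivariant(image, raw_kernel)
```

The symmetrized kernel passes the check while a generic random kernel does not. Restricting the check to a subgroup (for example, rotations only) can only make it easier to pass, which is the intuition behind the statement that $G$-equivariant models are also $H$-equivariant whenever $H \le G$.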
45. 【2606.27862】ScaLe-INR: Scale and Learn Implicit Neural Representations
Link: https://arxiv.org/abs/2606.27862
Authors: Buwaneka Epakanda, Athulya Ratnayake, Pandula Thennakoon, Mario De Silva, Avishka Ranasinghe, Roshan Godaliyadda, Parakrama Ekanayake
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Implicit Neural Representations, Implicit Neural, Neural Representations, multilayer perceptrons excel, modeling continuous signals
Comments: Submitted as a conference paper to NeurIPS 2026
Abstract:Implicit Neural Representations (INRs) parameterized by multilayer perceptrons excel at modeling continuous signals. However, a key challenge persists as INRs fundamentally suffer from spectral bias and information cross-talk. When a single network attempts to capture multi-scale phenomena, high-frequency weight updates destructively interfere with the underlying low-frequency structural approximation. We introduce Scale and Learn INR (ScaLe-INR), a novel multi-branch architecture that resolves these limitations by explicitly matching the signal's frequency spectrum with the optimal operating region of the INR. Drawing upon the Fourier inverse scaling theorem, we demonstrate that applying directional coordinate scaling expands a network's representational bandwidth along specific spatial axes. To mathematically enforce functional disentanglement and minimize task-specific information leakage between branches, we propose a Directional Edge Guidance Loss, a spatially-conditioned sparsity prior derived from ground-truth gradients. By constraining the high-frequency branches to act as strict, localized edge-filters, ScaLe-INR eliminates spectral cross-talk, accelerates convergence, and achieves high-fidelity signal reconstruction on complex multi-scale topologies. We evaluate ScaLe-INR across diverse reconstruction and inverse tasks, demonstrating substantial performance gains over existing state-of-the-art (SOTA) methods. The proposed architecture improves upon the nearest baselines by +5.16 dB in image reconstruction and +0.65 dB in image denoising. Furthermore, it achieves 50.02 dB on audio reconstruction and 0.999 IoU (Intersection over Union) on 3D reconstruction, outperforming all SOTA models.
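The Fourier scaling property behind the coordinate-scaling argument says that if $f(x)$ has transform $F(\omega)$, then $f(ax)$ has transform $\frac{1}{|a|}F(\omega/a)$, so feeding a fixed function coordinates scaled by $a > 1$ stretches its spectrum by a factor of $a$. The sketch below checks this numerically on a plain sinusoid; it illustrates the property only and does not reproduce the ScaLe-INR architecture. The helper name and the toy values (8 cycles, scale factor 4) are assumptions.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (cycles per unit) with the largest FFT magnitude."""
    spectrum = np.abs(np.fft.rfft(signal))
    spectrum[0] = 0.0  # ignore the DC component
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def base_function(t, cycles=8.0):
    """A fixed band-limited function standing in for a low-frequency network."""
    return np.sin(2.0 * np.pi * cycles * t)

num_samples = 1024
x = np.arange(num_samples) / num_samples  # coordinates in [0, 1)
scale = 4.0                               # coordinate scaling factor a

freq_plain = dominant_frequency(base_function(x), num_samples)
freq_scaled = dominant_frequency(base_function(scale * x), num_samples)
```

Here the same function evaluated on scaled coordinates peaks at four times the original frequency, which is the sense in which scaling one coordinate axis widens the band a fixed-capacity model can cover along that axis.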
46. 【2606.27831】Hippocampus-DETR: An Explicit Memory Object Detection Framework Based on Hippocampus Modeling
Link: https://arxiv.org/abs/2606.27831
Authors: Zhaoning Shi, Bo Ma, Hao Xu, Zepeng Yang, Bo Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: hippocampal memory modeling, biological hippocampal memory, paper addresses, addresses the lack, lack of explicit
Comments:
Abstract:This paper addresses the lack of explicit memory mechanisms in current object detection models and proposes Hippocampus-DETR, a novel detection framework based on biological hippocampal memory modeling. This framework integrates a hippocampal memory network module, HipNet, into the DETR architecture and systematically simulates the anatomical structure and functional organization of hippocampal subregions, including the entorhinal cortex, dentate gyrus, CA3, CA1, and subiculum. Through this design, Hippocampus-DETR realizes pattern separation, pattern completion, importance filtering, and information integration of visual encoding features. During training, different memory submodules are optimized using a layer-wise training strategy, ultimately forming a memory system with memory retrieval and completion capabilities. Experimental results demonstrate that Hippocampus-DETR achieves higher detection accuracy than current mainstream models. More importantly, models equipped with this framework also exhibit excellent generalization ability and data efficiency in tasks such as few-shot image classification, multimodal feature construction, and image restoration. Subsequent experiments further validate the functional necessity and internal interpretability of each memory submodule. This study not only provides a novel object detection framework, but also offers a feasible technical pathway for integrating neurocognitive mechanisms with deep learning models, highlighting its significant value in improving model learning efficiency and task robustness. The project is available at this https URL.
47. 【2606.27829】CSD: Content-aware Speculative Decoding for Efficient Image Generation
Link: https://arxiv.org/abs/2606.27829
Authors: Mingcheng Wang, Junbo Qiao, Yunchen Li, Lingfu Jiang, Wei Li, Jie Hu, Jiao Xie, Zhou Yu, Xinghao Chen, Guixu Zhang, Shaohui Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: key solution, autoregressive image generation, Speculative decoding, speculative decoding algorithm, content-aware speculative decoding
Comments:
Abstract:Speculative decoding (SD) has emerged as a key solution to accelerate the inference of autoregressive models. However, in the field of image generation, it faces the challenge of low acceptance rates, and directly relaxing its criteria leads to degradation in image quality. In this paper, we propose a novel content-aware speculative decoding algorithm, termed CSD, which integrates an entropy-based probability relaxation mechanism with an optimal resampling strategy to enhance the inference efficiency of autoregressive image generation. By leveraging the informational uncertainty inherent in different regions of an image, CSD dynamically adjusts the acceptance probability of candidate tokens, increasing the acceptance rate in low-detail areas to accelerate generation. Moreover, a distribution alignment filter is introduced to keep the output distribution aligned with the target model, which significantly improves generative quality. Experiments conducted on Lumina-mGPT and Janus-Pro demonstrate the superiority of the proposed CSD. Our source code is available at this https URL.
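For context, the baseline that CSD relaxes is the standard speculative sampling rule: a token drafted from distribution $q$ is accepted with probability $\min(1, p(x)/q(x))$ under the target distribution $p$, and on rejection a replacement is drawn from the normalized residual $\max(p - q, 0)$. The sketch below implements only this standard rule with toy distributions; the entropy-based relaxation and the distribution alignment filter described in the abstract are not reproduced, and the function name is hypothetical.

```python
import numpy as np

def speculative_accept(p_target, q_draft, draft_token, rng):
    """One step of standard speculative sampling.

    Returns (token, accepted). draft_token must have been sampled from
    q_draft, so q_draft[draft_token] > 0.
    """
    p = np.asarray(p_target, dtype=float)
    q = np.asarray(q_draft, dtype=float)
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return int(draft_token), True
    # Rejection implies p != q, so the residual has positive mass.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target distribution (toy values)
q = np.array([0.2, 0.5, 0.3])  # draft distribution (toy values)

counts = np.zeros(3)
for _ in range(20000):
    draft = rng.choice(3, p=q)
    token, _accepted = speculative_accept(p, q, draft, rng)
    counts[token] += 1
empirical = counts / counts.sum()
```

The empirical output frequencies track `p` even though every draft came from `q`, which is the guarantee that a naive relaxation of the acceptance test would give up.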
48. 【2606.27828】Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning
Link: https://arxiv.org/abs/2606.27828
Authors: Hohin Kwan, Hongyu Li, Ray Zhang, Manyuan Zhang, Xianghao Kong, Anyi Rao, Jiahao Xie, Si Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, Recent interest, large language models, raises a central, central question
Comments:
Abstract:Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.
49. 【2606.27818】Scalable and Differentiable Point-Cloud Registration Using Maximum Mean Discrepancy
Link: https://arxiv.org/abs/2606.27818
Authors: Rixon Crane, Fahira Afzal Maken, Nicholas Lawrance, Stanislav Funiak, Kasra Khosoussi, Ming Xu, Russell Tsuchida
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: linear computational complexity, number of points, correspondence-free approach, approach to point-cloud, linear computational
Comments: Accepted at ICML 2026
Abstract:We present MMD-Reg, a novel correspondence-free approach to point-cloud registration that is differentiable and has linear computational complexity in the number of points. We model registration as a nonlinear least-squares problem based on the Maximum Mean Discrepancy, approximated using random Fourier features. The resulting objective can be solved efficiently with standard methods such as Levenberg-Marquardt, and the solution is differentiable via the implicit function theorem. This allows MMD-Reg to be used as a differentiable optimization layer within end-to-end trainable models, supporting registration under challenging conditions such as poor initial alignment and partial overlap. We demonstrate this Neural MMD-Reg formulation by integrating the layer with a set transformer, training the resulting model in supervised and unsupervised settings, and comparing its performance against recent learning-based methods. We also evaluate standalone MMD-Reg, comparing its accuracy and scalability against widely used non-learning-based registration methods.
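The linear-cost claim follows from how random Fourier features work: each point is mapped to a fixed-length feature vector, so the squared MMD between two clouds reduces to the squared distance between two feature means, with no pairwise kernel matrix. Below is a minimal sketch of that estimator for a Gaussian kernel. It shows only the objective value, not the Levenberg-Marquardt solver or the differentiable layer from the paper, and the bandwidth, feature count, and function names are assumptions.

```python
import numpy as np

def rff_features(points, freqs, phases):
    """Random Fourier features approximating a Gaussian kernel."""
    num_features = freqs.shape[1]
    return np.sqrt(2.0 / num_features) * np.cos(points @ freqs + phases)

def mmd_squared(x, y, freqs, phases):
    """Squared MMD estimate; cost is linear in the number of points."""
    mean_x = rff_features(x, freqs, phases).mean(axis=0)
    mean_y = rff_features(y, freqs, phases).mean(axis=0)
    diff = mean_x - mean_y
    return float(diff @ diff)

rng = np.random.default_rng(0)
dim, num_features, bandwidth = 3, 256, 1.0  # illustrative choices
freqs = rng.normal(scale=1.0 / bandwidth, size=(dim, num_features))
phases = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

source = rng.normal(size=(500, dim))
shifted = source + np.array([1.0, 0.0, 0.0])  # misaligned copy of the cloud

mmd_same = mmd_squared(source, source, freqs, phases)
mmd_shifted = mmd_squared(source, shifted, freqs, phases)
```

A registration method in this spirit would search over rigid transforms of one cloud to drive this value down; since the feature-mean difference is a vector of residuals, the objective has the least-squares form that the abstract mentions.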
50. 【2606.27794】Text as Illumination: Spatial Contrastive Retinex Learning for Language-guided Medical Image Segmentation
Link: https://arxiv.org/abs/2606.27794
Authors: Jian Shi, Cheng Zhen, Pingping Zhang, Rui Xu, Yanan Lv, Yili Ma, Huan Bi, Haojie Li, Huchuan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Language-guided Medical Image, Medical Image Segmentation, Language-guided Medical, Medical Image, shown great potential
Comments: Accepted by MICCAI 2026. More modifications may be performed
Abstract:Language-guided Medical Image Segmentation (LMIS) has shown great potential to improve the delineation of anatomical structures and lesions by integrating clinical textual information. Existing methods generally rely on either implicit interaction between textual and visual features or auxiliary coarse-grained supervision for cross-modal alignment. However, these methods lack explicit and fine-grained constraints to ensure semantic consistency, causing a mismatch between language and the segmentation outputs. To address this issue, we propose Text-as-Illumination Retinex Network (TIRNet), a novel Retinex-inspired framework that treats text embeddings as semantic illumination for feature modulation, thereby improving semantic consistency in LMIS. TIRNet introduces two key blocks integrated at each decoder stage: (1) the Retinex-inspired Text Modulation Block (RTMB), which employs positive and negative illumination maps to enhance text-relevant foreground features and suppress background interference; and (2) the Consistent Detail Compensation Block (CDCB), which selectively recovers high-frequency details via a consistency-gated mechanism conditioned on illumination reliability. Furthermore, we propose a Multi-Scale Illumination Supervision Loss (MSIS-Loss), comprising a Region-Grounded Contrastive Loss (RGC-Loss) that enforces cross-modal similarity to be concentrated in text-relevant foreground regions and suppressed in background regions, and a Background Suppression Loss (BS-Loss) that provides pixel-level supervision for negative illumination maps, jointly ensuring a precise cross-modal alignment at each decoder stage. Extensive experiments on the MosMedData+ and QaTa-COV19 datasets demonstrate that TIRNet achieves state-of-the-art performance in LMIS. The code is available at: this https URL.
51. 【2606.27784】Improving Adversarial Robustness via Activation Amplification and Attenuation
Link: https://arxiv.org/abs/2606.27784
Authors: Taïga Gonçalves, Yongsong Huang, Tomo Miyazaki, Shinichiro Omachi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: presence of non-robust, non-robust features, adversarial robustness, adversarial attacks, improves adversarial robustness
Comments: Accepted to ECCV 2026
Abstract:The existence of adversarial attacks is often attributed to the presence of non-robust features in neural networks. While prior defenses reduce their impact via pruning, masking, or feature recalibration, we instead propose to jointly learn to amplify and attenuate these signals through a simple activation scaling mechanism. To this end, we introduce Activation Amplification and Attenuation (A3), a lightweight plug-in module that enhances adversarial robustness with minimal modifications of the activations. A3 dynamically rescales the activations using a learnable mask and a scaling factor derived from the original activation magnitudes. The influence of adversarial perturbations can be amplified or attenuated using the same learnable parameters by simply flipping the sign of the scaling operation. The amplified signals serve as negative references to construct novel contrastive and ranking loss functions. Experimental analysis shows that learning to degrade the predictions in amplification mode simultaneously improves adversarial robustness in attenuation mode. Moreover, A3 relies on only a small number of learnable parameters, with most of its behavior being determined by the scaling mechanism rather than additional network capacity. Extensive experiments demonstrate that integrating A3 into different backbones, datasets, and training methods consistently improves adversarial robustness while introducing negligible computational and memory overhead compared to existing plug-in modules. Code is available at: this https URL.
52. 【2606.27779】MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations
Link: https://arxiv.org/abs/2606.27779
Authors: Hejia Chen, Haoxian Zhang, Xu He, Xiaoqiang Liu, Pengfei Wan, Shoulong Zhang, Shuai Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: low-level motor reflexes, Generating lifelike facial, precise low-level motor, dyadic conversations requires, conversations requires reconciling
Comments: Accepted by ECCV 2026
Abstract:Generating lifelike facial animation for dyadic conversations requires reconciling high-level cognitive intent with precise low-level motor reflexes, yet existing methods fall short in the semantic understanding of dialogue context and in precise dynamic control. In this paper, we propose MindFlow, a dual-pathway generative framework inspired by the Ventral-Dorsal pathway model in neuroscience, which decouples generation into two collaborative streams, thereby harmonizing deep semantic reasoning with fine-grained control. In the Ventral module, we transform the conventional Sentence-Action approach into a novel Chunk-State approach that models raw acoustic streams as a context-aware, evolving emotional state chain, capturing subtle paralinguistic nuances and mid-utterance emotional shifts missed by sentence-level modeling. The Dorsal module features a conditional autoregressive flow matching network for high-fidelity facial motion, driven by high-frequency acoustic cues and modulated by emotion states, plus a Selective Acoustic Injector for adaptive audio gating to ensure robustness in talking-and-listening dynamics without interference. Extensive experiments demonstrate that MindFlow achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines.
53. 【2606.27777】TRUST: Efficient Abdominal Trauma Recognition via Image-to-Ultrasound-Video Transfer Learning
Link: https://arxiv.org/abs/2606.27777
Authors: Enguang Wang, Hao Zhou, Shuo Gao, Tuo Liu, Guangquan Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: noninvasive trauma triage, indispensable for rapid, Abdominal ultrasound, Abdominal, Transfer Learning
Comments: Accepted to MICCAI 2026, 11 pages, 5 figures
Abstract:Abdominal ultrasound is indispensable for rapid, noninvasive trauma triage. However, interpreting the subtle dynamic cues embedded in continuous scanning is time-intensive and operator-dependent. Parameter-Efficient Image-to-Video Transfer Learning (PEIVTL), which efficiently adapts pre-trained image models to the video domain, notably through visual-textual alignment, offers a promising paradigm for ultrasound video analysis. Nevertheless, substantial spatiotemporal and semantic variations arising from physician-dependent scanning practices continue to limit the effectiveness and generalizability of this framework. We propose TRUST, a scan-aware PEIVTL framework that explicitly models fine-grained spatiotemporal variations to enable reliable ultrasound video understanding. First, we introduce a Cross-Frequency Collaborative Adapter (CFCA) that establishes mutual constraints between low- and high-frequency components, enhancing discriminative spatial feature extraction under heavy speckle corruption. Second, we design a Multi-Granularity Motion-Aware (MGMA) module that integrates local temporal convolutions with motion-prior-guided global self-attention, jointly capturing stable intra-view patterns and abrupt inter-view transitions to characterize complex scanning dynamics. Third, a Visual Query Semantic Aggregation (VQSA) module dynamically generates text prototypes conditioned on visual features, enabling adaptive visual-textual alignment robust to intra-class variability under diverse scanning conditions. Experiments on in-house ultrasound trauma datasets demonstrate that TRUST outperforms state-of-the-art methods by 9.63% with superior computational efficiency.
54. 【2606.27773】ModaFlow: Modality-Aware Flow Matching for High-Fidelity Virtual Try-On
Link: https://arxiv.org/abs/2606.27773
Authors: Xiangyu Sai, Meysam Madadi, Sergio Escalera, Yong Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Image-based virtual try-on, large clothing-body deformations, simultaneously preserve fine, person body geometries, preserve fine garment
Comments: Preprint
Abstract:Image-based virtual try-on has emerged as a compelling task in e-commerce and augmented reality, yet existing methods struggle to simultaneously preserve fine garment semantics and adapt to diverse person body geometries under large clothing-body deformations. We present ModaFlow, a modality-aware flow-matching based framework for high-fidelity virtual try-on that achieves precise alignment between textual descriptions and garment appearance. Unlike prior methods that treat multimodal conditions uniformly, ModaFlow introduces a modality-aware guidance scheme: visual garment embeddings extracted by a pretrained image prompt adapter provide deterministic, persistent structural guidance, while textual embeddings generated from garment descriptions are controlled via classifier-free guidance (CFG) with adaptive scaling and zero-initialized velocity. To further enhance flow field accuracy, we propose two regularization losses, cosine similarity and perceptual flow discrimination, that jointly improve directional consistency and perceptual realism of the velocity field. Additionally, a mask manipulation strategy stochastically samples among box, transparent, and relaxed masks during training, simulating diverse occlusion scenarios and enabling robust inference under unpaired settings where only a box mask is available. Experiments show that ModaFlow achieves state-of-the-art results in both qualitative and quantitative evaluations, reducing FID by approximately 30% on paired and 20% on unpaired benchmarks.
55. 【2606.27772】An Embedded Real-Time License Plate Recognition System for Complex Traffic Scenes
Link: https://arxiv.org/abs/2606.27772
Authors: Anuki Pasqual, Dulan Lokugeegana, Manimohan Thiriloganathan, Nuthya Rathnayake, Kithsiri Samarasinghe, Udaya S. K. P. Miriya Thanthrige
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Keywords: license plate recognition, license plate, intelligent transportation systems, license plate detection, real-time license plate
Comments: Accepted at IEEE Intelligent Transportation Systems Conference (ITSC) 2026
Abstract:Vehicle license plate recognition is an integral component of intelligent transportation systems. In this work, we present an embedded real-time license plate recognition system customized for developing countries. We address the challenge of handling complex, unstructured traffic scenes with diverse vehicle types while implementing the system on an embedded platform for low-cost deployment. Our method consists of license plate detection on a multi-vehicle image, followed by character recognition on the detected license plates. Both steps use lightweight convolutional neural networks to balance accuracy and efficiency. We also introduce the SL-LPR dataset of Sri Lankan road images, which contains a variety of vehicle types and traffic conditions typically seen in developing countries. On this dataset, the license plate detection and character recognition models achieved 93.6% mAP and 87.88% accuracy, respectively, and were competitive against larger models on several public datasets. To achieve real-time performance in a resource-constrained embedded environment, we applied low-bitwidth quantization using the Brevitas library and implemented FPGA acceleration for the models using the FINN framework. The end-to-end system can operate at 11.5 FPS when implemented on the Xilinx Kria KV260 platform. These results demonstrate that our system is effective for real-time license plate recognition on an embedded device, even in complex traffic scenarios. The SL-LPR dataset is available for research use at: this https URL.
56. 【2606.27771】NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning
Link: https://arxiv.org/abs/2606.27771
Authors: Tianlin Pan, Lianyu Pang, Cheng Da, Huan Yang, Changqian Yu, Kun Gai, Wenhan Luo
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reinforcement learning, degrades perceptual quality, flow-based generators, alignment of flow-based, degrades perceptual
Comments:
Abstract:Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of this drift: across three post-training methods (NFT, AWM, DPO), RL fine-tuning inflates the per-step velocity norm $\|v_\theta\|$ by $5\%$ to $15\%$ relative to the reference. A form of norm inflation has been studied in classifier-free guidance (CFG), where rescaling the velocity back to a reference norm at inference time can mitigate the resulting artifacts. However, this inference-time correction does not transfer cleanly to RL: rescaling $v_\theta$ to match $\|v_{\text{ref}}\|$ at inference time neither improves reward nor fixes the quality degradation, because the inflation is co-adapted into the model weights. Furthermore, an adjoint sensitivity analysis shows that velocity magnitude rescaling carries no coherent first-order reward signal at the batch level, indicating that suppressing norm inflation is unlikely to remove a consistently reward-carrying component. Since inference-time renormalization fails while norm suppression carries no reward cost, training-time intervention is the appropriate strategy. Together, these findings motivate NormGuard, a hinge penalty that activates only when $\|v_\theta\|$ exceeds $\|v_{\text{ref}}\|$ and composes additively with any velocity-local base loss. Across two base models, three post-training methods, and two reward proxies, NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving reward, with gains that amplify under few-step inference and are not explained by early stopping.
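Editor's sketch: the abstract above describes a hinge penalty that is zero until the norm of the fine-tuned velocity exceeds the norm of the reference velocity, and that is added to a base loss. The snippet below is a minimal illustration of that idea in plain Python, not the authors' implementation. The function names, the `weight` parameter, the unsquared form of the hinge, and the per-sample L2 norm are all assumptions; the abstract does not specify them.

```python
import math


def l2_norm(v):
    """Euclidean norm of a velocity vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def norm_hinge_penalty(v_theta, v_ref, weight=1.0):
    """Hypothetical hinge penalty: weight * max(0, ||v_theta|| - ||v_ref||).

    The penalty is exactly zero whenever the fine-tuned velocity norm does
    not exceed the reference norm, so it only acts on norm inflation.
    """
    excess = l2_norm(v_theta) - l2_norm(v_ref)
    return weight * max(0.0, excess)


def total_loss(base_loss, v_theta, v_ref, weight=1.0):
    """Additive composition with an arbitrary base loss (assumed form)."""
    return base_loss + norm_hinge_penalty(v_theta, v_ref, weight)
```

Because the penalty vanishes when the norm is at or below the reference, a model whose velocity norm has not inflated would be trained on the base loss alone under this sketch.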
57. 【2606.27760】PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion
Link: https://arxiv.org/abs/2606.27760
Authors: Zipeng Guo, Lichen Ma, Yu He, Xiaolong Fu, Jingling Fu, Junshi Huang, Yan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: compression of Latent, Latent Diffusion Models, Latent Diffusion, bypass the lossy, lossy compression
Comments:
Abstract:End-to-end pixel-space diffusion models bypass the lossy compression of Latent Diffusion Models (LDMs) but struggle to jointly model low-frequency semantics and high-frequency signals in high-dimensional space. Existing works heavily rely on complex pixel decoders to alleviate this issue. In this paper, we challenge this trend by revealing that these decoders primarily compensate for the optimization difficulties inherent to velocity prediction ($v$-prediction). Under the clean data paradigm ($x$-prediction), they are redundant. Motivated by this insight, we advocate for simplicity over complexity and introduce PixelU, a minimalist, single-stage U-shaped Diffusion Transformer tailored for pixel space. PixelU abandons auxiliary decoders in favor of zero-cost skip connections, which provide an "information highway" that directly routes uncorrupted high-frequency spatial details from shallow to deep layers. To further enable the backbone to focus exclusively on modeling low-frequency semantics, we introduce a constant-channel spatial down-sampling mechanism as a natural low-pass filter, which compresses deep features into a compact, low-frequency semantic manifold. Extensive experiments demonstrate that this decoupling of frequencies could outperform the strong baseline (JiT-G) with only about 1/3 of its computation cost. On ImageNet 256$\times$256 and 512$\times$512, PixelU achieves FID of 1.63 and 1.92 respectively, surpassing recent pixel-space methods and establishing a simple yet powerful new paradigm for end-to-end diffusion models.
58. 【2606.27745】Panoramic Scene Analysis: A Survey from Distortion-Aware Engineering to Sphere-Native Foundation Modeling
Link: https://arxiv.org/abs/2606.27745
Authors: Qinfeng Zhu, Lei Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: providing spatial context, complete visual sphere, spatial context unattainable, Panoramic images capture, single frame
Comments:
Abstract:Panoramic images capture the complete visual sphere in a single frame, providing spatial context unattainable by conventional cameras. Yet this completeness comes at a geometric cost: the 2-sphere cannot be faithfully mapped to the plane, and every planar representation introduces distortions that violate the assumptions underlying standard vision architectures. This survey traces the evolution of panoramic scene analysis along a methodological trajectory, from projection-based adaptation, through distortion-aware engineering, to sphere-native modeling and geometry-aware tokenization for foundation models, and argues that this evolution reflects a progressive deepening of geometric commitment rather than a simple accumulation of techniques. We organize the literature along two orthogonal dimensions: architectural design (how operators interact with spherical geometry) and training paradigm (how knowledge is transferred across domains). Covering dense prediction (semantic segmentation, depth estimation, and room layout estimation), unified multi-task understanding, open-world perception, vision-language reasoning, and dynamic video analysis, we identify a central unresolved tension: among the methods surveyed, none simultaneously delivers strict spherical equivariance and full reuse of perspective-pretrained foundation-model weights, and we argue that this is a structural rather than incidental gap. We further expose five systematic gaps in current evaluation protocols, namely the absence of spherical-area-weighted metrics, seam-consistency testing, polar-robustness stratification, cross-projection generalization, and open-world protocol standardization, and propose a six-point research roadmap toward general-purpose panoramic intelligence. The corresponding repository is publicly available at: this https URL.
59. 【2606.27741】SIFT: Self-Imagination Fine-Tuning for Physically Plausible Motion in Video Diffusion Models
Link: https://arxiv.org/abs/2606.27741
Authors: Ruoyu Wang, Jialun Liu, Huayang Huang, Haibin Huang, Jiepeng Wang, Chi Zhang, Xuelong Li, Yu Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improved visual fidelity, greatly improved visual, Recent advances, violate physical plausibility, visual fidelity
Comments: ECCV 2026
Abstract:Recent advances in video diffusion models have greatly improved visual fidelity, yet their generated motions often violate physical plausibility. We observe a common kinematic failure, "motion entanglement", the unintended coupling of independent motion sources, such as camera movement and object motion. We identify that this issue stems from data bias and the reconstruction-based training design of diffusion models. Training on noisy videos that still retain coarse motion cues inadvertently encourages the model to replicate existing motion without an incentive to learn how to model kinematically-grounded motions. To address this, we propose a Self-Imagination Fine-Tuning (SIFT) paradigm, which enables the model to learn from its own generated videos rather than directly reconstructing real ones, breaking the reconstruction shortcut. We further employ motion-aware discriminative supervision and a progressive hard-case replay strategy to stabilize and accelerate learning. By leveraging freely-generated text prompts, our method can densely cover a broad motion space, including rare or finely-disentangled scenarios that would be costly to collect as video data. Extensive experiments demonstrate that our approach substantially improves the physical realism, motion disentanglement, and controllability of generated videos.
60. 【2606.27729】Learning 1-Bit LiDAR-based Localization with Auxiliary Objective
Link: https://arxiv.org/abs/2606.27729
Authors: Kaijie Yin, Zhiyuan Zhang, Tian Gao, Wentao Zhu, Cheng-zhong Xu, Hui Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous systems operating, fundamental capability, capability for autonomous, autonomous systems, systems operating
Comments: European Conference on Computer Vision (ECCV)
Abstract:6-DoF LiDAR-based localization is a fundamental capability for autonomous systems operating in large-scale outdoor environments. Many deep-learning-based localization methods have achieved promising performance so far. However, as one of the always-on modules competing for limited on-board computational resources, the localization module is expected to consume only a small portion of the overall compute budget. Most existing learning-based methods are still too heavy for this purpose. In contrast, binary neural networks (BNNs) offer an appealing solution, but the 1-bit compression causes severe information loss and performance drop. In this paper, we address this challenge by proposing Binarized LiDAR-based Localization (BiLoc), the first binary neural network framework for 6-DoF LiDAR localization. Specifically, we reinterpret the training of BNNs from the perspective of the information-bottleneck principle, aiming at retaining minimal yet sufficient representations for pose estimation while suppressing redundant variations. And we introduce an auxiliary objective that adaptively regulates information retention in the binary encoder, effectively mitigating the information loss caused by binarization. This auxiliary objective provides additional optimization signals that compensate for the limited representational capacity and the gradient mismatch inherent in BNNs. Extensive experiments on large-scale outdoor LiDAR datasets demonstrate that BiLoc establishes a new state of the art for LiDAR localization with BNNs.
61. 【2606.27720】Scene and Human in One World: Reconstruction in a Feedforward Pass
Link: https://arxiv.org/abs/2606.27720
Authors: Boao Shi, Qiao Feng, Yiming Huang, Lingjie Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: human mesh recovery, mesh recovery, Reconstructing humans, human mesh, remains challenging due
Comments:
Abstract:Reconstructing humans in dynamic scenes from moving monocular cameras remains challenging due to scale ambiguity, human-scene misalignment, and occlusion interference. Rather than treating human mesh recovery and scene reconstruction as separate tasks, we believe that accurate human-scene reconstruction requires the two tasks to mutually inform each other: parametric human models offer semantic structure and metric-scale priors, while scene geometry provides spatial context for human localization and alignment. Built on this insight, we introduce SHOW, a mask-promptable human mesh recovery framework that couples feed-forward 3D scene reconstruction with Human Mesh Recovery in a unified metric space. SHOW injects human semantics and scale priors from parametric human models into normalized point-map prediction, enabling metric-scale scene reconstruction from inherently scale-ambiguous monocular input. In turn, the recovered scene geometry constrains human mesh estimation, encouraging spatially consistent human placement and improved human-scene alignment. To handle complex multi-person and cluttered scenes, SHOW further incorporates a promptable masking mechanism that enables flexible target-human selection while suppressing background distractions and occlusion interference. Through joint training, the model learns both human-aware geometric features and geometry-constrained human features, producing aligned metric-scale reconstructions from monocular human-centric videos. Extensive experiments demonstrate that SHOW improves metric-scale consistency, human-scene alignment, and reconstruction accuracy under challenging camera motion, occlusion, and cluttered backgrounds.
62. 【2606.27718】MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation
Link: https://arxiv.org/abs/2606.27718
Authors: Jun-Sang Yoo, Seung-Won Jung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video frame interpolation, Video frame, frame interpolation, State Space Models, Video
Comments: Accepted in ECCV 2026
Abstract:Video frame interpolation (VFI) remains a challenging task, particularly when dealing with large, non-linear motions and complex occlusions. While flow-based methods are prevalent, they often struggle with ambiguous correspondences. Recent VFI methods based on selective State Space Models (SSMs) are still limited by static grid-based scanning that misaligns with physical motion. In this paper, we propose Motion-Aligned Selective Scan (MASS), a novel framework that reformulates feature scanning from static spatial grids to dynamic motion trajectories. MASS builds a feature sequence along each pixel's flow-guided trajectory and aggregates it with an SSM. Specifically, we introduce a learnable non-linear path integration to approximate complex curved trajectories via residual velocity updates, and a velocity-aware SSM that dynamically adjusts the sampling budget and step size based on motion magnitude. This adaptive strategy allocates denser sampling to fast-motion regions while keeping static regions efficient. Furthermore, the aggregated states guide a refinement module to rectify intermediate flows and masks in an end-to-end manner. Extensive experiments indicate that MASS achieves highly competitive overall performance on standard benchmarks, establishing state-of-the-art results particularly in challenging scenarios with large displacements and complex dynamics.
63. 【2606.27708】ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval
Link: https://arxiv.org/abs/2606.27708
Authors: Siqiao Xue, Chunxue Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: foundation vision-language encoder, model broad generalization, foundation model broad, specialized retrieval task, retrieval task creates
Comments: ZooClaw Team
Abstract:Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe -- full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model -- and outperforms LoRA, larger backbones (up to 1B parameters), and external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely-used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
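Editor's sketch: the recipe above ends with weight interpolation between the fine-tuned model and the base model. A minimal illustration of linear weight-space interpolation follows. It assumes both checkpoints share identical parameter names and shapes, and it represents each parameter as a flat list of floats for simplicity. The function name and the mixing coefficient `alpha` are hypothetical, and the abstract does not say which value the authors use.

```python
def interpolate_weights(base, finetuned, alpha):
    """Return (1 - alpha) * base + alpha * finetuned for every parameter.

    base, finetuned: dicts mapping parameter names to lists of floats.
    alpha: mixing coefficient in [0, 1]; 0 keeps the base model,
           1 keeps the fine-tuned model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(finetuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * f
            for b, f in zip(base[name], finetuned[name])
        ]
    return merged
```

Sweeping `alpha` between 0 and 1 is the usual way to trade in-domain gains against the base model's general behavior; the best value would have to be chosen on held-out data.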
64. 【2606.27700】Joint Transcription and Decryption of Images of Encrypted Handwritten Documents: A Comparison with the Traditional Pipeline
Link: https://arxiv.org/abs/2606.27700
Authors: Marino Oliveros-Blanco, Lei Kang, Alicia Fornés, Beáta Megyesi
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Historical encrypted manuscripts, Historical encrypted, intersection of cryptology, computer vision, challenging problem
Comments: Published at HistoCrypt 2026 (9th International Conference on Historical Cryptology). NEALT Proceedings Series Number 61. Tartu University Library. 10 pages
Abstract:Historical encrypted manuscripts present a challenging problem at the intersection of cryptology, linguistics, paleography, and computer vision. Current automatic decipherment approaches usually rely on a two-stage pipeline: transcription of cipher symbols from manuscript images, followed by decryption into plaintext. However, this design is sensitive to transcription errors, which propagate to the final output. We present Direct Image Decryption, an end-to-end approach that directly maps encrypted manuscript images to plaintext, bypassing the intermediate transcription stage. Using the Copiale cipher as a case study, we build a synthetic data generation pipeline to create large-scale cipher-like training data and compare the traditional pipeline with the proposed joint architecture. Results show that joint image-to-plaintext modeling is a promising alternative to traditional transcription-based pipelines.
65. 【2606.27696】Class-frequency Guided Noise Schedule for Diffusion Models
Link: https://arxiv.org/abs/2606.27696
Authors: Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaojuan Qi, Bei Yu, Hanwang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: multi-scale noise schedule, noise schedule, examine the correlations, correlations between class, CFRG noise schedule
Comments: technical report
Abstract:In this paper, we are the first to examine the correlations between class frequency and the multi-scale noise schedule within diffusion models. For score-based generative models, low-density regions often lead to inaccurately estimated scores, thereby compromising the generation quality. Although the multi-scale noise schedule can alleviate this issue during the diffusion process, low-frequency classes still face the challenge of large low-density regions, resulting in more inaccurate estimated scores than high-frequency classes. Furthermore, high-frequency classes tend to dominate the score space, causing a convergence of most data points towards generating samples from these classes. Consequently, samples generated within low-frequency classes exhibit suboptimal quality and limited diversity. To address this challenge, we propose the Class-frequency Guided (CFRG) noise schedule, leveraging the insight that low-frequency classes should be endowed with larger-scale noises. To illustrate the effectiveness of our method, we conduct experiments on various tasks, including image generation, image classification, and text-to-image generation, using imbalanced datasets, i.e., CIFAR-100-LT, and ImageNet-LT. By employing the CFRG noise schedule, we achieve substantial improvements over baselines, manifesting the crucial role of frequency statistics in noise schedule design.
66. 【2606.27678】Two-Stage Cross-Domain Cervical Abnormality Screening with Cytopathological Image Synthesis and Knowledge Distillation
Link: https://arxiv.org/abs/2606.27678
Authors: Jincheng Li, Yuzhi He, Yihui Zhan, Xinmei Zhang, Yifei Sun, Zelin Liu, Lichi Zhang, Minye Shao, Lili Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: subtle visual differences, cell pathology due, Cross-domain diagnosis remains, cervical cell pathology, impair model generalization
Comments:
Abstract:Cross-domain diagnosis remains a major challenge in cervical cell pathology due to pronounced domain shifts across institutions and the subtle visual differences among disease stages, which jointly impair model generalization. To address these issues, this paper proposes a two-stage framework for cross-domain cervical cell detection. In the first stage, we propose the Spatially-Continuous Unpaired Neural Schrödinger Bridge (SC-UNSB), which constructs a synthetic intermediate domain to mitigate cross-domain distribution shifts by modeling image translation as an entropy-regularized optimal transport process. In the second stage, we propose a dual-level feature alignment strategy within a knowledge distillation framework, which progressively aligns shallow structural features and deep semantic representations to facilitate the transfer of domain-invariant knowledge from the source to the target model. Experimental results demonstrate that the proposed method effectively mitigates domain shift and category ambiguity, improving cross-domain detection performance.
67. 【2606.27677】DIM-WAM: World-Action Modeling with Diverse Historical Event Memory
Link: https://arxiv.org/abs/2606.27677
Authors: Kai Wang, Zhaopeng Gu, Yixiang Chen, Yuan Xu, Qisen Ma, Peng Su, Zhaowen Li, Yan Huang, Liang Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown promising robot-manipulation, promising robot-manipulation performance, jointly predicting future, global task progress, task progress
Comments:
Abstract:World-action models have shown promising robot-manipulation performance by jointly predicting future visual states and actions. However, existing methods mainly rely on short-term history and short-horizon future prediction, which is insufficient for long-horizon tasks whose correct execution depends on earlier observations and task progress. Such temporally dependent tasks require effective use of complementary temporal information, including recent local context, cross-stage historical events, immediate future dynamics, and global task progress. To address long-term forgetting and poor awareness of the global task state, we introduce DiM-WAM, a memory-augmented world-action model that integrates multi-scale historical context, local future dynamics, and global task progress. The memory extracts compact visual event information from real observations, updates multiple memory banks through independent similarity-based merging, and then reads the bank-identity- and time-embedded long-term context to condition video and action denoising. A progress-supervision objective further encourages memory tokens to encode not only completed historical events but also the current task stage and its implications for the remaining task. On RMBench, DiM-WAM raises average success from 28.4% with LingBot-VA to 69.8%, exceeding the explicit-memory Mem-0 baseline at 42.0%. On four real-world Franka tasks, it improves average stage success from 70.7% to 91.5% and full-task success from 52.5% to 80.0%. Project page: this https URL.
68. 【2606.27671】Multi-Modal Conditioned High-Resolution Transformer for Urban Electromagnetic Field Map Prediction
Link: https://arxiv.org/abs/2606.27671
Authors: Do-Eon Kim, Dongryul Park, Seungyoung Ahn, Namwoo Kang, Seong-heum Kim, Seongsin Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Predicting electromagnetic field, cellular network planning, Predicting electromagnetic, Feature-wise Linear Modulation, electromagnetic field
Comments:
Abstract:Predicting electromagnetic field (EMF) strength in urban environments is essential for cellular network planning but computationally expensive with physics-based simulators. We propose a multi-conditioned dense prediction framework that generates 500$\times$500 EMF maps from building layout images and antenna configurations. Our architecture uses a High-Resolution Transformer (HRFormer) backbone with two complementary conditioning mechanisms: Feature-wise Linear Modulation (FiLM) injects scalar antenna parameters into all backbone stages, while cross-attention fuses 1-D radiation pattern tokens with spatial features at the deepest stage. We further introduce transmitter-relative spatial channels encoding distance, proximity, and bearing from the antenna, enabling coordinate-consistent test-time augmentation (TTA) that reduces test MAE by 6.3%. To address the prediction difficulty imbalance across EMF maps, we design a composite loss combining masked L1, multi-scale structural similarity (MS-SSIM), and a focal L1 term that upweights high-signal pixels, outperforming individual loss components in all metrics. Our best model achieves a test MAE of 0.0461, a 25.2% improvement over a plain UNet baseline and 31.8% over an HRFormer-only baseline.
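Editor's sketch: the abstract above uses Feature-wise Linear Modulation (FiLM) to inject scalar conditioning into a backbone. FiLM applies a per-channel scale and shift to a feature map. The snippet below shows only that modulation step, under the assumption that the per-channel `gamma` and `beta` have already been produced from the conditioning inputs by some learned layer, which is not shown. The function name and the nested-list feature layout are illustrative choices, not the paper's code.

```python
def film(features, gamma, beta):
    """Apply FiLM: out[c][i] = gamma[c] * features[c][i] + beta[c].

    features: list of channels, each a flat list of activations.
    gamma, beta: one scale and one shift per channel (assumed to come
                 from a conditioning network, not modeled here).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]
```

With `gamma` fixed to 1 and `beta` fixed to 0 for every channel the features pass through unchanged, which is why FiLM layers are often initialized near that point.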
69. 【2606.27667】Explainable AI for Biodiversity Monitoring and Ecological Image Analysis
Link: https://arxiv.org/abs/2606.27667
Authors: Brinnae Bent, Holly R. Houliston, Jiayi Zhou, Günel Aghakishiyeva, David W. Johnston
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Keywords: enabling automated analysis, transforming biodiversity monitoring, underwater platforms, camera traps, sensing systems
Comments:
Abstract:Artificial intelligence is transforming biodiversity monitoring by enabling automated analysis of ecological imagery collected from camera traps, drones, satellites, underwater platforms, and other sensing systems. These tools can expand the scale and speed of conservation assessments, yet many computer vision models remain difficult to inspect, making it challenging to determine whether predictions are based on ecologically meaningful signals or on spurious correlations, sampling biases, and other artifacts that may undermine conservation decisions. We argue that explainable artificial intelligence (XAI) should become a standard component of ecological model validation because conservation practitioners increasingly depend on understanding not only whether a model is accurate, but why it is accurate. We provide practical guidance for applying XAI to three common ecological computer vision tasks: image classification, object detection, and image segmentation. To illustrate how XAI can support ecological model auditing, refinement, and deployment, we present two case studies using aerial imagery: harbor seal detection and cetacean anatomical segmentation. These examples demonstrate how explanation methods can identify biologically meaningful cues, reveal false positives driven by background and shape confounds, uncover edge and occlusion effects, and guide data collection, augmentation, and retraining strategies. More broadly, they show how explainability can help assess whether model reasoning aligns with ecological understanding. We conclude by identifying key challenges and opportunities. By making model behavior more transparent and scientifically interrogable, XAI can help ensure that AI-supported ecological evidence is more reliable, understandable, and actionable for biodiversity conservation.
70. 【2606.27660】MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving
Link:https://arxiv.org/abs/2606.27660
Authors:Nan Yang,Zhanwen Liu,Linfeng Zhang,Shangyu Xie,Yang Wang,Wenzhuo Zhou,Xiangmo Zhao
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:efficiency issues due, standard multi-view settings, visual token sequences, long visual token, improve generalization
Comments: accepted by ECCV26
Abstract:Vision-Language Models (VLMs) improve generalization and interpretability in autonomous driving but suffer from efficiency issues due to long visual token sequences, particularly in standard multi-view settings. Existing token pruning methods employ fixed pruning rate allocation and static importance metrics, ignoring dynamic inter-view importance differences and the evolving information importance during inference. Our analysis reveals that multi-view VLMs inherently encode task-related view priors in deeper layers and exhibit dynamic information requirements. Motivated by these findings, we propose MVPruner, a two-stage adaptive token pruning method that aligns pruning behavior with the model's dynamic information requirements. The first stage allocates pruning budgets based on the information diversity of each view, and retains tokens with consistent contribution across stages, ensuring semantic representational capacity. The second stage allocates budgets and selects tokens guided by instruction text to guarantee task alignment. Experimental results on four benchmarks demonstrate the superior performance of our method. For example, DriveMM equipped with MVPruner achieves an 87.3% reduction in FLOPs and a 4.97× speedup in the prefilling phase while retaining 98.5% accuracy on the DriveLM benchmark.
71. 【2606.27659】GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion
Link:https://arxiv.org/abs/2606.27659
Authors:Yeji Choi,Jinhyeok Choi,Jaewon Min,Minkyung Kwon,Jin Hyeon Kim,Seungryong Kim
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:geometry-constrained multi-view diffusion, single input, consistent face generation, multi-view diffusion framework, multi-view diffusion
Comments:
Abstract:We present GeoFace, a geometry-constrained multi-view diffusion framework for consistent face generation from a single input. While recent multi-view diffusion models achieve photorealistic synthesis at the per-view level, they lack an explicit mechanism to enforce a shared 3D structure across views, often leading to inconsistent geometry across viewpoints. To address this, GeoFace proposes a unified dual-stream framework for joint generation of multi-view RGB images and 3D face geometry, where the appearance and geometry streams interact through shared attention layers. To encourage the two streams to mutually constrain each other, we introduce a geometry-guided attention alignment loss that supervises the cross-attention between appearance and geometry tokens with 3D-consistent correspondences, enabling the appearance stream to correctly reference pose-invariant geometric cues for robust alignment across viewpoints. Geometry is represented as a canonical UV position map, derived from a FLAME mesh fitted to multi-view observations, serving as a view-invariant shared constraint across all generated views. Experiments on RenderMe-360 and NeRSemble demonstrate that GeoFace consistently outperforms existing methods in both visual quality and cross-view geometric consistency, facilitating more efficient 3D reconstruction.
72. 【2606.27655】Temporal-Emerged Prompting for Segment Anything in Multiframe Infrared Small Target Detection
Link:https://arxiv.org/abs/2606.27655
Authors:Yinghui Xing,Donghao Chu,Shizhou Zhang,Di Xu
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Accurately localizing, infrared sequences remains, localizing and segmenting, sequences remains, SNR
Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Accurately localizing and segmenting small targets in low signal-to-noise ratio (SNR) infrared sequences remains a challenging task. Since targets are often indistinguishable from the background in individual frames, existing methods, even when equipped with advanced foundation models and powerful inter-frame association mechanisms, still fail to detect them. Motivated by the observation that targets tend to emerge gradually from the background over time and become distinguishable, we propose Temporal-Emerged Prompting for Segment Anything Model (TEP-SAM), a principled framework designed to explicitly exploit such temporal-emerged cues to modulate and prompt SAM. TEP-SAM operates by jointly modeling global motion patterns and local motion deviations to locate potential targets. It further enhances target region features by leveraging motion discrepancy, thereby generating temporal-emerged cues for SAM and enabling non-interactive segmentation. By bridging large-scale semantic pretraining with task-specific temporal modeling, TEP-SAM effectively adapts SAM to the challenging multiframe infrared small target detection task. Extensive experiments demonstrate the effectiveness of our approach, particularly under severely low-SNR conditions and in complex dynamic backgrounds.
73. 【2606.27646】VLM-Aware Meta-Optic Front-End Design for Frozen Vision-Language Models
Link:https://arxiv.org/abs/2606.27646
Authors:Chanik Kang,Raphaël Pestourie,Haejun Chung
Categories:Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Keywords:machine-vision pipelines typically, pipelines typically rely, Conventional machine-vision pipelines, aberration correction, produce clean
Comments: 18 pages, 6 figures, 3 tables
Abstract:Conventional machine-vision pipelines typically rely on high-quality optics that produce clean, human-interpretable images, and optical design has therefore been driven by image-level criteria such as resolution, aberration correction, and pixel fidelity. However, such optics are often impractical for size-, cost-, or form-factor-constrained applications, where compact meta-optics offer an attractive alternative but operate under strict physical efficiency limits. We propose CODA, a co-design framework that optimizes a continuous-density meta-optic front-end for frozen-model recognition using differentiable image formation and adjoint-gradient updates of Maxwell-based simulations. CODA directly optimizes the cross-entropy loss of a fixed zero-shot CLIP classifier without learned reconstruction, image signal processing, or image-fidelity auxiliary objectives. In a two-dimensional simulated imaging benchmark on ImageNet-100, CODA improves CLIP ViT-L/14 zero-shot accuracy from 53.75 ± 3.57% with a focal-concentration baseline to 65.41 ± 3.99%. The optimized optics further transfer without re-optimization across CLIP, SigLIP, and DINOv2 on ImageNet-100, CIFAR-100, and Food-101. These results demonstrate that, under constrained meta-optic imaging, downstream recognition can be improved by aligning optical design with frozen vision-model objectives rather than conventional image-formation criteria.
74. 【2606.27644】CascadeOcc: Rethinking 3D Occupancy World Models with Cascaded VQ Representations
Link:https://arxiv.org/abs/2606.27644
Authors:Kyumin Hwang,Wonhyeok Choi,Jaeyeul Kim,Jihun Park,Daehee Park,Sunghoon Im
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:prioritizes intrinsic structural, intrinsic structural hierarchy, extrinsic auxiliary modalities, letter proposes CascadeOcc, occupancy world model
Comments: Accepted to IEEE Signal Processing Letters (SPL), 2026
Abstract:This letter proposes CascadeOcc, a novel occupancy world model that prioritizes intrinsic structural hierarchy over extrinsic auxiliary modalities for autonomous driving. Occupancy world models -- forecasting the future driving environment and planning the driving trajectory -- effectively bridge perception and planning, but current approaches often heavily rely on external modalities or large language models, failing to fully exploit the inherent structural potential of occupancy representations themselves. To enhance representational capacity for complex 3D scenes, we integrate a cascaded Vector Quantized (VQ) mechanism into an autoregressive framework. Following a coarse-to-fine principle, CascadeOcc progressively refines fine-grained details from global structures through a multi-scale architecture. Additionally, we incorporate a TimeMixer to capture multi-scale temporal dependencies, establishing a dual-hierarchy mechanism in both space and time. Experimental results on 4D occupancy forecasting and motion planning benchmarks demonstrate that CascadeOcc achieves superior performance among vision-centric approaches, validating that optimizing inherent representations is a powerful alternative to relying on external foundation models.
75. 【2606.27637】AI-Generated Image Recognition via Fusion of CNNs and Vision Transformers
Link:https://arxiv.org/abs/2606.27637
Authors:Xuan-Bach Mai,Hoang-Minh Nguyen-Huu,Quoc-Nghia Nguyen,Hoang-Tung Vu,Minh-Triet Tran,Trung-Nghia Le
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Artificial Intelligence, produced by Artificial, synthetic data technology, quality are generated, blurring the lines
Comments: SOICT 2024
Abstract:Recent advancements in synthetic data technology have opened a new era where images of remarkable quality are generated, blurring the lines between real-life images and those produced by Artificial Intelligence (AI). This evolution poses a significant challenge to ensuring the reliability and authenticity of data, underscoring the need for robust detection methods. In this paper, we present a robust approach aimed at addressing these pressing concerns. Our methodology revolves around leveraging fusion strategies, combining the strengths of multiple detection methods for identifying AI-generated images. Through extensive experimentation on the CIFAKE dataset, our model showcases remarkable performance, achieving an impressive accuracy rate of 97.32%. This accomplishment underscores the efficacy of our approach in accurately distinguishing between AI-generated images and real-life images, thus contributing to the advancement of data authentication techniques amidst the proliferation of synthetic data.
76. 【2606.27635】Denoising ICF Images with Multiplicative Uniform Noise: A Self-Supervised Study Based on the Log-Domain Noisier2Inverse Framework
Link:https://arxiv.org/abs/2606.27635
Authors:Gyeongha Hwang,Bradley Thomas Wolfe,Naima Naheed
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Inertial Confinement Fusion, Confinement Fusion, Inertial Confinement, Multiplicative Uniform noise, corrupted by Multiplicative
Comments:
Abstract:This paper documents the implementation and evaluation of a self-supervised denoising framework on Inertial Confinement Fusion (ICF) images corrupted by Multiplicative Uniform noise: the Log-Domain Noisier2Inverse framework. This framework is developed and analysed in this work; the key theoretical result -- that minimising the log-domain self-supervised loss is equivalent to supervised learning in the transformed domain -- is presented with full proof. We document significant implementation challenges arising from the unique characteristics of ICF imagery, describe the fixes applied at each stage, and report final quantitative results. The log-domain approach with per-image JSON Uniform noise loading (Variant B) achieves the best result: a mean PSNR of 21.41 dB and SSIM of 0.8358, a +19.46 dB improvement over the noisy input baseline of 1.95 dB, substantially outperforming BM3D log-domain (4.47 dB, SSIM 0.5181) and Noise2Self (4.75 dB, SSIM 0.0177). Variant A, using fixed Gaussian noise loading, achieves 21.39 dB PSNR and SSIM 0.8436. Of the three evaluated methods, Log-Domain Noisier2Inverse and Noise2Self are entirely self-supervised during training, requiring no clean ground truth data; BM3D is a classical filter-based method requiring no training at all. The clean reference images are used solely for quantitative evaluation of all three methods.
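As a small illustration of why a log-domain formulation is natural for multiplicative noise, the sketch below shows that a multiplicative model y = x * n becomes additive after taking logarithms. It does not reproduce the Noisier2Inverse training procedure; the image, the noise range [0.5, 1.5], and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical strictly positive "clean" image and multiplicative Uniform noise.
# The range [0.5, 1.5] is an illustrative assumption, not the paper's setting.
clean = rng.uniform(0.1, 1.0, size=(64, 64))
noise = rng.uniform(0.5, 1.5, size=clean.shape)
noisy = clean * noise

# In the log domain the multiplicative model y = x * n becomes additive:
# log y = log x + log n, so the residual log y - log x equals log n.
log_residual = np.log(noisy) - np.log(clean)
max_abs_err = np.max(np.abs(log_residual - np.log(noise)))
```

Once the noise is additive, denoising objectives designed for additive noise can in principle be applied to the log-transformed image and the result mapped back with an exponential.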
77. 【2606.27608】Qwen-Image-2.0-RL Technical Report
Link:https://arxiv.org/abs/2606.27608
Authors:Yixian Xu,Kaiyuan Gao,Yuxiang Chen,Yilei Chen,Zecheng Tang,Zihao Liu,Zikai Zhou,Deqing Li,Hao Meng,Kuan Cao,Jiahao Li,Jie Zhang,Liang Peng,Lihan Jiang,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiaoyue Chen,Yan Shu,Yanran Zhang,Yi Wang,Yu Wu,Yujia Wu,Zekai Zhang,Zhendong Wang,Xiao Xu,Kun Yan,Chenfei Wu
Categories:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:applies reinforcement learning, human feedback, post-training pipeline, pipeline that applies, applies reinforcement
Comments: 16 pages, 6 figures, 1 table
Abstract:We present Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to improve both the visual quality and instruction-following capability of the Qwen-Image-2.0 diffusion model. To provide reliable reward signals, we construct task-specific composite reward models by fine-tuning vision-language models with a pointwise scoring paradigm and chain-of-thought reasoning. For text-to-image generation, the reward models cover alignment, aesthetics, and portrait fidelity dimensions. For image editing tasks, the reward system addresses instruction-following accuracy and face identity preservation. Building on this reward system, we develop a scalable GRPO-based RL training framework, incorporating a hybrid classifier-free guidance (CFG) strategy to preserve pre-trained knowledge, prompt curation via intra-group reward range filtering, and per-category reward weight calibration. To merge the task-specialized RL policies for T2I and editing, we propose on-policy distillation as the final training stage, which consolidates multiple teachers into a single student model through trajectory-level velocity matching. Extensive evaluation shows that Qwen-Image-2.0-RL achieves 57.84 overall score on Qwen-Image-Bench (+2.61 over the base model), Elo ratings of 1193 in text-to-image arena (+78) and 1349 in image edit arena (+93), demonstrating consistent gains in aesthetic quality, prompt adherence, and editing accuracy.
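The abstract mentions GRPO-based training and prompt curation via intra-group reward range filtering. A minimal sketch of the general idea follows, assuming the usual GRPO convention of normalizing rewards within the group of samples drawn for one prompt. The filter rule, the `min_range` threshold, and the function names are hypothetical and are not taken from the report.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def keep_prompt(rewards, min_range=0.1):
    """Hypothetical filter: keep a prompt only if its rewards spread enough."""
    r = np.asarray(rewards, dtype=float)
    return bool((r.max() - r.min()) >= min_range)

# Toy reward groups; the values are made up for illustration.
groups = {
    "prompt_a": [0.2, 0.9, 0.5, 0.4],  # rewards differ, so the group is informative
    "prompt_b": [0.7, 0.7, 0.7, 0.7],  # zero range, so every advantage would be 0
}
kept = {p: group_relative_advantages(r) for p, r in groups.items() if keep_prompt(r)}
```

A group whose samples all receive the same reward yields zero advantages and hence no gradient signal, which is one plausible motivation for filtering on reward range.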
78. 【2606.27605】On the stability of scale-space metrics
Link:https://arxiv.org/abs/2606.27605
Authors:William Leeb
Categories:Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
Keywords:Gaussian scale-space representations, functions' Gaussian scale-space, functions' Gaussian, Gaussian scale-space, scale-space representations
Comments: 36 pages, 7 figures
Abstract:We study the stability of a classical family of metrics defined over functions' Gaussian scale-space representations, focusing on the comparison of images (functions of two variables). These metrics have precedents both in harmonic analysis, specifically the theory of Besov spaces, and in classical methods of image processing; special cases are also known to be metrically equivalent to certain Wasserstein distances. We quantify these metrics' robustness to geometric deformations, and introduce rotationally-invariant versions that are stable to changes in angle when comparing tomographic projections. We also describe computationally efficient algorithms for evaluating the metrics from finite samples, and prove their robustness to additive noise. The results are illustrated through numerical experiments.
79. 【2606.27604】Spectral Subsurface Scattering from RGB via Biophysical Skin Inversion
Link:https://arxiv.org/abs/2606.27604
Authors:Carlos Aliaga,Adrian Jarabo
Categories:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords:path tracing-based rendering, paper we present, present a spectral, tracing-based rendering, rendering of subsurface
Comments: 14 pages, 9 figures
Abstract:In this paper we present a spectral optical inversion for skin for path tracing-based rendering of subsurface scattering. Skin is a complex multilayered medium, with appearance determined by the mixture of biophysical chromophores. However, current methods rely on medium homogenization, with optical parameters obtained via albedo inversion from a reflectance texture and hand-tuned scattering distance and anisotropy. This results in significant art-skilled manual labor for authoring, and an inaccurate scattering profile for skin. To solve these problems, we generalize existing albedo inversion techniques, and propose a framework that predicts full-spectral skin scattering parameters from a single RGB diffuse albedo. Our method builds upon a new mixture-of-media representation that approximates the aggregated multilayered appearance of skin by mixing the aggregated scattering of three uncorrelated media. We train a chained neural decoder that maps RGB diffuse albedo to the optical properties of the mixture of media, including anisotropy, scattering radius and scattering albedo. Then, we show this mixture can be used in a random-walk-based path tracer with minimal modifications, by simply randomly selecting the medium to traverse.
80. 【2606.27596】Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding
Link:https://arxiv.org/abs/2606.27596
Authors:Liu Yu,Can Chen,Ping Kuang,Zhikun Feng,Fan Zhou,Gillian Dobbie
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:Large Vision-Language Models, exhibit sophisticated reasoning, Large Vision-Language, Vision-Language Models, exhibit sophisticated
Comments: 29 pages, 25 figures. Accepted by ICML 2026
Abstract:Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at this https URL.
81. 【2606.27584】CoIn: Comprehensive 2D-3D Inpainting with Gaussian Splatting Guidance
Link:https://arxiv.org/abs/2606.27584
Authors:Hana Kim,Minje Kim,Tae-Kyun Kim
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:reconstructing areas corrupted, leverage Gaussian Splatting, limited viewpoints, essential for reconstructing, reconstructing areas
Comments:
Abstract:3D scene inpainting is essential for reconstructing areas corrupted by occlusions or limited viewpoints. While recent methods leverage Gaussian Splatting (GS) for efficient 3D editing, they often depend on precise multi-view segmentation masks and are inherently constrained to object removal tasks. We propose CoIn, a novel framework that bridges 2D inpainting models and 3DGS through a multi-stage consistency pipeline. Our approach first generates initial inpainted images using a diffusion model, enabling the use of arbitrary-shaped masks and diverse tasks like object insertion. We then introduce Reference Adaptive GS with Feature Attention to reconstruct a coarse 3D scene by adaptively weighing towards a reference view (2D - 3D). This 3D representation provides geometric guidance to the diffusion process via GS-based Reference Feature Warping, ensuring multi-view consistency (3D - 2D). Finally, a Texture-Enhancing Discriminator refines the 3D scene to achieve high photometric realism (2D - 3D). Experiments show that CoIn, effectively leveraging bidirectional information flow, achieves state-of-the-art performance and effectively handles both object removal and object insertion with flexible mask input.
82. 【2606.27582】Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification
Link:https://arxiv.org/abs/2606.27582
Authors:Duarte Leão,Diogo Pereira Araújo,Catarina Barata,Carlos Santiago
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Prototype-based neural networks, neural networks aim, provide intrinsic interpretability, Prototype-based neural, neural networks
Comments:
Abstract:Prototype-based neural networks aim to provide intrinsic interpretability by grounding predictions in a small set of part prototypes. However, modern vision backbones typically operate in normalized, directional embedding spaces where each semantic part exhibits substantial intra-class variability. As a result, point prototypes often become redundant or unstable, hurting both explanation quality and robustness. We propose vMFProto, a distributional part-prototype framework that models each class as a mixture of von Mises-Fisher components on the hypersphere. Each prototype learns its own concentration, capturing part-specific variability, and we use entropic optimal transport (OT) to obtain structured patch-to-prototype assignments. A two-stage training schedule performs OT-driven prototype discovery followed by end-to-end refinement with patch-level distillation and distribution-aware diversity regularization. Experiments on CUB-200-2011, Stanford Dogs, and Stanford Cars with frozen DINO backbones show that vMFProto achieves state-of-the-art explanation quality (consistency, stability, and distinctiveness) with competitive accuracy. Qualitative results confirm that vMFProto yields localized, non-redundant part evidence.
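To make the phrase "entropic optimal transport (OT) to obtain structured patch-to-prototype assignments" more concrete, here is a generic Sinkhorn sketch on unit-normalized embeddings. It is not the vMFProto implementation: the cosine-distance cost, uniform marginals, the regularization `eps=0.5`, and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropic OT between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass per patch
    b = np.full(m, 1.0 / m)   # mass per prototype
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 16))   # toy patch embeddings
protos = rng.normal(size=(4, 16))    # toy prototype directions
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

cost = 1.0 - patches @ protos.T      # cosine distance on the hypersphere
plan = sinkhorn(cost)                # soft patch-to-prototype assignment
```

The resulting `plan` spreads each patch's mass over prototypes while forcing every prototype to receive an equal share, which is one way to discourage redundant or unused prototypes.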
83. 【2606.27579】Distribution-based deep multiple instance learning for tumor proportion scoring in NSCLC
Link:https://arxiv.org/abs/2606.27579
Authors:Krzysztof Pysz,Artur Bartczak,Jarosław Kwiecień,Piotr Krajewski,Witold Dyrka
Categories:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
Keywords:cell lung cancer, non-small cell lung, Accurate assessment, tumor proportion score, lung cancer
Comments:
Abstract:Accurate assessment of tumor proportion score (TPS) in non-small cell lung cancer (NSCLC) is critical for treatment planning and prognosis. Key challenges include the tedious manual work required to annotate each slide, combined with the limited number of experts certified for this task. Multiple instance learning (MIL) has proven to be an effective approach for predicting TPS scores at the slide level; however, existing methods struggle with non-expressive (zero class) images. Our approach involves two models: (1) an embedding-extraction and multiclass-classification network that captures the histopathological features of individual patches, and (2) a MIL model that aggregates these embeddings to predict zero-inflated beta (ZIBeta) parameters representing the overall TPS probability distribution for the entire slide. Using only slide-level TPS scores as labels, we demonstrate how this end-to-end framework can leverage a novel distribution-based architecture to improve prediction accuracy and explainability. ZIBeta modeling significantly outperforms baseline linear and ridge regression while capturing expected accuracy through distribution concentration.
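For readers unfamiliar with the zero-inflated beta (ZIBeta) distribution mentioned above, the sketch below writes out its log-likelihood and mean using the standard textbook definition: a point mass `pi0` at zero mixed with a Beta(alpha, beta) density on (0, 1). The parameter values are made up, and how the paper's MIL model predicts or trains these parameters is not shown.

```python
import math

def zibeta_log_likelihood(y, pi0, alpha, beta):
    """Log-likelihood of y in [0, 1) under a zero-inflated Beta.

    pi0 is the probability of an exact zero; (alpha, beta) shape the Beta part.
    """
    if y == 0.0:
        return math.log(pi0)
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_beta_pdf = log_norm + (alpha - 1) * math.log(y) + (beta - 1) * math.log(1 - y)
    return math.log(1 - pi0) + log_beta_pdf

def zibeta_mean(pi0, alpha, beta):
    """Expected value: the zero mass contributes nothing to the mean."""
    return (1 - pi0) * alpha / (alpha + beta)

# Illustrative parameters only (not from the paper).
mean_tps = zibeta_mean(pi0=0.2, alpha=2.0, beta=6.0)
ll_zero = zibeta_log_likelihood(0.0, 0.2, 2.0, 6.0)
ll_pos = zibeta_log_likelihood(0.25, 0.2, 2.0, 6.0)
```

A large `alpha + beta` gives a sharply concentrated Beta component, which is the sense in which distribution concentration can act as a confidence signal.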
84. 【2606.27576】DeLux: Cross-Modal Local Artifact Restoration in Video Using Neuromorphic Data
Link:https://arxiv.org/abs/2606.27576
Authors:Bartosz Stachowiak,Dariusz Brzezinski
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Conventional RGB cameras, irrecoverable information loss, RGB cameras suffer, Conventional RGB, necessitates computational restoration
Comments:
Abstract:Conventional RGB cameras suffer from lighting artifacts such as flare, glare, flicker, and overexposure, leading to irrecoverable information loss that necessitates computational restoration. However, existing approaches treat these problems in isolation, failing to recover structural details completely obscured by complex spatially discrete image degradations. In this paper, we propose a novel cross-modal restoration paradigm and present DeLux, a modular proof-of-concept pipeline that leverages neuromorphic event streams as a structural prior to guide the targeted detection and inpainting of lighting artifacts in RGB video. Validation on synthetic benchmarks and real-world automotive footage demonstrates that DeLux effectively suppresses local artifacts and restores affected regions. The proposed approach outperforms existing RGB-only baselines and event-guided HDR models, achieving an average MS-SSIM of over 0.99 across all artifact types and demonstrating up to an 88% reduction in artifact severity in real-world automotive footage. The synthetic artifact generation tools and curated real-world evaluation datasets are made publicly available to foster future research on cross-modal restoration.
85. 【2606.27575】Perceptual 3D Simulation With Physical World Modeling
Link:https://arxiv.org/abs/2606.27575
Authors:Wanhee Lee,Klemen Kotar,Rahul Mysore Venkatesh,Jared Watrous,Daniel L. K. Yamins
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:goal in vision, central goal, scene, Predicting, world model
Comments: Published as a conference paper at CVPR 2026
Abstract:Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3D transform signal for conditioning the world model at inference time. The persistent scene memory integrates predictions over time, enabling online updates and consistency under uncertainty. By combining learned inference with explicit geometric structure, P3Sim balances data-driven flexibility with built-in inductive bias. This design yields a flexible perceptual simulator that generalizes across diverse 3D transformation tasks, such as novel view synthesis, object manipulation, and dynamic scene prediction, advancing toward general purpose 3D scene understanding and transformation.
86. 【2606.27556】Radar Guided Camera Verification for Automatic Emergency Braking Rethinking Object Detection in Radar Camera Fusion
Link:https://arxiv.org/abs/2606.27556
Authors:Ram Charan Akula,Sivanathan Kandhasamy,Manikandan Ganesan
Categories:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords:Automatic Emergency Braking, Emergency Braking AEB, Automatic Emergency, Emergency Braking, proper visual confirmation
Comments: 8 pages, 8 figures
Abstract:Radar-camera fusion is widely used in Automatic Emergency Braking (AEB) systems because radar provides reliable range and velocity measurements while cameras provide visual confirmation of the objects. Most deployed systems perform this confirmation using computationally intensive object detectors. However, if the radar has already localized a target, the camera may only need to verify the obstacle's presence rather than solve the full problem of identifying the object. Our work proposes a radar-scoped edge density gate that performs obstacle verification within radar-guided image regions of interest (ROIs). This method requires no training data, model weights, or GPU acceleration and was integrated into a complete radar-camera fusion AEB system with brake-by-wire actuation. Evaluated on a real instrumented vehicle across 72 driving sessions and 131,603 camera frames, the proposed approach reduced the camera search space by up to 98.7 percent, achieved a mean processing latency of 0.121 ms per ROI, an AUC of 0.898, and a recall of 0.994. Across 33 staged threat scenarios, the complete AEB system recorded zero missed brake events.
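The following is a rough sketch of what an edge density check inside a radar-provided region of interest could look like. The abstract does not specify the edge operator or thresholds, so the finite-difference gradient, `grad_thresh`, `density_thresh`, the ROI format `(x0, y0, x1, y1)`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def edge_density(gray, roi, grad_thresh=30.0):
    """Fraction of ROI pixels whose gradient magnitude exceeds grad_thresh."""
    x0, y0, x1, y1 = roi
    patch = gray[y0:y1, x0:x1].astype(float)
    gy, gx = np.gradient(patch)  # finite-difference image gradients
    return float(np.mean(np.hypot(gx, gy) > grad_thresh))

def verify_obstacle(gray, roi, density_thresh=0.05):
    """Hypothetical gate: confirm an obstacle if the ROI has enough edges."""
    return edge_density(gray, roi) >= density_thresh

# Synthetic frame: a bright rectangle on a flat background.
frame = np.zeros((120, 160), dtype=np.uint8)
frame[40:80, 60:100] = 200

hit = verify_obstacle(frame, (50, 30, 110, 90))  # ROI covering the rectangle
miss = verify_obstacle(frame, (0, 0, 40, 30))    # ROI over empty background
```

Because the check only touches pixels inside the ROI and uses simple array arithmetic, it needs no trained weights, which matches the training-free, CPU-only spirit described in the abstract.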
87. 【2606.27554】Understanding Cross-Rig Generalization in Automotive Perception: a Multi-Rig Benchmark and Rig Variation Metrics
Link:https://arxiv.org/abs/2606.27554
Authors:Tim Alexander Bader,Tim Dieter Eberhardt,Maximilian Dillitzer,Wilhelm Stork
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:Camera-based perception systems, real-world vehicle fleets, vehicle fleets exhibit, fleets exhibit substantial, Rig Contrastive Distance
Comments: Accepted at ECCV 2026; Project Page: [this https URL](https://badertim.github.io/plentiful-carla-camera-rigs)
Abstract:Camera-based perception systems for autonomous driving are typically developed and evaluated using fixed sensor rigs, while real-world vehicle fleets exhibit substantial variation in camera placement, orientation, field of view, and camera count. This mismatch introduces a cross-rig domain gap in which only the geometric observation process changes. To study this effect under controlled conditions, we introduce Plentiful CARLA Camera Rigs, a benchmark that renders identical driving scenes under 14 systematically designed camera rigs. This setup enables direct analysis of cross-rig generalization without confounding changes in scene content or appearance. Using the benchmark, we analyze cross-rig transfer behavior of representative multi-view perception architectures and observe substantial performance shifts induced by geometric rig variation. To facilitate structured analysis, we further introduce two calibration-based descriptors derived from rig metadata: Rig Variance, capturing internal rig diversity, and Rig Contrastive Distance, measuring geometric discrepancy between rigs. Our experiments show that geometric rig differences strongly correlate with relative cross-rig performance shifts and that Rig Contrastive Distance provides a reliable proxy for ranking transfer difficulty between sensor rigs.
88. 【2606.27547】Beyond MoCap: Scaling Motion Tokenizers with Synthetic Human Motion for Generative Modeling
Link:https://arxiv.org/abs/2606.27547
Authors:Yiwen Yan,Wanning He,Yu-Wing Tai
Categories:Computer Vision and Pattern Recognition (cs.CV)
Keywords:complex human movements, highly dynamic motions, motion, Human motion generation, Human motion
Comments:
Abstract:Human motion generation models are fundamentally constrained by the limited diversity of motion capture datasets, which predominantly contain common, repetitive actions and fail to cover the long tail of complex human movements, resulting in a restricted motion vocabulary in learned latent representations and poor generalization to rare, compositional, and highly dynamic motions. In this work, we propose a framework for expanding the motion representation space by leveraging large-scale synthetic human motion, introducing a data generation pipeline that produces diverse, physically plausible motion sequences beyond the distribution of existing datasets and integrating it with a redesigned VQ-VAE tokenizer that adapts to this expanded motion space. Unlike conventional tokenizers trained on narrow data distributions, our approach jointly scales both the training distribution and the discrete codebook, enabling the model to capture a significantly richer set of motion primitives. We demonstrate that training with synthetic motion substantially improves the coverage and compositionality of the learned motion vocabulary, leading to consistent gains across motion generation tasks such as text-to-motion and motion continuation, while remaining fully compatible with existing frameworks including MotionGPT. Our results suggest that the primary bottleneck lies in the limited support of the learned motion representation, rather than model architecture alone. Scaling synthetic motion in tandem with representation learning offers a principled path toward more expressive, controllable, and generalizable human motion synthesis.
89. 【2606.27537】MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
Link: https://arxiv.org/abs/2606.27537
Authors: Haoyu Chen,Kaichen Zhou,Hang Hua,Kaile Zhang,Jingwen Qian,Wufei Ma,Haonan Chen,Chunjiang Liu,Yizhou Zhao,Xiaoyuan Wang,Weiyue Li,Alan Yuille,Paul Pu Liang,Yilun Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video generation models, simulate dynamic environments, Video generation, generation models aspire, aspire to simulate
Comments:
Abstract:Video generation models aspire to simulate dynamic environments, and several benchmarks now evaluate memory consistency across frames. However, most assess consistency only while the target remains in view, and the few that force objects out of view evaluate static scenes where nothing changes during occlusion. To bridge this gap, we introduce MemoBench, a diagnostic benchmark built around the disappear-and-reappear paradigm in dynamically changing environments: a target object undergoes a physical process, disappears from view, and must be correctly recovered in its updated state upon reappearance. We curate 360 ground-truth clips spanning synthetic and real-world scenes, and design an evaluation suite combining automated metrics with VQA-based assessment across four diagnostic pillars. Evaluation of eight state-of-the-art models reveals key insights and open challenges regarding memory consistency under the disappear-and-reappear paradigm.
90. 【2606.27527】Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
Link: https://arxiv.org/abs/2606.27527
Authors: Thomas Shih-Chao Liang,Zhuoran Yu,Yong Jae Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, large-scale text pretraining, modalities remains underexplored, possess broad conceptual
Comments: Accepted by ICML 2026
Abstract:Large Language Models (LLMs) possess broad conceptual knowledge acquired through large-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose LaViD--Language-to-Visual Knowledge Distillation--a simple and effective framework for transferring high-level semantic knowledge from a language-only teacher to a vision-only student model. Instead of relying on paired multimodal data, LaViD elicits conceptual signals from an LLM by prompting it to generate multiple-choice questions (MCQs) that probe semantic distinctions between visual classes. Each class is mapped to a soft label distribution over these MCQs, forming a rich conceptual signature that guides the student through an auxiliary distillation loss. Notably, despite using a language-only teacher without access to image data, LaViD consistently outperforms recent methods like MaKD that distill from vision-language models across multiple fine-grained benchmarks. It also achieves competitive or superior performance compared to state-of-the-art visual distillation methods such as DKD and MLKD, with further gains when combined with logit standardization. On the Waterbirds dataset, LaViD substantially improves worst-group accuracy, demonstrating enhanced robustness to spurious correlations with distillation. Code is available at this https URL.
91. 【2606.27514】Tessellating The Earth
Link: https://arxiv.org/abs/2606.27514
Authors: Daniel Cher,Hamza Iqbal,Eric Xing,Brian Wei,Nathan Jacobs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Geolocation encoders, map geographic coordinates, learned representations, capturing visual, visual and non-visual
Comments: European Conference on Computer Vision -- ECCV 2026
Abstract:Geolocation encoders, which map geographic coordinates to learned representations, are emerging as an effective means of capturing visual and non-visual characteristics from a latitude-longitude pair alone. However, existing approaches project coordinates onto fixed bases (e.g., spherical harmonics), allocating representational capacity uniformly and devoting equal resources to the open ocean and to a developing city. We introduce Tessellating the Earth (TTE), a location encoder built from learnable Spherical Voronoi partitions that concentrates representational capacity where it is needed in a fully differentiable, end-to-end manner. Each Voronoi site carries its own embedding and migrates during training toward discriminative areas. To bridge the gap between local spatial structure and global semantic understanding, we introduce global semantic tokens: a set of shared learnable concept tokens that distill semantic knowledge from the satellite imagery into a compact vocabulary the location encoder can reference at inference, enabling geographically distant sites covering similar environments to share semantics. TTE sets a new state of the art for location encoders across a suite of geospatial classification and regression tasks, and achieves the strongest results when used as a geographic prior for fine-grained species classification on iNaturalist-2018. Code and weights are available at this https URL.
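To make the idea of a spherical Voronoi partition concrete, here is a minimal sketch of the hard cell lookup only: a latitude-longitude pair belongs to the cell of the site with the smallest great-circle distance, which is the site with the largest dot product on the unit sphere. The abstract describes a learnable, differentiable version with per-site embeddings; that part is not reproduced here, and the function names and site coordinates are hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3-D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def nearest_site(lat_deg, lon_deg, sites):
    """Index of the site with the largest dot product (smallest great-circle distance)."""
    p = to_unit_vector(lat_deg, lon_deg)

    def dot(site):
        s = to_unit_vector(*site)
        return sum(a * b for a, b in zip(p, s))

    return max(range(len(sites)), key=lambda i: dot(sites[i]))


# Hypothetical Voronoi sites as (latitude, longitude) pairs.
sites = [(0.0, 10.0), (45.0, 90.0), (-30.0, -60.0)]

cell = nearest_site(5.0, 12.0, sites)
print(cell)
```

In a learned encoder the returned index would select that site's embedding, and moving a site during training changes which locations fall into its cell.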
92. 【2606.27509】Structured-Li-GS: Structured 3D Gaussians Splatting with LiDAR Incorporation and Spatial Constraints
Link: https://arxiv.org/abs/2606.27509
Authors: Huaiyuan Weng,Huibin Li,Chul Min Yeum
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Structured framework, Gaussian Splatting, lightweight Gaussian Splatting, Gaussian Splatting pipeline, develop a Structured
Comments: 9 pages, ISPRS Congress 2026
Abstract:In this study, we develop a Structured framework for Gaussian Splatting (3DGS) with LiDAR integration (Structured-Li-GS). It is a lightweight Gaussian Splatting pipeline that leverages LiDAR-inertial-visual SLAM. Structured-Li-GS achieves high-quality 3D reconstructions with fewer Gaussians by training on accurate, dense, colorized point clouds. Gaussian primitives are anchored using sub-sampled point clouds, and their ellipsoidal parameters are initialized from local surface geometry. Our training strategy integrates a comprehensive set of loss terms, including photometric, flattening, offset, depth, and normal losses, guided by the dense point cloud, enabling accurate reconstruction without Gaussian densification. This approach produces up-to-scale, high-fidelity results with a moderate model size. For experimental validation, we develop a custom hardware-synchronized LiDAR-camera handheld scanner. Experiments on both benchmark datasets and our real-world in-house dataset demonstrate that Structured-Li-GS surpasses state-of-the-art methods while using fewer Gaussians.
93. 【2606.27505】TruEye: Fine-Grained Detection of AI-Generated Human Subjects in Images
Link: https://arxiv.org/abs/2606.27505
Authors: Jay Barot,Dan Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, costly Large Language, Internet, Large Language, social media users
Comments: 18 Pages, 3 figures
Abstract:AI-generated images are proliferating across the Internet. While some are used for entertainment, others are weaponized for fraud and social engineering attacks on social media users. Existing detectors overfit to generators seen during training, treat detection as opaque binary classification, or rely on costly Large Language Models (LLMs) to explain their outputs. In this paper, we present TruEye, a novel model for fine-grained detection and localization of AI-manipulated or AI-generated humans and scenes. Unlike conventional detectors that assign a single authenticity label, TruEye is the first to distinguish among five compositional categories of synthetic content, including the most challenging case in which a real human is composited into a real scene where they were never physically present. At its core is a mask-conditioned dual-stream transformer that separates human and scene tokens while preserving patch-level spatial correspondence. Specialized reasoning within each stream and region-gated cross-attention enforce semantic coherence between subject and background, while token-level supervision and global compositional classification yield robust, interpretable predictions without invoking an LLM. By restricting intra-stream attention to semantically coherent tokens, TruEye also runs over 100x faster than LLM-based competitors. Experiments on 6 datasets and our newly curated FineSyn dataset show that TruEye surpasses state-of-the-art detectors with higher accuracy, faster inference, and stronger generalization to unseen AI-generated or manipulated images.
94. 【2606.27504】ReWorld: Learning Better Representations for World Action Models
Link: https://arxiv.org/abs/2606.27504
Authors: Tianze Xia,Lijun Zhou,Kaixin Xiong,Jingfeng Yao,Yu Zhu,Zhenxin Zhu,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: future environment evolution, model future environment, World Action Models, offering a scalable, autonomous driving
Comments: 19 pages,3 figures
Abstract:World Action Models (WAMs) model future environment evolution under action conditioning, offering a scalable paradigm for autonomous driving. However, existing approaches focus largely on model architecture design, and how a WAM can efficiently learn better world representations for planning remains underexplored. To address this gap, we propose ReWorld, the first representation learning framework specifically designed for autonomous-driving world action models. In WAMs, standard training supervises only the output ends of the generation and planning modules, leaving the intermediate representations that carry world knowledge to be shaped only indirectly, as byproducts of fitting these outputs. The core idea of ReWorld is to treat intermediate representations as direct targets of optimization, shaping them along three complementary dimensions. On the Video DiT responsible for generation, we impose future-predictive supervision on its intermediate representations. On the Action DiT responsible for planning, we first align its intermediate representations cross-modally with the video world representation, then further shape them to be discriminative around safety-critical boundaries via hard-negative supervision. In addition, we systematically analyze the effectiveness of existing representation learning methods in video generation world models, and discuss why their performance is limited on this task. Experiments on nuScenes and NAVSIM show that ReWorld improves fine-tuned video generation by 23.9% in FVD (81.3 to 61.9), raises closed-loop PDMS from 89.1 to 90.4 without any post-training such as RL or post-processing, and accelerates from-scratch convergence by approximately 2x.
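The headline numbers in this abstract can be checked with simple arithmetic. The sketch below assumes the 23.9% figure is the relative reduction in FVD (where lower is better) and that the PDMS change is an absolute difference; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# FVD values quoted in the abstract (lower is better).
fvd_gain = relative_reduction(81.3, 61.9)

# PDMS values quoted in the abstract (higher is better), as an absolute difference.
pdms_gain = 90.4 - 89.1

print(round(fvd_gain, 1), round(pdms_gain, 1))
```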
95. 【2606.27500】Aloe-Vision: Robust Vision-Language Models for Healthcare
Link: https://arxiv.org/abs/2606.27500
Authors: Jaume Guasch-Martí,Enrique Lopez-Cuena,Martín Suárez-Fernández,Jordi Bayarri-Planas,Anna Arias-Duart,Dario Garcia-Gasulla
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Large Vision-Language Models, promising research direction, research direction due, Large Vision-Language, specialized in healthcare
Comments: MIDL 2026
Abstract:Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. However, progress is constrained by the scarcity of high-quality medical multimodal data, concerns about robustness in safety-critical settings, and the narrow and potentially contaminated evaluation benchmarks that limit reliable assessment. To address these issues, the field requires state-of-the-art solutions to be fully open and reproducible systems in which all components can be inspected, evaluated, and improved. This work introduces Aloe-Vision-Data, a large-scale, quality-filtered mixture which integrates both medical and general domains across multimodal and text-only sources, designed for direct use in model fine-tuning. Building on this dataset, we train the Aloe-Vision family of medical LVLMs, openly released with full weights, training recipes and data, in two scales (7B and 72B). Through comprehensive benchmarking, we demonstrate that high quality training mixtures produce balanced LVLMs which yield significant gains over the baseline models without compromising general capabilities, achieving competitive performance with respect to state-of-the-art alternatives. To support reliable evaluation, we introduce CareQA-Vision, a carefully curated vision benchmark derived from MIR and EIR exams, the residency entrance exams for medical and nursing specialists in Spain, offering novel vision questions with low likelihood of contamination. Finally, we show that current LVLMs remain vulnerable to adversarial and misleading inputs, underscoring reliability challenges in clinical contexts.
96. 【2606.27499】DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection
Link: https://arxiv.org/abs/2606.27499
Authors: Yujin Tang,Chenming Shang,Ruize Xu,Nikhil Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: matured rapidly, text side, interactive environment, existing benchmarks, agent genuinely
Comments: 16 pages
Abstract:Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write down. We introduce DMV-Bench (Code: this https URL), the first interactive benchmark for multimodal-agent visual memory. DMV-Bench is built on a controlled home-furnishing e-commerce catalogue of 1,000 product variants in which a text-leakage contract keeps the discriminative signal of each task in the pixels alone. Across a chain of autonomous shopping sessions, every visited product image carries a unique, pre-rendered incidental cue, and the agent is later asked to recall a particular cued product and navigate to its URL. Inspired by dual-coding theory, we propose DualMem, a memory architecture that maintains a visual and a verbal code in parallel. On DMV-Bench, DualMem outperforms a caption baseline and three recent multimodal agent-memory systems at every chain length J in {5, 10, 15, 50} on both Gemini 2.5 Flash and Qwen2.5-VL-7B, with the lead surviving controls for memory-bank size and encoding-position bias, and an asymmetric dual-coding regime in which vision carries the cue end-to-end while the verbal channel plays a smaller query-grounding role.
97. 【2606.27491】SelectAnyTree: A Promptable Instance Segmentation Model for 3D Forest LiDAR Point Clouds
Link: https://arxiv.org/abs/2606.27491
Authors: Trung Thanh Nguyen,Daniel Lusk,Kilian Gerberding,Janusch Vajna-Jehle,Tuan-Anh Vu,Duc Viet Le,Tu Vo,Phi Le Nguyen,Yasutomo Kawanishi,Takahiro Komamizu,Ichiro Ide,Julian Frey,Teja Kattenborn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Automated instance segmentation, Automated instance, forest monitoring moves, moves toward scalable, LiDAR point clouds
Comments:
Abstract:Automated instance segmentation of forest LiDAR point clouds is increasingly critical as forest monitoring moves toward scalable, detailed, 3D measurement. Yet, progress is constrained by label scarcity for tree instances; a single hectare can hold millions of points and hundreds of overlapping, complex crowns, making manual annotation from scratch with raw data laborious and error-prone. Annotations are often corrected from automatic pre-segmentations, but remain costly as these provide no interactive or AI-assisted refinement. Inspired by the promptable paradigm of foundation segmentation models, we propose SelectAnyTree, a promptable instance segmentation model that delineates any individual tree in a 3D forest point cloud from a few clicks. It introduces two key components: Click-to-query prompt encoder and Canopy Height Model (CHM)-guided first prompt. The former turns each click into a single content query, encoding its 3D position and positive/negative polarity together with a pooled local backbone feature. The latter provides treetops as a geometry- and ecologically guided first prompt without any user input. The resulting prompt query is then decoded into one tree mask by a state-space query decoder to efficiently capture long-range context in large-scale forest scenes with linear-time complexity. We evaluate SelectAnyTree in interactive and instance-level settings across seven diverse forest regions and an independent held-out test dataset, demonstrating strong generalization beyond the training domains. It segments a target tree to 78.2 Intersection over Union (IoU) from a single click, 24.8 points above the strongest promptable baseline, and reaches every accuracy target with the fewest clicks, while using far fewer parameters and less inference time than prior promptable models. The source code is available at this https URL.
98. 【2606.27484】Fine-tuning a multimodal large language model for clinician-grade autism behavioral scoring from short home videos
Link: https://arxiv.org/abs/2606.27484
Authors: Mohammadmahdi Honarmand,Parnian Azizian,Aaron Kline,Kae Nurge,Zerin Nasrin Tumpa,Saimourya Surabhi,Kaitlyn Dunlap,Yang Qian,Ali Kargarandehkordi,Sameer Neupane,Peter Washington,Dennis P. Wall
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Autism spectrum disorder, spectrum disorder, exceeds four years, median age, Autism spectrum
Comments:
Abstract:Autism spectrum disorder (ASD) affects 1 in 31 US children, yet median age at diagnosis exceeds four years. Artificial intelligence pipelines that provide quantified diagnosis using easy-to-access observational data (e.g., home videos) could help with earlier diagnosis and timely delivery of early treatments. We fine-tuned Gemini 2.5 Pro on 400 clinician-rated home videos with low-rank adaptation, training only on 30 behavioral features previously validated to produce reliable predictions when passed to various ML models. On 99 held-out children (49 ASD, 50 neurotypical), inter-rater reliability with clinicians (per-feature weighted Cohen's kappa) improved by 40% (p<0.001), with 27 of 28 evaluable features improving. As an emergent zero-shot capability, direct ASD diagnosis F1 improved by 53% (p<0.001), matching or exceeding clinician outcomes. Classifier-assisted pipelines using fine-tuned LLM-derived behavioral features matched clinician-scored inputs across all tested pathways and achieved 77% accuracy (95% CI: 68-85%) and an AUC of 86% (95% CI: 78-92%). Fine-tuned multimodal LLMs can serve as scalable behavioral feature extractors for use in autism assessment and diagnosis.
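The agreement metric named in this abstract, weighted Cohen's kappa, can be sketched as follows. The abstract does not say whether linear or quadratic weights were used, so the `power` parameter (1 for linear, 2 for quadratic) and the toy ratings are assumptions made for illustration, as is the function name.

```python
def weighted_kappa(rater_a, rater_b, num_categories, power=1):
    """Weighted Cohen's kappa for two raters using |i - j| ** power disagreement weights."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must score the same non-empty set of items")
    k = num_categories

    # Observed joint proportions of (rating by A, rating by B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement actually observed vs. expected under independence.
    obs_dis = sum(abs(i - j) ** power * observed[i][j] for i in range(k) for j in range(k))
    exp_dis = sum(abs(i - j) ** power * row[i] * col[j] for i in range(k) for j in range(k))

    if exp_dis == 0.0:
        return 1.0  # Both raters used a single identical category for every item.
    return 1.0 - obs_dis / exp_dis


# Hypothetical scores on a 0-2 ordinal scale for four videos.
clinician = [0, 1, 2, 2]
model = [0, 1, 1, 2]
kappa = weighted_kappa(clinician, model, num_categories=3)
print(round(kappa, 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; the weighting penalizes a one-step disagreement less than a two-step one.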
99. 【2606.27444】SemCityLoc: Aerial 6DoF Localization Using Semantic 3D City Models
Link: https://arxiv.org/abs/2606.27444
Authors: Jingfeng Mao,Xuyang Chen,Qilin Zhang,Oussema Dhaouadi,Guangming Wang,Brian Sheil,Daniel Cremers,Yan Xia,Olaf Wysocki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: precise GNSS signals, precise GNSS, GNSS signals, localization typically relies, radiometrically rich
Comments: accepted by ECCV 2026
Abstract:Aerial 6DoF localization typically relies on precise GNSS signals or radiometrically rich 3D reconstructions, limiting scalability and on-board deployment. We propose SemCityLoc, a semantic-geometric alignment system that reframes aerial pose estimation as structured surface registration between foundation-model-derived visual priors and standardized LoD-compliant 3D city models. Instead of matching sparse contours or dense texture, our method aligns semantic surfaces and monocular depth with lightweight semantic 3D building models, increasing pose discriminability in repetitive and occluded urban environments. To enable accurate evaluation, we introduce SemCityLockeD, the first real-world benchmark combining centimeter-accurate UAV poses with standardized LoD1--LoD3 semantic city models and challenging low-altitude imagery. Experiments demonstrate substantial improvements over existing map-based approaches, improving recall by up to 36% and reducing mean positional error from 9.89m to 2.62m in challenging urban canyons. Our results indicate that semantically structured geometry provides sufficient and scalable constraints for high-precision aerial localization without radiometric scene reconstructions. The code and data are available at this https URL.
100. 【2606.27412】Not All Relations Rotate Alike: Transformation-Aware Decoupling for Viewpoint-Robust 3D Scene Graph Generation
Link: https://arxiv.org/abs/2606.27412
Authors: Jingjun Sun,Chaowei Wang,Zhirui Liu,Jiaxu Tian,Ming Yang,Yaoxing Wang,Shan Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Image and Video Processing (eess.IV)
Keywords: Scene Graph Generation, Graph Generation, compact relational abstraction, Scene Graph, providing a compact
Comments:
Abstract:3D Scene Graph Generation (3DSGG) represents 3D scenes as structured object-relation-object graphs, providing a compact relational abstraction for spatial understanding. In embodied intelligence settings, the same 3D scene may be observed by agents from viewpoints that differ by yaw rotations. However, current 3DSGG models often fail to produce relation predictions that follow the expected transformation behavior under such viewpoint shifts. This behavior reveals an empirical mismatch related to predicate-level transformation heterogeneity: directional predicates such as left, front, right, and behind should transform with the observation frame, whereas most contact, support, and semantic predicates such as standing on and attached to should remain stable. To reduce this mismatch, we propose Transformation-Aware Decoupling (TAD), a viewpoint-robust 3DSGG framework that decouples relation reasoning according to predicate transformation behavior and is supported by viewpoint-stable object representations. TAD decomposes relation reasoning into two parts: one learns cues that should stay stable across viewpoints, while the other learns directional cues that should change with the observation frame. The two parts are merged for standard multi-label predicate prediction. Transformation-specific descriptors and group-aware auxiliary supervision encourage the two branches to capture complementary relation cues. Extensive experiments on 3DSSG show that TAD achieves state-of-the-art robustness under yaw viewpoint changes without training-time rotation augmentation, while maintaining competitive performance under the standard benchmark. The project page is available at this https URL.
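The distinction between directional and viewpoint-stable predicates can be illustrated with a small sketch. It is not the TAD model; it only shows why a label such as "right" must change when the observer yaws while a contact relation such as "standing on" does not depend on yaw at all. The axis convention (observer frame with x to the right and y forward, yaw measured counter-clockwise) and the function name are assumptions.

```python
import math


def direction_in_view(offset_xy, yaw_deg):
    """Directional predicate for a subject-to-object offset, as seen by an observer with the given yaw."""
    yaw = math.radians(yaw_deg)
    x, y = offset_xy

    # Express the world-frame offset in the observer frame (rotate by -yaw).
    x_view = math.cos(yaw) * x + math.sin(yaw) * y
    y_view = -math.sin(yaw) * x + math.cos(yaw) * y

    # Pick the dominant axis in the observer frame.
    if abs(x_view) >= abs(y_view):
        return "right" if x_view > 0 else "left"
    return "front" if y_view > 0 else "behind"


# Hypothetical offset: the object lies one metre along world +x from the subject.
offset = (1.0, 0.0)
labels = [direction_in_view(offset, yaw) for yaw in (0, 90, 180, 270)]
print(labels)
```

The same pair of objects receives four different directional labels across the four yaw angles, which is the expected transformation behavior that the abstract says current models often fail to follow.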
101. 【2606.27385】RANSAC Scoring Done Right
Link: https://arxiv.org/abs/2606.27385
Authors: James Pritts,Felix Seegräber,Kevin Köser
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: variants score candidate, score candidate models, summing per-point scores, RANSAC variants score, inlier scale
Comments: pre-print
Abstract:The most widely used RANSAC variants score candidate models by counting inliers or summing per-point scores that saturate beyond a residual threshold. Every such score requires a user-supplied parameter that is a function of the inlier scale, which must itself be estimated from contaminated data. We remove this dependence by reversing the usual order of inference: rather than estimating the scale and then scoring against it, we marginalize the inlier scale analytically in closed form under a conjugate Inverse-Gamma prior for a fixed inlier partition, then optimize over partitions. A single closed-form expression spans the non-informative Jeffreys limit and informative empirical-Bayes priors, so the same score adapts across data-rich and data-scarce regimes without any change to the algorithm. The proposed RANSAC score is the first in which the inlier scale is genuinely absent from the formula. The score admits O(N log N) computation via sort-and-sweep. On a benchmark of nearly 70 000 image pairs spanning different two-view estimation problems and both engineered and learned feature pipelines, the proposed score exceeds the state of the art (RANSAC, MSAC, GaU, MAGSAC): it stays nearly flat under threshold miscalibration where baselines degrade, reaches near-optimal accuracy from as few as two validation pairs where baselines need on the order of 100 times more, and tightens its prior regularization as validation data grows scarce.
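The threshold dependence that the abstract criticizes is easy to see in the two classical scores it mentions. The sketch below implements plain inlier counting (RANSAC) and the truncated squared-residual cost (MSAC); it does not implement the paper's marginalized score, whose exact formula is not given in the abstract. The residual values are hypothetical.

```python
def ransac_score(residuals, threshold):
    """Inlier count: number of residuals below the threshold (higher is better)."""
    return sum(1 for r in residuals if abs(r) < threshold)


def msac_cost(residuals, threshold):
    """MSAC cost: squared residuals truncated at threshold ** 2 (lower is better)."""
    return sum(min(r * r, threshold * threshold) for r in residuals)


# Hypothetical residuals of one candidate model.
residuals = [0.1, 0.2, 0.4, 1.5, 3.0]

# The same model gets a different score for every choice of threshold.
for t in (0.3, 1.0, 2.0):
    print(t, ransac_score(residuals, t), msac_cost(residuals, t))
```

Because both functions take `threshold` as an input, ranking candidate models with them requires a good estimate of the inlier scale beforehand, which is the dependence the paper sets out to remove.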
102. 【2411.19594】TOrtho-Gaussian: Splatting True Digital Orthophoto Maps
Link: https://arxiv.org/abs/2411.19594
Authors: Xin Wang,Wendi Zhang,Hong Xie,Haibin Ai,Qiangqiang Yuan,Zongqian Zhan
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Geographic Information Systems, True Digital Orthophoto, Digital Orthophoto Maps, Information Systems, Geographic Information
Comments: This work has been submitted to the IEEE Transactions on Geoscience and Remote Sensing for possible publication
Abstract:True Digital Orthophoto Maps (TDOMs) are essential products for digital twins and Geographic Information Systems (GIS). Traditionally, TDOM generation involves a complex set of photogrammetric processes, whose results may deteriorate due to various challenges, including an inaccurate Digital Surface Model (DSM), degenerate occlusion detection, and visual artifacts in weak-texture regions and on reflective surfaces. To address these challenges, we introduce TOrtho-Gaussian, a novel method inspired by 3D Gaussian Splatting (3DGS) that generates TDOMs through orthogonal splatting of optimized anisotropic Gaussian kernels. More specifically, we first simplify the orthophoto generation by orthographically splatting the Gaussian kernels onto 2D image planes, formulating a geometrically elegant solution that avoids the need for explicit DSM and occlusion detection. Second, to produce TDOMs of large-scale areas, a divide-and-conquer strategy is adopted to optimize memory usage and time efficiency of training and rendering for 3DGS. Lastly, we design a fully anisotropic Gaussian kernel that adapts to the varying characteristics of different regions, particularly improving the rendering quality of reflective surfaces and slender structures. Extensive experimental evaluations demonstrate that our method outperforms existing commercial software in several aspects, including the accuracy of building boundaries and the visual quality of low-texture regions and building facades. These results underscore the potential of our approach for large-scale urban scene reconstruction, offering a robust alternative for enhancing TDOM quality and scalability.
103. 【2606.28163】Enhanced Neural Video Representation Compression across Extreme Complexity and Quality Scales
Link: https://arxiv.org/abs/2606.28163
Authors: Ho Man Kwan,Tianhao Peng,Fan Zhang,Mike Nilsson,Andrew Gower,David Bull
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Implicit neural representations, Implicit neural, competitive rate-distortion performance, rate-distortion performance alongside, performance alongside rapid
Comments:
Abstract:Implicit neural representations (INRs) have recently emerged as a promising approach to video compression, delivering competitive rate-distortion performance alongside rapid decoding. However, existing neural video codecs struggle to balance complexity and scalability. Lightweight models often suffer from degraded compression performance when scaled to different bitrate/quality levels, whereas high-performance models exhibit limited scalability, as their model complexity typically increases with quality. This lack of a unified architecture capable of maintaining consistent complexity across a wide range of bitrates severely limits their diverse real-world deployment. To address these challenges, we introduce NVRC++, a novel INR-based video codec that utilizes a lightweight INR with multiple high-resolution feature grids, providing high scalability at any given complexity level. This is paired with an optimization framework that enables efficient overfitting on high-resolution grids for long video sequences, thereby exploiting spatio-temporal redundancies without prohibitive computational or memory overhead. Additionally, an advanced entropy model is designed for efficiently compressing the high-dimensional grid parameters. As a result, NVRC++ provides four complexity levels (from 7kMACs/pixel to 360kMACs/pixel), each spanning wide bitrate and quality ranges while supporting real-time decoding. The experimental results show that NVRC++ offers a much faster decoding speed (up to 7.6x) compared to the SOTA INR-based video codec, NVRC, while delivering comparable performance.
104. 【2606.28136】Differentiable design of the PIAA-ZWFS: a flexible wavefront sensor that approaches the fundamental limit
Link: https://arxiv.org/abs/2606.28136
Authors: A. K. Taras,S. Y. Haffert,L. Desdoigts
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Keywords: Extreme adaptive optics, high contrast astronomy, Zernike wavefront sensor, Extreme adaptive, adaptive optics systems
Comments: Submitted to Astronomy & Astrophysics (A&A)
Abstract:Extreme adaptive optics (AO) is necessary for high contrast astronomy at scales of the habitable zone of nearby systems. We seek to evaluate wavefront sensors that approach fundamental limits of wavefront sensing, enabling adaptive optics systems to run faster or on fainter targets. We present the phase-induced amplitude apodisation Zernike wavefront sensor (PIAA-ZWFS): an adaptation of the conventional Zernike wavefront sensor (ZWFS) that leverages lossless apodisation of the pupil to concentrate the starlight in the focal plane. We optimise and evaluate the sensor with a differentiable modelling framework, drawing on concepts from Bayesian experimental design to minimise the variance of a maximum likelihood estimator that uses the system in the high Strehl regime. Our architecture shows state-of-the-art performance in simulation for different apertures, bandwidths, photon fluxes and source sizes, closing the gap to the fundamental limit by a factor of 10 (2.5) compared to the conventional ZWFS (optimised ZWFS) in a typical photon-limited case. For extended sources, we show that even an ideal point source sensor rapidly becomes sub-optimal, and our system outperforms it for stellar diameters larger than 0.8 λ/D. We verify that these gains do not come at the cost of dynamic range with either linear or non-linear reconstructors. Finally, we present a proof that there must be a trade-off between the information gained about amplitude and phase errors for any wavefront sensor. The PIAA-ZWFS is a viable wavefront sensor operating near the fundamental sensitivity limits.
105. 【2606.28027】MLVC: Multi-platform Learned Video Codec for Real-World Deployment
Link: https://arxiv.org/abs/2606.28027
Authors: Tanel Pärnamaa,Martin Lumiste,Ardi Loot,Evgenii Indenbom,Andrei Znobishchev,Ando Saabas
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: high computational cost, surpassed classical codecs, neural video codec, computational cost, surpassed classical
Comments: Accepted to ECCV 2026
Abstract:Neural video codecs have surpassed classical codecs in coding efficiency but remain impractical for deployment due to cross-platform incompatibility and high computational cost. Existing quantization-based solutions fail to produce deterministic results across diverse hardware platforms, leading to catastrophic decoding failures. We introduce MLVC, a hardware-robust neural video codec designed for practical cross-platform inference. The key idea is to explicitly transmit scale parameters through the hyperprior, which guarantees entropy coding consistency across devices without requiring bit-exact arithmetic. While this increases bitrate overhead, we recover most of the coding efficiency through architectural improvements (gated memory, ReGLU activation), a long-term reference recovery mechanism, and domain-specific perceptual training. On the VCD video conferencing benchmark, MLVC achieves 70% BD-rate (MOS) improvement over hardware HEVC, the strongest deployable baseline, while reaching subjective quality competitive with DCVC-RT, which cannot operate across diverse platforms. Both the encoder and decoder run at 100 FPS on average on commodity NPUs from Apple, Intel, and Qualcomm. MLVC is the first neural video codec to combine competitive compression performance, real-time speed, and cross-platform robustness across diverse consumer devices, making it suitable for widespread deployment. Code will be released.
106. 【2606.27612】Enhancing Co-packaging Optics Enabled Silicon Photonics Security Assurance Hardware Fingerprinting
Link: https://arxiv.org/abs/2606.27612
Authors: Liton Kumar Biswas,M Shafkat M Khan,Himanandhan Reddy Kottur,Hao Wang,Hamed Dalir,Navid Asadizanjani
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: greatly improving data, improving data communication, data communication bandwidth, standard semiconductor processes, photonics enables integration
Comments: Author manuscript version of paper published in IMAPSource Proceedings 2025. Final published version available through IMAPS. 6 pages
Abstract:Silicon photonics enables integration of optical components using standard semiconductor processes, greatly improving data communication bandwidth and energy efficiency. However, photonic integrated circuits (PICs) face unique security challenges, such as counterfeit or tampering threats, that conventional electronic security methods do not address. We propose a novel hardware fingerprinting technique that embeds two-dimensional photonic crystal (PhC) patterns into the density control filler regions of a PIC. Each PhC pattern is designed to resonate at specific visible to near-infrared wavelengths, producing a distinctive optical signature (based on wavelength, polarization, and incident angle) for each device. Finite-difference time-domain (FDTD) simulation using ANSYS Lumerical is employed to optimize nanostructure dimensions and spacing so that each device's reflection/absorption spectrum contains unique narrowband peaks. No extra fabrication steps or materials are required beyond standard lithography, keeping costs low. The embedded nanostructures have sub-50nm precision, making forgery extremely difficult. Our method yields a high-resolution, scalable fingerprint for silicon photonic chips, enabling cost-effective device authentication and improved supply chain security.

