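This post presents the daily list of the latest papers fetched from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

775 papers were updated today, including:

  • Natural Language Processing: 115
  • Information Retrieval: 20
  • Computer Vision: 123

Natural Language Processing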

1. 【2606.11189】A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Link: https://arxiv.org/abs/2606.11189

Authors: Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Supervised fine-tuning, typically maximizes, demonstrated trajectory, maximizes the likelihood, SFT

Abstract:Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT which constructs the training objective directly from the desired target distribution. This method consistently outperforms across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
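
The two choices named in the abstract above (how much weight the observed token gets, and how the remaining mass is spread over alternatives) can be illustrated with a small sketch. The abstract does not give the paper's actual construction of Q or the Target-SFT objective, so everything here is an assumption for illustration: `alpha` is a hypothetical reliance weight, and the leftover mass is split in proportion to a hypothetical model prior.

```python
import math

def q_target(observed, alpha, prior):
    """Build a token-level target distribution (illustrative assumption).

    alpha      : mass placed on the observed token (choice 1)
    1 - alpha  : mass spread over the other tokens in proportion to `prior` (choice 2)
    """
    rest = sum(p for i, p in enumerate(prior) if i != observed)
    if rest == 0.0:
        # No alternative has prior mass, so fall back to the one-hot target.
        return [1.0 if i == observed else 0.0 for i in range(len(prior))]
    return [
        alpha if i == observed else (1.0 - alpha) * p / rest
        for i, p in enumerate(prior)
    ]

def cross_entropy(q, model_probs):
    """Cross-entropy of the model's token distribution against target q."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, model_probs) if qi > 0)

prior = [0.5, 0.3, 0.2]                              # toy model prior over 3 tokens
soft = q_target(observed=0, alpha=0.8, prior=prior)  # softened target
hard = q_target(observed=0, alpha=1.0, prior=prior)  # one-hot target
loss = cross_entropy(soft, prior)
```

With `alpha = 1.0` the target collapses to the one-hot label and the loss reduces to the usual negative log-likelihood of the observed token; smaller values leave room for the alternatives.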

2. 【2606.11176】Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Link: https://arxiv.org/abs/2606.11176

Authors: Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu, Philip Torr, James Zou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: turn raw information, data journalist job, shape society, non-experts can trust, turn raw

Comments: Project page: [this https URL](https://data2story.github.io); GitHub: [this https URL](https://github.com/QinghongLin/data2story-skill)

Abstract:Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at this https URL.

3. 【2606.11167】Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Link: https://arxiv.org/abs/2606.11167

Authors: Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: speak simultaneously, Full-duplex spoken dialogue, listen and speak, promising architecture, architecture for natural

Abstract:Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

4. 【2606.11127】Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

Link: https://arxiv.org/abs/2606.11127

Authors: Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: holistic LLM judges, Synthetic post-training pipelines, commonly filter generated, practices remain rarely, remain rarely examined

Abstract:Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

5. 【2606.11119】TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2606.11119

Authors: Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Reinforcement learning, language models, learning with verifiable, approach for enhancing

Comments: 32 pages, 12 figures, 6 tables

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
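
The idea of spending rollouts on anchors "most likely to yield mixed terminal rewards" can be sketched as follows. The abstract does not state the scoring rule, so this sketch assumes one: for an anchor with predicted success probability p, the chance that k independent rollouts contain both a success and a failure is 1 - p^k - (1 - p)^k. The anchor names and probabilities are made-up toy values, and the plain top-n selection is a simplification, not the paper's allocation procedure.

```python
def mixed_outcome_prob(p_success, k):
    """Chance that k independent rollouts include both a success and a failure."""
    return 1.0 - p_success ** k - (1.0 - p_success) ** k

def select_anchors(predicted, k, budget):
    """Give k rollouts each to the anchors with the highest mixed-outcome chance.

    predicted : dict mapping an anchor (prompt root or turn-level prefix)
                to its predicted success probability
    budget    : total number of rollouts available
    """
    n_anchors = budget // k
    ranked = sorted(
        predicted,
        key=lambda anchor: mixed_outcome_prob(predicted[anchor], k),
        reverse=True,
    )
    return ranked[:n_anchors]

# Hypothetical predictor outputs for two prompt roots and two prefixes.
predicted = {
    "prompt_A": 0.98,          # almost always solved: little reward contrast
    "prompt_B/turn_2": 0.50,   # coin flip: high contrast
    "prompt_C": 0.02,          # almost never solved: little reward contrast
    "prompt_C/turn_1": 0.30,
}
chosen = select_anchors(predicted, k=4, budget=8)
```

Anchors with p near 0 or 1 score close to zero, which matches the abstract's point that overly simple or overly complex prompts give low-variance feedback.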

6. 【2606.11105】PhantomBench: Benchmarking the Non-existential Threat of Language Models

Link: https://arxiv.org/abs/2606.11105

Authors: Haeji Jung, Hila Gonen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generate factually ungrounded, factually ungrounded responses, pose serious risks, generate factually, ungrounded responses

Abstract:Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

7. 【2606.11082】The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

Link: https://arxiv.org/abs/2606.11082

Authors: Hakan Mehmetcik

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: sustained adversarial conditions, Cerulean Sea Crisis, Eastern Mediterranean conflicts, study investigates cross-lingual, investigates cross-lingual distributional

Comments: 25 pages, 2 figures, 6 tables, Research Article

Abstract:This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.

8. 【2606.11079】VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

Link: https://arxiv.org/abs/2606.11079

Authors: Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu

Categories: Computation and Language (cs.CL)

Keywords: interactive agent development, remains a critical, critical bottleneck, agent development, Evaluation

Abstract:Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

9. 【2606.11078】A History-Aware Visually Grounded Critic for Computer Use Agents

Link: https://arxiv.org/abs/2606.11078

Authors: Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, complex Graphical User, Computer Use Agents, User Interface, Graphical User

Comments: Code: [this https URL](https://github.com/G-JWLee/HiViG)

Abstract:Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

10. 【2606.11074】Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

Link: https://arxiv.org/abs/2606.11074

Authors: Peiqi Jia (1), Haonan Jia (2), Ziqi Miao (2), Linkang Du (1), Yuntao Wang (1), Zhou Su (1) ((1) Xi'an Jiaotong University, (2) Beihang University)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Large Language, Language Models, Multimodal Large

Abstract:With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

11. 【2606.11070】T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Link: https://arxiv.org/abs/2606.11070

Authors: Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh, Houhan Lu, Nadia Bathaee, Sriharsha Hatwar, Paresh Dashore, Anmol Jain, Kshitij Tayal, Xiuzhu Lin, Anirban Das, Sambit Sahu, Shi-Xiong Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: enabled increasingly capable, large language models, increasingly capable agentic, capable agentic systems, Recent advances

Comments: Preprint

Abstract:Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

12. 【2606.11052】Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Link: https://arxiv.org/abs/2606.11052

Authors: Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li, Yingfa Chen, Huiming Wang, Zhijiang Guo

Categories: Computation and Language (cs.CL)

Keywords: hybrid linear-attention models, supervised fine-tuning, systematically degrades long-context, degrades long-context recall, linear-attention models

Comments: 28 pages

Abstract:Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from $67.2\%$ to $9.4\%$. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections ($W_Q, W_K$) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from $65.4\%$ to $76.4\%$ while maintaining strong reasoning performance.
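
The basic restore step described above (take $W_Q$ and $W_K$ from the pre-SFT checkpoint, keep everything else from the post-SFT checkpoint) amounts to a selective merge of two parameter dictionaries. The sketch below uses plain Python dicts in place of real checkpoints; the parameter names and the `q_proj` / `k_proj` substrings are hypothetical naming conventions assumed for illustration, and the Procrustes variant is not covered.

```python
def qk_restore(pre_sft, post_sft, qk_markers=("q_proj", "k_proj")):
    """Return post-SFT parameters with query/key projections taken from pre-SFT.

    pre_sft, post_sft : dicts mapping parameter names to weights
    qk_markers        : substrings assumed to identify W_Q and W_K parameters
    """
    restored = dict(post_sft)  # start from the fine-tuned parameters
    for name, weight in pre_sft.items():
        if name in restored and any(marker in name for marker in qk_markers):
            restored[name] = weight  # roll this projection back
    return restored

# Toy checkpoints: one attention layer with q/k/v projections.
pre_sft = {
    "layers.0.attn.q_proj.weight": [1.0],
    "layers.0.attn.k_proj.weight": [2.0],
    "layers.0.attn.v_proj.weight": [3.0],
}
post_sft = {
    "layers.0.attn.q_proj.weight": [10.0],
    "layers.0.attn.k_proj.weight": [20.0],
    "layers.0.attn.v_proj.weight": [30.0],
}
merged = qk_restore(pre_sft, post_sft)
```

No gradient step is involved, which is consistent with the abstract's description of the method as training-free.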

13. 【2606.11046】Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

Link: https://arxiv.org/abs/2606.11046

Authors: Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV, Furong Huang, Amrit Singh Bedi, Alvaro Velasquez

Categories: Computation and Language (cs.CL)

Keywords: multi-step task performance, improve multi-step task, task performance, LLMs are increasingly, increasingly converted

Abstract:Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.
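
Behavioral drift measured by KL divergence can be illustrated on a single next-token distribution. The abstract does not say which direction of KL is used or how it is aggregated over prompts, so the direction below (reasoning model relative to the instruction-tuned baseline) and the toy probabilities are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(
        pi * math.log(pi / max(qi, eps))  # eps guards against log of zero
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Toy next-token distributions over a 2-token vocabulary (made-up values).
baseline = [0.5, 0.5]    # instruction-tuned model
reasoning = [0.9, 0.1]   # reasoning model derived from it

drift = kl_divergence(reasoning, baseline)
no_drift = kl_divergence(baseline, baseline)
```

A drift of zero means the two models agree on this distribution; larger values indicate the converted model has moved further from the baseline's behavior.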

14. 【2606.11033】AuRA: Internalizing Audio Understanding into LLMs as LoRA

Link: https://arxiv.org/abs/2606.11033

Authors: Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: cascaded ASR-LLM pipelines, ASR-LLM pipelines, Recent efforts, inputs typically rely, extend large language

Abstract:Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.

15. 【2606.11023】Generative Archetype-Grounded Item Representations for Sequential Recommendation

Link: https://arxiv.org/abs/2606.11023

Authors: Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen, Yankai Chen, Wenhao Yu, Jianting Chen, Irwin King

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: aims to predict, predict users', analyzing their historical, Sequential recommendation aims, Archetype-grounded Item Representations

Comments: Accepted by WWW 2026 (Oral)

Abstract:Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models (LLMs) can provide rich semantic representations, existing approaches only rely on static encoding of fixed attributes, overlooking the crucial role of target audiences in defining item identity. Moreover, the semantic space struggles to reflect actual user behavior, resulting in a significant gap between semantic representations and behavioral patterns. To address these limitations, we propose GenAIR, a general framework that empowers sequential recommendation with Generative Archetype-grounded Item Representations. Specifically, we first leverage an LLM to analyze item metadata and infer textual description of the Archetype, which represents the conceptual profile of the item's ideal target audience. We then extract the corresponding embeddings in a single forward pass. Further, to ground these generative archetypes in real-world behavior, we introduce a behavioral calibration objective, which explicitly incorporates behavioral signals from actual interactions. This objective adjusts the structure of the embedding space to reflect empirical patterns. GenAIR enables seamless integration with most existing models while maintaining high efficiency. Comprehensive experiments conducted on three real-world datasets demonstrate that GenAIR significantly improves the performance of various sequential recommendation models and consistently outperforms state-of-the-art baseline approaches. Implementation codes are available at this https URL.

16. 【2606.11018】Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

Link: https://arxiv.org/abs/2606.11018

Authors: Maria Milkova, Maksim Rudnev

Categories: Computation and Language (cs.CL)

Keywords: Measuring subjective constructs, naturally occurring social, Measuring subjective, occurring social media, media text requires

Abstract:Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English social media posts annotated according to Schwartz's theory of basic human values, we investigate how different LLMs, prompts, and instruction languages operationalize the expression of values in text. We argue that although texts may permit multiple plausible interpretations, theory-based value definitions can constrain interpretations and reduce spurious value attributions. Beyond precision, recall, and F1, we evaluate structural alignment between values, error structure, confidence-ambiguity relations, and annotation stability. We show that different LLMs produce different value interpretations. Iterative prompt calibration through error analysis reduces misattributions and improves alignment with expert annotations. We also derive targeted expert verification rules from recurrent error structures and use them during corpus annotation. Finally, we show that LLM annotations can be transferred to an encoder model through soft-label training, retaining theory-based value interpretations and information about uncertainty in value expression.

17. 【2606.11009】Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions

Link: https://arxiv.org/abs/2606.11009

Authors: Parisa Suchdev, Juniper Lovato

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: math word problems, Large language models, English math word, Large language, adapt math word

Comments: 17 pages total with references and appendix, 9 figures, under review

Abstract:Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and reveal which cultural entities models treat as most salient. We analyze how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into Bengali, Hindi, Punjabi (India), Urdu, Sindhi (Pakistan), Italian, and Sicilian (Italy), a language set spanning the full resource spectrum, from high-resource Italian and Hindi to under-studied Sindhi, Sicilian, and Punjabi. We annotate 6,489 entity transformations, coding whether models preserve, localize, generalize, omit, or change entities such as names, foods, and places. Models agree on transformation type in 62.5% of cases and on specific substitutions in only 33.5%, meaning model choice directly shapes which cultural world students encounter. All 21 language-model combinations show entropy collapse, with adaptation compressing rather than expanding cultural diversity. Models prioritize surface markers such as names, foods, and currencies while preserving deeper structural features such as grade-level systems that embed culturally specific assumptions. Despite prompts specifying target countries, models misattribute regional context by using Bangladeshi taka for Indian Bengali students and produce cross-cultural contamination, such as adapting egg hunts as Eid activities. Some failures are visible in individual translations. Others, including diversity collapse, systematic preference for surface markers, and consistent regional misattribution, emerge only through corpus-level analysis. The surface plausibility that makes adapted problems look correct is precisely what makes deeper failures easy to overlook.
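
"Entropy collapse" can be made concrete by comparing the Shannon entropy of the entities in the source problems with the entropy of the entities after adaptation. The abstract does not specify the exact measure, so the sketch below assumes plain Shannon entropy over entity frequencies, and the entity lists are invented toy data rather than the paper's annotations.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution over `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical entities before and after adaptation (toy example only).
source_entities = ["pizza", "baseball", "Thanksgiving", "dollar"]
adapted_entities = ["roti", "cricket", "cricket", "rupee"]

before = shannon_entropy(source_entities)   # four distinct entities
after = shannon_entropy(adapted_entities)   # one entity now repeats
collapsed = after < before
```

Lower entropy after adaptation means the adapted problems draw on a narrower set of entities than the originals did.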

18. 【2606.10956】Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Link: https://arxiv.org/abs/2606.10956

Authors: Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, professional-grade productivity software, Large Language, deployment of Large, Computer Rank Examination

Notes: 21 pages, 5 figures

Click to view abstract

Abstract:The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
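
Score Rate, as the abstract defines it, is the mean percentage of rubric points earned across tasks. A minimal sketch of that arithmetic, assuming each task is scored out of 100 points; the function name and the example scores are hypothetical:

```python
def score_rate(task_scores, max_points=100):
    """Mean percentage of rubric points earned across tasks."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return 100.0 * sum(s / max_points for s in task_scores) / len(task_scores)

# Hypothetical per-task rubric scores, not results from the paper.
example_scores = [80, 45, 60, 15]
sr = score_rate(example_scores)
```

Because every task uses the same 100-point scale, the mean of per-task percentages equals the pooled percentage of points earned.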

19. 【2606.10932】Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

Link: https://arxiv.org/abs/2606.10932

Authors: Chirag Shinde

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Density Field State, present Density Field, Field State Space, Density Field, Field State

Notes: 16 pages, 6 figures, 7 tables. Code available at [this https URL](https://github.com/cs-cmyk/df-ssm)

Click to view abstract

Abstract:We present Density Field State Space Models (DF-SSM), a framework for compressing SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, we achieve a 278 MB model (9.7x smaller than the 2.7 GB FP16 teacher) that runs inference 21.4x faster on GPU (batch=1, relative to the mamba-ssm reference implementation) while maintaining downstream task performance within 2-4 percentage points of BitMamba-2, a 1.58-bit model trained from scratch on 150B tokens. The distillation itself requires only 32M tokens and 6 hours on a single A100 GPU, though it presupposes a pretrained FP16 teacher. We develop an optimized inference pipeline combining cuBLAS INT8 tensor cores for the scaffold matmul, custom CUDA kernels for stateful SSM and convolution operations, and an AVX-512 CPU backend for efficient deployment on both GPU and CPU. Beyond compression, we investigate the internal knowledge organization of the resulting model, discovering three distinct processing phases: intent classification (layers 0-3, operating in an abstract space with no vocabulary alignment), knowledge retrieval (layers 25-35, where factual associations localize to a 5-layer window), and output formatting (layers 36-47, where category structure dissolves). Through systematic analysis of 445 factual prompts across 19 categories, we find that early-layer classification is syntactic (driven by template structure) rather than semantic, and that the model exhibits well-organized knowledge representations despite weak factual recall, suggesting that representational structure may precede factual strength.

20. 【2606.10931】It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Link: https://arxiv.org/abs/2606.10931

Authors: Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott, Rada Mihalcea

Categories: Computation and Language (cs.CL)

Keywords: offensive statements, Relative Policy Optimization, Group Relative Policy, toxic and offensive, Warning

Notes

Click to view abstract

Abstract:Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.

21. 【2606.10921】Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

Link: https://arxiv.org/abs/2606.10921

Authors: Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang, Wenjie Zhang

Categories: Computation and Language (cs.CL)

Keywords: requires large language, large language models, cross-part evidence connections, requires large, language models

Notes

Click to view abstract

Abstract:Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization, document-structure-aware and experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%.

22. 【2606.10875】Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

Link: https://arxiv.org/abs/2606.10875

Authors: Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, multi-step execution due, insufficient tool-related knowledge, ineffective knowledge activation

Notes

Click to view abstract

Abstract:Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting the LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose Knowledge-Augmented Tool Execution (KATE), a framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our code is available online.

23. 【2606.10860】Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

Link: https://arxiv.org/abs/2606.10860

Authors: Lena S. Bolliger, Lena A. Jäger

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Production LLMs receive, LLMs receive instructions, Production LLMs, uniform architectural privilege, LLMs receive

Notes

Click to view abstract

Abstract:Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.
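
The count of ten priority relations follows from choosing 2 of the 5 levels. The sketch below enumerates those pairs and shows what distance-scaled offsets might look like. The two offset functions are hypothetical: the abstract names a linear and a bilateral schedule but gives no formulas, so the functional forms, the `base` parameter, and the convention that level 0 is the most privileged are all assumptions made for illustration.

```python
from itertools import combinations

K = 5  # number of privilege levels; assume level 0 is the most privileged

def priority_pairs(k):
    """All (higher, lower) privilege pairs a compliant model must respect."""
    return list(combinations(range(k), 2))

def linear_offset(hi, lo, base=1.0):
    """Hypothetical linear schedule: offset grows with the privilege gap."""
    return base * (lo - hi)

def bilateral_offset(hi, lo, k, base=1.0):
    """Hypothetical bilateral schedule: the gap, scaled by how privileged
    the overridden (victim) level is."""
    victim_weight = (k - hi) / k
    return base * (lo - hi) * victim_weight

pairs = priority_pairs(K)  # 10 pairs for K = 5
```

Under these assumed forms, a one-level violation against the top level (`bilateral_offset(0, 1, K)`) is penalized more than the same one-level gap near the bottom (`bilateral_offset(3, 4, K)`), which is the kind of asymmetry a victim-aware schedule is meant to express.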

24. 【2606.10852】Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

Link: https://arxiv.org/abs/2606.10852

Authors: Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: explicit lies, fabricated claims, strategic concealment, evaluated through direct, direct markers

Notes

Click to view abstract

Abstract:LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.

25. 【2606.10842】ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

Link: https://arxiv.org/abs/2606.10842

Authors: Taiheng Pan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: opt-in token-evidence reranker, describe ConvMemory, lightweight ConvMemory, candidate set, token-evidence reranker

Notes: 19 pages, 3 figures. Single-author technical report. Extends [arXiv:2605.28062](https://arxiv.org/abs/2605.28062) (ConvMemory v1). Code and checkpoint: [this http URL](http://github.com/pth2002/ConvMemory)

Click to view abstract

Abstract:We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, n = 4955 test rows), v2 raises FULL MRR from v1's 0.5824 to 0.6560 (paired bootstrap +0.0734, 95% CI [+0.0645, +0.0827]) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai_top500, but on two raw-dense-hard slices (where v1's protected top-10 has higher recall than mxbai's own top-10) v2 exceeds mxbai_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report (arXiv:2605.28062).
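
The "recall-preserving by construction" claim is easy to see in code: if a second stage only reorders the first stage's top-k, the returned set cannot change, while rank-sensitive metrics such as MRR can. This is a generic sketch of that cascade pattern, not the released ConvMemory code; the memory ids and scores are made up.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold item, or 0.0 if it is absent."""
    for rank, mem_id in enumerate(ranked_ids, start=1):
        if mem_id == gold_id:
            return 1.0 / rank
    return 0.0

def rerank_top_k(candidates, scores, k=10):
    """Reorder only the first k candidates by score; membership is unchanged."""
    head, tail = candidates[:k], candidates[k:]
    reordered = sorted(head, key=lambda c: scores.get(c, float("-inf")), reverse=True)
    return reordered + tail

# Hypothetical first-stage ranking and second-stage scores.
v1_ranking = ["m3", "m7", "m1", "m9"]
v2_scores = {"m3": 0.2, "m7": 0.1, "m1": 0.9, "m9": 0.4}
v2_ranking = rerank_top_k(v1_ranking, v2_scores, k=10)

rr_before = reciprocal_rank(v1_ranking, "m1")  # gold item at rank 3
rr_after = reciprocal_rank(v2_ranking, "m1")   # gold item at rank 1
```

Averaging `reciprocal_rank` over all queries gives MRR; Recall@10 and Hit@10 depend only on set membership, so they are identical before and after the rerank.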

26. 【2606.10829】Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Link: https://arxiv.org/abs/2606.10829

Authors: Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reduce inference steps, revealing multiple tokens, Masked diffusion language, denoising iteration, parallelism is fragile

Notes

Click to view abstract

Abstract:Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-\(k\), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-\(k\), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by \(9.11\) and \(10.46\) percentage points on average, respectively, with \(3.1\%\) per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
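
One way to read the greedy discounting rule is sketched below. It is an assumed instantiation, not the authors' formula: the abstract says a candidate is discounted when it attends strongly to already selected, still-uncertain positions, and here that is modeled as confidence minus a `penalty` times the sum of attention weight multiplied by (1 - confidence) over the selected set. The `budget` argument stands in for whatever count the base sampler's stopping rule would choose.

```python
def adas_select(confidence, attention, budget, penalty=1.0):
    """Greedy subset selection with a soft attention discount (illustrative).

    confidence[i]: token-wise confidence in [0, 1] for masked position i.
    attention[i][j]: attention weight from position i to position j.
    """
    selected = []
    remaining = set(range(len(confidence)))

    def discounted(i):
        # Penalize coupling to already selected positions that are still uncertain.
        coupling = sum(attention[i][j] * (1.0 - confidence[j]) for j in selected)
        return confidence[i] - penalty * coupling

    while remaining and len(selected) < budget:
        best = max(sorted(remaining), key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical values: position 1 is confident but attends heavily to position 0.
confidence = [0.7, 0.68, 0.6]
attention = [
    [0.0, 0.1, 0.1],
    [0.9, 0.0, 0.1],
    [0.0, 0.0, 0.0],
]
picked = adas_select(confidence, attention, budget=2)
plain_top2 = adas_select(confidence, attention, budget=2, penalty=0.0)
```

With the discount, position 2 is committed alongside position 0 instead of the more confident but coupled position 1; with `penalty=0.0` the rule reduces to plain confidence ranking.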

27. 【2606.10820】K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

Link: https://arxiv.org/abs/2606.10820

Authors: Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: makes inference memory-bound, memory-bound and inefficient, language modeling paradigm, decoding makes inference, Autoregressive

Notes

Click to view abstract

Abstract:Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.

28. 【2606.10813】RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

Link: https://arxiv.org/abs/2606.10813

Authors: Shuwen Xu, Zhitao He, Yi R. (May) Fung

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Users rely, observe agent behavior, diagnose failures, ensure accountability, rely on execution

Notes

Click to view abstract

Abstract:Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct, a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7-67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6-100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

29. 【2606.10803】Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Link: https://arxiv.org/abs/2606.10803

Authors: Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, utilizing digital APIs

Notes

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

30. 【2606.10796】Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Link: https://arxiv.org/abs/2606.10796

Authors: Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automatic Depression Detection, remains challenging due, labeled data due, multi-topic clinical interviews, computational mental health

Notes

Click to view abstract

Abstract:Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) scarcity of labeled data due to clinical privacy, together with the high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce a Confidence Analysis and Modulation module that quantifies the epistemic reliability of each rationale from its token-level entropy and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses the zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.

31. 【2606.10768】N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

Link: https://arxiv.org/abs/2606.10768

Authors: Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, valid solution paths, Large Language, success of Large, mathematical reasoning relies

Notes: ACL 2026 Findings. 16 pages, 3 figures. Code: [this https URL](https://github.com/ZJUSCL/N-GRPO)

Click to view abstract

Abstract:The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.
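
A toy version of mixing an anchor embedding with its nearest neighbors is sketched below. It is not the released N-GRPO code: the cosine-similarity neighbor search, the fixed `anchor_weight`, the plain averaging of neighbors, and the two-dimensional vocabulary are all assumptions chosen to keep the idea visible.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbor_mix(embeddings, anchor, num_neighbors=2, anchor_weight=0.8):
    """Blend the anchor's embedding with the mean of its nearest neighbors."""
    anchor_vec = embeddings[anchor]
    others = [tok for tok in embeddings if tok != anchor]
    nearest = sorted(
        others, key=lambda tok: cosine(anchor_vec, embeddings[tok]), reverse=True
    )[:num_neighbors]
    neighbor_mean = [
        sum(embeddings[tok][d] for tok in nearest) / len(nearest)
        for d in range(len(anchor_vec))
    ]
    mixed = [
        anchor_weight * a + (1.0 - anchor_weight) * n
        for a, n in zip(anchor_vec, neighbor_mean)
    ]
    return mixed, nearest

# Hypothetical 2-d embedding table.
toy = {"add": [1.0, 0.0], "sum": [0.9, 0.1], "plus": [0.8, 0.2], "cat": [0.0, 1.0]}
mixed, nearest = neighbor_mix(toy, "add")
```

The mixed vector stays close to the anchor and moves only toward semantically similar tokens, in contrast to adding isotropic random noise, which can push the input toward unrelated tokens such as `cat` here.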

32. 【2606.10765】ArabiGEE: A Hierarchical Taxonomy for Arabic Grammatical Error Explanation

Link: https://arxiv.org/abs/2606.10765

Authors: Khaled Elhady, Omar Kallas, Nizar Habash, Bashar Alhafni

Categories: Computation and Language (cs.CL)

Keywords: comprehensive Arabic grammatical, Arabic grammatical error, explicit error types, Arabic grammatical, grounded in explicit

Notes

Click to view abstract

Abstract:We introduce ArabiGEE, the first comprehensive Arabic grammatical error explanation (GEE) taxonomy grounded in explicit error types. Unlike existing GEE approaches that treat explanation generation as free-form text, ArabiGEE organizes grammatical explanations through a hierarchical structure spanning orthographic, morphological, syntactic, and lexical dimensions. The taxonomy consists of 27 error types, 140 correction types, and 324 associated explanations. We apply ArabiGEE to manually annotate portions of existing Arabic grammatical error correction corpora and demonstrate how structured grammatical explanations can support automatic evaluation of LLMs on Arabic GEE. Our code and data are publicly available.

33. 【2606.10740】When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Link: https://arxiv.org/abs/2606.10740

Authors: Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: terminal-score evaluation, largely invisible, invisible to terminal-score, reasoning, failure

Notes: Accepted at the ICML 2026 FAGEN Workshop

Click to view abstract

Abstract:Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
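
The 2x2 labeling can be sketched as a lookup on two booleans per turn. The abstract defines only the context-injection cell explicitly (safe CoT, harmful output); the assignment of the other three cells below is an assumed reading, and the example dialogue is invented.

```python
from collections import Counter

# Keys are (cot_safe, output_safe). Only the (True, False) cell is spelled out
# in the abstract; the other three assignments are assumptions.
CELLS = {
    (True, True): "robust alignment",
    (False, True): "alignment faking",
    (False, False): "overt jailbreak",
    (True, False): "context-injection failure",
}

def label_turn(cot_safe, output_safe):
    """Map one turn's two safety judgments to a matrix cell."""
    return CELLS[(cot_safe, output_safe)]

def cell_rates(turns):
    """Fraction of turns falling in each cell."""
    counts = Counter(label_turn(cot, out) for cot, out in turns)
    return {cell: counts[cell] / len(turns) for cell in CELLS.values()}

# Hypothetical four-turn dialogue.
dialogue = [(True, True), (True, True), (False, True), (True, False)]
rates = cell_rates(dialogue)
```

Tracking these rates per turn, rather than scoring only the final turn, is what lets a trace-level diagnostic surface failures that a terminal refusal rate would hide.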

34. 【2606.10736】Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

Link: https://arxiv.org/abs/2606.10736

Authors: Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: remain largely untapped, Large online, logs remain largely, conversational AI teaching, online courses generate

Notes: Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

Click to view abstract

Abstract:Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

35. 【2606.10725】Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Link: https://arxiv.org/abs/2606.10725

Authors: Olga Shakhmatova, Dmitrii Kriukov, Daniil Larionov, Nikita Khromov, Iaroslav Bespalov, Alexander Zolotarev, Kirill Grishchenkov, Ekaterina Ivanova, Miron Kuznetsov, Ilya Sochenkov, Elizaveta Panchenko, Artem Shelmanov, Dmitry V. Dylov

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: ROC AUC, Background, CVD, risk, CVD patients

Notes: Main paper with appendix; 3 main figures, 3 supplementary figures, multiple tables. O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Corresponding authors: O. Shakhmatova (olga.shahmatova@gmail.com) and D. V. Dylov (d.dylov@skol.tech)

Click to view abstract

Abstract:Background. Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and a major determinant of prognosis. Established AF risk scores rely on factors (older age, hypertension) nearly ubiquitous among patients with cardiovascular disease (CVD), offering limited stratification in this high-risk group. Most target long-term (5-10 year) rather than medium-term prediction. We developed interpretable ML models predicting AF risk over 24-month and entire follow-up horizons in CVD patients using routinely collected hospital data. Methods. Single-center retrospective study of electronic health records from the National Research Cardiology Center (Russia) for patients aged ≥18 with CVD but without pre-existing AF, hospitalized more than once between January 2012 and May 2019. A custom NLP pipeline transformed unstructured discharge reports into 73 structured features, combining a rule-based parser with transformer-based NER. Using LightAutoML we built a full model (73 features), a simple model (reduced subset), and a linear model for a bedside risk score. Performance was assessed by ROC AUC, compared with CHARGE-AF, C2HEST, MHS, and HAVOC, and interpreted via SHAP. Results. Of 80,576 records from 45,000 patients, 17,562 met inclusion criteria; 1,438 (8.19%) developed AF. The full model reached ROC AUC 0.735 (24-month) and 0.696 (entire follow-up); the simple model was nearly identical (0.725, 0.696). All non-linear models outperformed the four clinical risk scores (ROC AUC 0.53-0.64). The simple model uses 13 features and is named Pre-AF 13. SHAP identified age and left atrial volume as dominant predictors. A linear risk score (Pre-AF 9) stratified observed 24-month AF incidence from ~7% to 36%. Conclusion. Interpretable ML models built from routinely collected EHR data identify high-AF-risk CVD patients, outperforming established clinical risk scores.

36. 【2606.10722】Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Link: https://arxiv.org/abs/2606.10722

Authors: Ruixuan Huang, Jinyuan Shi, Hantao Huang, Yifan Huang, Ziyi Guan, Hao Zeng, Ian En-Hsu Yen, Minghui Yu

Categories: Computation and Language (cs.CL)

Keywords: construct channel-sparse large, channel-sparse large language, construct channel-sparse, channel-sparse large, large language models

Comments

Click to view abstract

Abstract:We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
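The bank-wise top-k rule in this abstract (keep 16 channels in every 64-channel bank) can be illustrated with a small NumPy sketch. This is only a toy reading of the abstract, not the authors' implementation: the function name `bankwise_topk_mask`, the random logits standing in for the low-rank predictor's output, and the 256-channel toy width are all assumptions made for illustration.

```python
import numpy as np

def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask that keeps the `keep` largest logits in each bank.

    `logits` has shape (..., channels); channels must be a multiple of bank_size.
    """
    logits = np.asarray(logits, dtype=float)
    if logits.shape[-1] % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    # Split the channel axis into (num_banks, bank_size).
    banks = logits.reshape(*logits.shape[:-1], -1, bank_size)
    # After argpartition, the last `keep` positions hold the indices of the
    # `keep` largest entries in each bank (in arbitrary order).
    top_idx = np.argpartition(banks, bank_size - keep, axis=-1)[..., bank_size - keep:]
    mask = np.zeros_like(banks)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return mask.reshape(logits.shape)

# Toy usage: random numbers stand in for the predictor's routing logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 256))   # 2 tokens, 256 FFN channels (4 banks)
mask = bankwise_topk_mask(logits)
kept_per_token = mask.sum(axis=-1)   # 16 kept channels in each of 4 banks per token
```

Because every bank keeps the same number of channels, the active fraction is fixed at 16/64, which matches the 4x sparsity stated in the abstract and gives a regular pattern per bank.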

37. 【2606.10716】Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

Link: https://arxiv.org/abs/2606.10716

Authors: Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generate rich contextualized, rich contextualized representations, achieved strong performance, largely due, achieved strong

Comments

Click to view abstract

Abstract:Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

38. 【2606.10703】From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Link: https://arxiv.org/abs/2606.10703

Authors: Leonard Engmann, Christian Medeiros Adriano, Holger Giese

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Pearl terms, observed model behaviour, specific computations, associational evidence, rarely tested

Comments: 9 pages, 2 figures, 9 tables. Accepted at the ICML 2026 Workshop on Philosophy of Science Meets Machine Learning (PhilML). Non-archival

Click to view abstract

Abstract:Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance after multiple-comparison correction in any model, with effect sizes below Cohen's $d = 0.17$ across all 60 metric-layer combinations. A per-token routing weight control rules out insufficient power, recovering a single Bonferroni-significant signal at OLMoE's final MoE layer ($d = +0.231$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.

39. 【2606.10694】REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

Link: https://arxiv.org/abs/2606.10694

Authors: Keer Lu, Liwei Chen, Guoqing Jiang, Zhiheng Qin, Yunhuai Liu, Wentao Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, long time horizons, time horizons

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72%.

40. 【2606.10677】Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Link: https://arxiv.org/abs/2606.10677

Authors: Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: track changing facts, provide relevant evidence, track changing, provide relevant, memory

Comments

Click to view abstract

Abstract:Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves 64.7% overall score. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.

41. 【2606.10675】Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

Link: https://arxiv.org/abs/2606.10675

Authors: Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: accurate multilingual word-level, Massively Multilingual Speech, multilingual word-level forced, present a method, method for accurate

Comments: Interspeech 2026

Click to view abstract

Abstract:We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.

42. 【2606.10657】Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

Link: https://arxiv.org/abs/2606.10657

Authors: João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski, Patrick Gallinari, Loic Barrault

Categories: Computation and Language (cs.CL)

Keywords: evaluating pretrained large, pretrained large language, log-likelihood scoring makes, large language models, makes them unreliable

Comments

Click to view abstract

Abstract:Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.
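The "most favorable phrasing" idea in this abstract can be sketched in a few lines of Python. This is a hedged illustration only: the abstract does not say how log-likelihoods are normalized or how paraphrases are produced, so the helper `pick_answer` and the scores below are invented for the example.

```python
def pick_answer(option_scores):
    """Pick the option whose best-scoring paraphrase has the highest log-likelihood.

    `option_scores` maps each answer option to a list of log-likelihoods,
    one per paraphrase of that option (hypothetical numbers in this sketch).
    """
    best = {opt: max(scores) for opt, scores in option_scores.items()}
    return max(best, key=best.get), best

# Made-up log-likelihoods: the first entry of each list is the original phrasing.
scores = {
    "A": [-4.1, -2.0, -3.3],
    "B": [-2.5, -2.9, -3.0],
}

# Standard scoring looks only at the original phrasing of each option.
first_only = max(scores, key=lambda opt: scores[opt][0])

# Paraphrase-aware scoring takes the maximum over phrasings per option.
choice, best = pick_answer(scores)
```

In this toy case the original phrasing of option A happens to be unfamiliar to the model, so single-phrasing scoring selects B, while taking the maximum over paraphrases selects A.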

43. 【2606.10654】Speaker Group Encoding in Self-supervised Speech Recognition Models

Link: https://arxiv.org/abs/2606.10654

Authors: Felix Herron, Solange Rossato, Alexandre Allauzen, Benoit Favre, François Portet

Categories: Computation and Language (cs.CL)

Keywords: speech recognition models, self-supervised speech recognition, speech recognition, investigate what self-supervised, recognition models

Comments

Click to view abstract

Abstract:We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including their gender, age, dialect, ethnicity, and whether they are a native speaker. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change to what extent SGI is encoded in S3Ms; however, this is primarily true for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.

44. 【2606.10650】Dynamic Linear Attention

Link: https://arxiv.org/abs/2606.10650

Authors: Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, scalability of Large, Language Models, motivating the adoption

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.

45. 【2606.10646】How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

Link: https://arxiv.org/abs/2606.10646

Authors: Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, recipes typically treat, credit assignment remains, distinguish decisive reasoning, decisive reasoning steps

Comments: 25 pages, 7 figures, 11 tables. Accepted at ICML 2026

Click to view abstract

Abstract:Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.

46. 【2606.10610】Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning

Link: https://arxiv.org/abs/2606.10610

Authors: Eitan Cohen, Idan Simai, Uri Shaham

Categories: Computation and Language (cs.CL)

Keywords: downstream NLP tasks, adapting foundation models, downstream NLP, NLP tasks, Small Data Big

Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become essential for adapting foundation models to downstream NLP tasks. However, current PEFT methods often struggle with robustness to noise and performance degradation on limited training data. We propose SDBN (Small Data Big Noise), a unified framework that brings adversarial training to PEFT, a combination that remains less studied despite its complementary strengths, to enhance model robustness and generalization, outperforming alternative approaches. We also introduce two variants of the method that use discrete uncertainty sets: SDBN-h, which enumerates character-level edits and selects worst-case variants using gradients, and SDBN-p, which uses LLM-generated variants for robust optimization in generative tasks. Experiments across multiple benchmarks reveal substantial improvements, particularly in low-resource settings and under both word-level and character-level corruptions. This framework addresses the less explored intersection of adversarial training and parameter-efficient adaptation, without introducing additional parameters and with only modest computational overhead, making PEFT deployments more reliable in real-world scenarios where data scarcity and linguistic variability often coexist.

47. 【2606.10607】Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

Link: https://arxiv.org/abs/2606.10607

Authors: Xinyu Li, Yuanyuan Wang, Haoxuan Li, Chuan Zhou, Erdun Gao, Bo Han, Tongliang Liu, Kun Zhang, Howard Bondell, Mingming Gong

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: uncover causal structures, Causal discovery aims, aims to uncover, Causal discovery, Large Language Models

Comments

Click to view abstract

Abstract:Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.
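Linear opinion pooling with boundary-triggered reweighting, as described in this abstract, can be sketched for a single candidate edge as follows. This is a minimal illustration under assumptions, not the CEA method itself: the function names, the 0.5 threshold, the 0.1 margin, and the `referee` callable (a stand-in for the LLM meta-referee) are all hypothetical.

```python
def pool(probs, weights):
    """Linear opinion pool: weighted average of the experts' edge probabilities."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total

def decide_edge(probs, weights, referee=None, threshold=0.5, margin=0.1):
    """Decide whether to keep an edge, consulting the referee only near the boundary.

    `referee(probs, weights)` is a hypothetical callable returning new expert
    weights; in the paper's setting this role is played by an LLM.
    """
    pooled = pool(probs, weights)
    if referee is not None and abs(pooled - threshold) < margin:
        weights = referee(probs, weights)
        pooled = pool(probs, weights)
    return pooled >= threshold, pooled

# Three experts disagree about one edge; equal weights land near the boundary.
probs = [0.8, 0.3, 0.3]
keep_plain, pooled_plain = decide_edge(probs, [1, 1, 1])

# A stub referee that trusts the first expert more (made-up weights).
keep_ref, pooled_ref = decide_edge(probs, [1, 1, 1], referee=lambda p, w: [3, 1, 1])
```

With equal weights the pooled confidence is about 0.47 and the edge is dropped; after the stub referee upweights the first expert, the pooled confidence rises to 0.6 and the edge is kept.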

48. 【2606.10581】ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

Link: https://arxiv.org/abs/2606.10581

Authors: Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: sufficiently competent spoken-dialogue, competent spoken-dialogue assistant, Current Speech Language, Speech Language Models, Speech carries

Comments

Click to view abstract

Abstract:Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from $14.6\%$ to $40.3\%$ and improves EchoMind average rating from $3.27$ to $3.92$. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within $0.4$ points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.

49. 【2606.10569】Hidden Consensus: Preference-Validity Compression in Human Feedback

Link: https://arxiv.org/abs/2606.10569

Authors: Dorcas Chia Ern Chua, Karen Myn Hui Lee, Jia Yue Tan, Zhen Xue Gue, Norzalena Abdul Hamid, Azima Binti Azmi, Keat Mei Yeong, Aizat Izyani binti Mujab, Hafsah Noor Azam, Chee Guo Khoo, Han Ying Lim, Chee Seng Chan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Standard RLHF pipelines, Standard RLHF, reduce heterogeneous human, RLHF pipelines, heterogeneous human judgments

Comments: 28 pages. When AI learns from human feedback, it forces a single "correct" answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away

Click to view abstract

Abstract:Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.

50. 【2606.10554】Benchmarking Knowledge Editing using Logical Rules

Link: https://arxiv.org/abs/2606.10554

Authors: Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, knowledge, knowledge editing

Comments: Accepted at the 24th International Semantic Web Conference 2025

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.

51. 【2606.10537】Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Link: https://arxiv.org/abs/2606.10537

Authors: Jing Xiong, Qi Han, Shansan Gong, Yunta Hsieh, Chengyue Wu, Chaofan Tao, Chenyang Zhao, Ngai Wong

Categories: Computation and Language (cs.CL)

Keywords: Diffusion large language, large language models, Diffusion Language Models

Comments: Technical Report

Click to view abstract

Abstract:Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1--28.0x speedup at 8K--32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at this https URL.

52. 【2606.10531】LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Link: https://arxiv.org/abs/2606.10531

Authors: Haoyu Wang, Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Quantization-aware training, extremely low-bit large, low-bit large language, large language models, essential for extremely

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1% to 10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

53. 【2606.10528】Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

Link: https://arxiv.org/abs/2606.10528

Authors: Guozheng Li, Xiyan Fu, Yiwen Guo

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Current reinforcement learning, methods primarily rely, Current reinforcement, trained reward model, advantage estimation

Comments:

Click to view abstract

Abstract:Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph, where nodes correspond to responses and edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
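
The graph-propagation idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the normalization, or the propagation rule, so cosine similarity, clipping of negative edges, a single propagation step, the mixing weight `alpha`, and mean-centered group advantages are all assumptions, and every function name here is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def group_advantages(rewards):
    """Baseline advantage: reward minus the group mean (assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def propagate_advantages(rewards, hidden_states, alpha=0.5):
    """One propagation step over a similarity graph of sampled responses.

    Nodes are responses, edge weights are clipped cosine similarities of
    their (hypothetical) reward-model hidden states, and each advantage is
    mixed with the similarity-weighted average of its neighbors' advantages.
    """
    adv = group_advantages(rewards)
    n = len(adv)
    out = []
    for i in range(n):
        weights = [
            max(cosine(hidden_states[i], hidden_states[j]), 0.0) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        if total == 0:
            # Isolated node: no neighbors to borrow information from.
            out.append(adv[i])
            continue
        neighbor_avg = sum(w * a for w, a in zip(weights, adv)) / total
        out.append((1 - alpha) * adv[i] + alpha * neighbor_avg)
    return out

# Toy group: responses 0 and 1 share a hidden state, response 2 is orthogonal.
smoothed = propagate_advantages([1.0, 0.0, 0.0], [[1, 0], [1, 0], [0, 1]])
```

In this toy group the two responses with identical hidden states pull each other's advantages together, while the orthogonal response keeps its original value. Setting `alpha=0` recovers the plain mean-centered advantages.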

54. 【2606.10520】UniSVQ: 2-bit Unified Scalar-Vector Quantization

Link: https://arxiv.org/abs/2606.10520

Authors: Haoyu Wang, Haiyan Zhao, Xingyu Yu, Zhangyang Yao, Xu Han, Zhiyuan Liu, Maosong Sun

Categories: Computation and Language (cs.CL)

Keywords: large language models, level enables low-cost, enables low-cost deployment, Post-training quantization, level enables

Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are the two primary quantization methods; however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.
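
To make "codewords as an affine transform of integer lattices" concrete, here is a minimal sketch under stated assumptions. Each codeword is `A @ z + b` for an integer vector `z` with entries in `{0, 1, 2, 3}` (2 bits per weight). The matrix `A`, the offset `b`, the vector dimension, and the brute-force nearest-codeword search are illustrative choices, not taken from the paper, and the function names are hypothetical.

```python
from itertools import product

def affine_codeword(z, A, b):
    """Map an integer lattice point z to a real-valued codeword A @ z + b."""
    return [sum(A[r][c] * z[c] for c in range(len(z))) + b[r] for r in range(len(b))]

def quantize(w, A, b, levels=4):
    """Brute-force search for the lattice point whose codeword is nearest to w.

    With levels=4 every coordinate of z costs 2 bits. Exhaustive search is
    only feasible for tiny vector dimensions and serves as an illustration.
    """
    best_z, best_err = None, float("inf")
    for z in product(range(levels), repeat=len(w)):
        codeword = affine_codeword(z, A, b)
        err = sum((wi - ci) ** 2 for wi, ci in zip(w, codeword))
        if err < best_err:
            best_z, best_err = z, err
    return best_z, affine_codeword(best_z, A, b)

# A diagonal A gives an independent uniform grid per coordinate.
A = [[0.5, 0.0], [0.0, 0.5]]
b = [-0.75, -0.75]
z, w_hat = quantize([0.3, -0.8], A, b)
```

With a diagonal `A` the codebook is a per-coordinate uniform grid, which is ordinary scalar quantization. A non-diagonal `A` shears the lattice, so codewords no longer sit on an axis-aligned grid, while only the integers `z` need to be stored.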

55. 【2606.10481】Advancing the State-of-the-Art in Empirical Privacy Auditing

Link: https://arxiv.org/abs/2606.10481

Authors: Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Keywords: large language models, exhibit problematic memorization, Parameter-efficient fine-tuning, large language, exhibit problematic

Comments:

Click to view abstract

Abstract:Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.

56. 【2606.10475】Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Link: https://arxiv.org/abs/2606.10475

Authors: Jakub Masłowski, Jarosław A. Chudziak

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multi-agent debate frameworks, improve large language, large language model, language model performance, heavily favors final

Comments: Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information Engineering Systems (KES 2026)

Click to view abstract

Abstract:Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy over process stability. During long-horizon exchanges, reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent identity loss and maintain process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment that introduces diversity distinct from standard debate settings. Over 270 fully factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift, $\Delta \le -0.20$) in more than 95% of perturbed runs, increasing the overall argument quality from 0.694 to 0.822. Our primary contribution is the demonstration that architectural decoupling is an important factor in enhancing systemic resilience under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that proper doctrinal grounding can be as important for argument quality as prospective planning. According to our initial metric evaluations, KG-CFR reduces semantic looping by preserving the agent's consistency with the original plan.

57. 【2606.10471】Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

Link: https://arxiv.org/abs/2606.10471

Authors: Dhruv Dixit

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: deep learning techniques, advanced deep learning, Neural Tensor Network, Recursive Neural Tensor, distributed sentence representations

Comments: 12 pages

Click to view abstract

Abstract:In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.

58. 【2606.10467】Large Language Models as Modal Models in Linguistics

Link: https://arxiv.org/abs/2606.10467

Authors: Haruto Suzuki, Saku Sugawara

Categories: Computation and Language (cs.CL)

Keywords: rapid advancement, advancement of large, large language models, LLMs, linguistic theory

Comments:

Click to view abstract

Abstract:The rapid advancement of large language models (LLMs) has intensified debates about their significance for linguistic theory. These debates are commonly divided into three positions: insulationism, which regards LLMs as irrelevant to human language; eliminativism, which claims that LLMs can replace traditional linguistic theories; and conciliationism, which views them as useful tools for linguistic research. To clarify these positions, this paper applies the framework of modal modeling from the philosophy of science. We argue that LLMs possess genuine epistemic value as minimal models, even without structural correspondence to human cognition. In particular, they can provide how-possibly explanations (HPEs) by testing modal claims about language acquisition and linguistic competence. We then examine the conditions under which LLMs could qualify as how-actually explanations (HAEs) of human language, drawing on the mechanistic account of scientific explanation. We argue that current LLMs do not yet satisfy these requirements. On the basis of this analysis, we propose understanding the explanatory power of LLMs as lying on a continuum between HPEs and HAEs. This framework avoids both overstating and understating their explanatory significance and offers a more precise basis for evaluating the role of LLMs in the scientific study of language.

59. 【2606.10461】ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

Link: https://arxiv.org/abs/2606.10461

Authors: Xianlin Zeng, Fan Xia, Xiangyu Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: incorporate textual node, rich relational semantics, textual node attributes, describe rich relational, Graph Neural Networks

Comments: Accepted to ICML 2026

Click to view abstract

Abstract:Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose an Energy-based Representation Alignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenarios.

60. 【2606.10460】LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Link: https://arxiv.org/abs/2606.10460

Authors: Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Recent large language, shown rapid progress, large language models, Recent large, language models

Comments:

Click to view abstract

Abstract:Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

61. 【2606.10459】Leveraging Social Media Data for COVID-19 Studies

Link: https://arxiv.org/abs/2606.10459

Authors: Nur Hafieza Ismail, Nur Shazwani Kamarudin, Nurol Husna Che Rose

Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)

Keywords: widely preferred sources, social media, social media networks, widely preferred, preferred sources

Comments: 8 pages, 1 figure

Click to view abstract

Abstract:Nowadays, social media networks have become widely preferred sources of information. Especially during the coronavirus disease 2019 (COVID-19) pandemic, social media has been one of the most used platforms for getting the latest news and information related to COVID-19. Social media are popular because they offer free access to their registered users and allow them to post, disseminate information, and respond to others' posts. With almost 4.6 billion social media users worldwide, it is not surprising that the significant amount of information shared through these platforms could affect how people perceive and cope with the pandemic that we are facing right now. Used appropriately, social media can be a beneficial digital tool to spread reliable news and public awareness for patients, clinicians, and society. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. It explores and discusses in detail related studies of social media platform usage during the COVID-19 pandemic. This chapter also categorizes the social media data used, introduces the different machine learning, feature engineering, natural language processing, and survey methods deployed, and outlines directions for future research.

62. 【2606.10445】SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

Link: https://arxiv.org/abs/2606.10445

Authors: Jaeseong Lee, Seung-won Hwang, Samyam Rajbhandari

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: modern accelerators, widely supported, supported by modern, sparsity, theoretical speedup

Comments:

Click to view abstract

Abstract:Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.
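
The hybrid format described in this abstract, a 2:4 sparse region plus a dense region, can be sketched as follows. The abstract does not describe the two region-selection strategies or the pruning criterion, so this sketch makes two loudly labeled assumptions: weights are pruned by magnitude, and the dense region is simply the trailing columns. Function names are hypothetical and this is not SpenseGPT itself.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in every group of 4, zero the rest."""
    assert len(row) % 4 == 0, "2:4 sparsity needs a length divisible by 4"
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

def hybrid_prune(matrix, sparse_cols):
    """Prune the first `sparse_cols` columns to 2:4 and leave the rest dense.

    Choosing the trailing columns as the dense region is a placeholder for
    the paper's selection strategies, which the abstract does not describe.
    """
    return [prune_2_4(row[:sparse_cols]) + list(row[sparse_cols:]) for row in matrix]

def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0.0)
    return zeros / total

W = [[0.1, -0.9, 0.4, 0.05, 0.3, 0.2, -0.7, 0.6]]
W_hybrid = hybrid_prune(W, sparse_cols=4)
```

Here half the columns are 2:4 sparse and half are dense, so the overall sparsity drops from the strict 50% to 25%, which is the relaxation the abstract refers to. Each region can still be handed to a standard sparse or dense GEMM routine.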

63. 【2606.10439】Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

Link: https://arxiv.org/abs/2606.10439

Authors: Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: automatic speech recognition, challenging research direction, large language models, speech recognition, making their effective

Comments: Accepted by ICASSP 2026

Click to view abstract

Abstract:The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
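
For readers unfamiliar with Continuous Integrate-and-Fire, the sketch below shows the generic mechanism as it is commonly described, not this paper's specific variant. Per-frame weights are accumulated until they reach a threshold, at which point one integrated output vector is emitted. The weights are passed in directly here, whereas in a real system they would be predicted by a network. Each weight is assumed to be below the threshold, and the function name is hypothetical.

```python
def cif(frames, alphas, threshold=1.0):
    """Downsample a frame sequence with integrate-and-fire accumulation.

    frames: list of equal-length feature vectors, one per input frame.
    alphas: per-frame weights, each assumed to lie in [0, threshold).
    Returns one integrated vector per firing. Any weight left over after
    the last firing is discarded in this sketch.
    """
    outputs = []
    acc_weight = 0.0
    acc_vector = [0.0] * len(frames[0])
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Keep integrating: add this frame's weighted contribution.
            acc_weight += alpha
            acc_vector = [v + alpha * x for v, x in zip(acc_vector, frame)]
        else:
            # Fire: use just enough of this frame to reach the threshold,
            # then carry the remaining weight into the next output.
            used = threshold - acc_weight
            outputs.append([v + used * x for v, x in zip(acc_vector, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vector = [rest * x for x in frame]
    return outputs

# Four frames with weight 0.5 each produce two outputs.
pooled = cif([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5])
```

The output length depends on the sum of the weights rather than on a fixed stride, which is what makes the downsampling dynamic.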

64. 【2606.10435】Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

Link: https://arxiv.org/abs/2606.10435

Authors: Muhammad Ahmed

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: self-attention scales quadratically, Transformers achieve strong, causal self-attention scales, strong language modeling, language modeling performance

Comments: 17 pages, 5 figures, and 6 tables. Experiments on WikiText-103, PG-19, and WikiText-2 using TPU v4-32 and NVIDIA RTX 3060 hardware. Code: [this https URL](https://github.com/ahmed123hds/PCAF)

Click to view abstract

Abstract:Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this cost, yet compress history into sequentially updated fixed-size states. This paper studies a third primitive: a parallel content-addressed memory over causal successor records. The proposed Parallel Causal Associative Field (PCAF) writes local records from a context window into hash buckets, retrieves a bounded candidate set for the current query, forms a sparse cache distribution over successor tokens, and mixes that cache with a parametric local language model through a learned gate. The resulting model maintains sparse long-context access while avoiding a single fixed recurrent state bottleneck. We evaluate PCAF under full autoregressive pretraining on WikiText-103 and PG-19 using a distributed Google Cloud TPU v4-32 pod. At 303M parameters and context length T = 2048, PCAF-semantic reaches 36.31 perplexity on WikiText-103 and 52.45 perplexity on PG-19, compared with 47.49 and 53.84 for a matched dense Transformer. PCAF-semantic simultaneously processes 0.61-0.62M tokens/s across the TPU pod, versus 0.43M tokens/s for dense and local attention baselines. Supporting 41M-parameter multi-seed sweeps and single-GPU component ablations show that the associative cache, retrieval capacity, and learned gate materially affect the speed-quality trade-off.
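
The write, retrieve, and mix steps listed in this abstract can be sketched in a few lines. This is a toy illustration, not the PCAF model: keys are exact tuples of the previous `n` tokens stored in a Python dict (the paper's hashing and semantic keys are not described in the abstract), the gate is a fixed scalar rather than a learned function, and the local language model is a hand-written probability table. All names are hypothetical.

```python
from collections import Counter, defaultdict

def build_memory(tokens, n=2):
    """Write causal successor records: key = previous n tokens, value = next token."""
    buckets = defaultdict(list)
    for i in range(n, len(tokens)):
        buckets[tuple(tokens[i - n:i])].append(tokens[i])
    return buckets

def cache_distribution(buckets, query, max_candidates=8):
    """Retrieve a bounded candidate set and normalize successor counts."""
    candidates = buckets.get(tuple(query), [])[-max_candidates:]
    counts = Counter(candidates)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

def mix(p_lm, p_cache, gate):
    """Gated mixture of the sparse cache distribution and the local LM."""
    if not p_cache:
        # Nothing retrieved: fall back to the local model alone.
        return dict(p_lm)
    vocab = set(p_lm) | set(p_cache)
    return {t: gate * p_cache.get(t, 0.0) + (1 - gate) * p_lm.get(t, 0.0) for t in vocab}

context = "a b c a b d a b".split()
memory = build_memory(context, n=2)
p_cache = cache_distribution(memory, context[-2:])
p_next = mix({"c": 0.2, "d": 0.2, "e": 0.6}, p_cache, gate=0.5)
```

Because records are only written for positions whose successor already appears in the context, retrieval never looks ahead of the query position. The candidate cap keeps the per-query cost bounded regardless of context length.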

65. 【2606.10428】Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning

Link: https://arxiv.org/abs/2606.10428

Authors: Thamali Wijewardhana, Napoleon H. Reyes, Surangika Ranathunga

Categories: Computation and Language (cs.CL)

Keywords: multilingual instruction tuning, instruction tuning, investigate whether commonly, multilingual instruction, basic LoRA

Comments:

Click to view abstract

Abstract:We investigate whether commonly available LoRA variants have an advantage over basic LoRA in multilingual instruction tuning. Experiments involving LoRA and four other variants on two datasets across diverse target languages show that there is no significant advantage in using more complex LoRA variants instead of basic LoRA, with respect to balancing cross-lingual transfer and knowledge retention. An analysis of hidden embeddings reveals that layer-wise language representation remains largely similar across LLMs fine-tuned with different LoRA techniques, suggesting that the architectural novelty of LoRA techniques may not translate into better cross-lingual adaptation.

66. 【2606.10423】WebChallenger: A Reliable and Efficient Generalist Web Agent

Link: https://arxiv.org/abs/2606.10423

Authors: Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

Categories: Computation and Language (cs.CL)

Keywords: Autonomous web navigation, navigation remains challenging, challenging for LLM, web navigation remains, generalist systems rely

Comments:

Click to view abstract

Abstract:Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at this https URL

67. 【2606.10403】KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Link: https://arxiv.org/abs/2606.10403

Authors: Sanghee Park, Geewook Kim, Kee-Eung Kim

Categories: Computation and Language (cs.CL)

Keywords: Math reasoning benchmarks, difficulty signal grounded, Scholastic Ability Test, per-item difficulty signal, Korean College Scholastic

Comments: 18 pages, 14 figures, 8 tables

Click to view abstract

Abstract:Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at this https URL.

68. 【2606.10402】Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

Link: https://arxiv.org/abs/2606.10402

Authors: Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: inspect failed attempts, long time horizons, researchers share partial, inspect failed, failed attempts

Comments:

Click to view abstract

Abstract:Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.

69. 【2606.10400】Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Link: https://arxiv.org/abs/2606.10400

Authors: Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: memorized world knowledge, inflates benchmark scores, ungrounded answers, Vision-language models, world knowledge

Comments: 17 pages, 7 figures, submitted to EMNLP 2026

Click to view abstract

Abstract:Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

70. 【2606.10398】Selection, Not Salience: The Shape and Limits of Personalization in Social Highlighting

Link: https://arxiv.org/abs/2606.10398

Authors: Kazuki Nakayashiki, Keisuke Watanabe

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)

Keywords: selection signal, document, person history identifies, reader sees pay, co-readership identity control

Comments: 9 pages, 1 figure, 3 tables

Click to view abstract

Abstract:Does personalizing what a reader sees pay off, and where does it stop? Using a social web highlighter and a co-readership identity control (the same document highlighted by many users, which holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does), we map the shape and limits of personalization across reading altitudes. At the document altitude we give the clean, leakage-free, identity-controlled measurement that prior next-document evaluations could only upper-bound: a person's history identifies which documents in a co-reading neighborhood are theirs, with an own-versus-other gap of +0.169 against community negatives and +0.119 against topic-matched hard negatives (both highly significant); a content-based arm suggests the signal is not purely title-driven but is largely thematic. This is comparable to the span-level selection signal (+0.14) from our prior work: the selection signal is of comparable magnitude across altitudes (+0.12 to +0.17), most of it stable topic preference. At the sentence altitude, a two-stage personalized auto-highlight (an impersonal model proposes candidates, a personal model re-ranks them) does not improve on its impersonal baseline: two off-the-shelf zero-shot LLMs, including a frontier model, predict highlight locations worse than a lead baseline, and personal re-ranking is beaten by the salience order even on the highest-recall candidate pool, so the null is not merely a Stage-1 ceiling artifact. Measurable personalization appears primarily at the selection layer: modest (~+0.13), topic-dominated, with no reliable gain at the salience layer. We also surface a control-in-negatives bias that inflated our document gap to a spurious +0.227 until audited. Going beyond the shared salience layer may be better approached by aggregating individuals than by personalizing them harder.

71. 【2606.10380】Expert-Level Crisis Detection in Mental Health Conversations

Link: https://arxiv.org/abs/2606.10380

Authors: Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: research largely focuses, Real-world crisis intervention, existing research largely, intervention is inherently, research largely

Comments: none

Abstract:Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

72. 【2606.10369】PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

Link: https://arxiv.org/abs/2606.10369

Authors: Xinyue Peng, Yi Qian, Jiaojiao Lin, Wenjian Shao, Yanming Liu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: fixed computation budgets, large language models, grow model capacity, continue to scale, computation budgets

Comments: Published in ICML 2026

Abstract:As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II-IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

73. 【2606.10338】Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

Link: https://arxiv.org/abs/2606.10338

Authors: Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, architectures remains underexplored, remains underexplored, Machine unlearning, increasingly important

Comments: none

Abstract:Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget-retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose TRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.

74. 【2606.10327】The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

Link: https://arxiv.org/abs/2606.10327

Authors: Ali Keramati, Mark Warschauer

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Automated Essay Scoring, Automated Essay, interdependent discourse elements, judge interdependent discourse, Essay Scoring

Comments: none

Abstract:Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger large language models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.

75. 【2606.10316】TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

Link: https://arxiv.org/abs/2606.10316

Authors: Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Qingyang Mao, Yitong Zhou, Qi Liu

Categories: Computation and Language (cs.CL)

Keywords: requires substantial manual, substantial manual effort, structured data analysis, domain expertise, widely used representations

Comments: 5 pages, 2 figures

Abstract:Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.

76. 【2606.10315】Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

Link: https://arxiv.org/abs/2606.10315

Authors: Sawyer Zhang, Alexander Wang, Sophie Lei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: evaluating conversational agents, LLM judge catches, default instrument, instrument for evaluating, evaluating conversational

Comments: 13 pages, 1 figure, 5 tables

Abstract:LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
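The prevalence argument at the end of this abstract can be illustrated with the standard Rogan-Gladen estimator, which corrects an apparent rate using a detector's sensitivity and specificity. The sketch below is not the paper's code: the function name, the assumption of perfect specificity, the 5% apparent rate, and the use of the 2-of-9 recall figure as a stand-in for sensitivity are all illustrative assumptions.

```python
def rogan_gladen(apparent_rate, sensitivity, specificity=1.0):
    """Estimate true prevalence from an apparent (observed) rate.

    Standard form: (apparent + specificity - 1) / (sensitivity + specificity - 1),
    clipped to the interval [0, 1].
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


# Assumed inputs for illustration only (not figures reported by the paper).
sensitivity = 2 / 9   # stand-in: the judge caught 2 of 9 confirmed patterns
apparent = 0.05       # hypothetical nonzero rate reported by the gate

corrected = rogan_gladen(apparent, sensitivity)
undercount_factor = corrected / apparent

# With an apparent rate of zero and perfect specificity, the estimate stays
# at zero whatever the sensitivity is, so the correction carries no signal.
zero_case = rogan_gladen(0.0, sensitivity)

print(f"corrected={corrected:.3f} undercount={undercount_factor:.1f}x zero_case={zero_case}")
```

Under these assumptions the undercount factor is simply 1 / sensitivity, which is why a low-recall judge implies a multiple of the reported rate whenever that rate is nonzero.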

77. 【2606.10307】Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Link: https://arxiv.org/abs/2606.10307

Authors: Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

Categories: Computation and Language (cs.CL)

Keywords: Evaluating reasoning quality, Evaluating reasoning, reference answers, open-ended tasks, tasks without reference

Comments: 15 pages, 8 figures, 4 tables; ACL Proceedings

Abstract:Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
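One way to picture the comparison described above is to compute a mean log-probability over only the first few generated tokens and over the full sequence, then correlate each proxy with judge scores. The sketch below is a hypothetical illustration: the helper names, the window size of three tokens, and the toy numbers are made up and say nothing about the paper's actual data or results.

```python
import math
from statistics import mean


def mean_logprob(token_logprobs, first_k=None):
    """Mean token log-probability over the first k tokens (or all tokens)."""
    window = token_logprobs if first_k is None else token_logprobs[:first_k]
    if not window:
        raise ValueError("no tokens to score")
    return mean(window)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data, invented purely to exercise the functions.
responses = [
    {"logprobs": [-0.1, -0.2, -0.3, -2.0, -2.0], "judge": 4.5},
    {"logprobs": [-1.5, -1.2, -0.9, -0.1, -0.1], "judge": 2.0},
    {"logprobs": [-0.4, -0.5, -0.6, -1.0, -1.0], "judge": 4.0},
    {"logprobs": [-2.0, -1.8, -1.6, -0.2, -0.2], "judge": 1.5},
]

judge = [r["judge"] for r in responses]
early = [mean_logprob(r["logprobs"], first_k=3) for r in responses]
full = [mean_logprob(r["logprobs"]) for r in responses]

early_r = pearson(early, judge)
full_r = pearson(full, judge)
print(f"early-token r={early_r:.3f}, full-sequence r={full_r:.3f}")
```

On real decoding traces the per-token log-probabilities would come from the generation API in use; how large the early window should be is a tuning choice that the abstract describes only as "the first few generated tokens".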

78. 【2606.10304】MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Link: https://arxiv.org/abs/2606.10304

Authors: Pratibha Revankar, Kargi Chauhan, Jihye Kim, Sadiba Nusrat Nur, Vincent Siu, Chenguang Wang

Categories: Computation and Language (cs.CL)

Keywords: resulting outputs evade, outputs evade output-side, encoding sensitive data, covertly encoding sensitive, LLM agents

Comments: none

Abstract:When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.

79. 【2606.10302】Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

Link: https://arxiv.org/abs/2606.10302

Authors: Cheng Zhang, Rui Xin, Chudi Zhong

Categories: Computation and Language (cs.CL)

Keywords: large language models, require a set, set of meaningfully, large language, produce similar generations

Comments: none

Abstract:Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. Existing test-time diversity methods operate at different stages of generation with varying effectiveness, but it remains unclear what design choices lead to meaningful diversity in the output. We introduce a framework that characterizes test-time diverse generation methods by the diversity source introduced during generation and provide a transmission score for measuring how effectively variation in the source reaches the final output. Guided by this framework, we propose fully automated specification-level generation methods that first generate diverse intermediate specifications and then condition on them to produce final responses. Across five open-ended tasks and four backbone models, specification-level injection improves output diversity over test-time baselines while maintaining comparable quality. Our analysis shows that successful diversity injection depends on both the diversity of the sources and their transmission to the output, highlighting source design and source-to-output realization as two key levers for building more diverse generation systems.

80. 【2606.10298】From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Link: https://arxiv.org/abs/2606.10298

Authors: Runze Jiang, Taiqiang Wu, Yan Wang, Bingyu Zhu, Longtao Huang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: central reliability bottleneck, large language models, language models generate, parametric priors remain, reliability bottleneck

Comments: 27 pages, 9 figures

Abstract:When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the conflict-aware paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a power family with an inherent regime asymmetry: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16-33 without sacrificing correction or agreement. Our code is available at this https URL.
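The "affine combination of prior and context logits" can be sketched with a single mixing weight. The parameterization below, (1 + alpha) * context - alpha * prior, is an assumption borrowed from the common context-aware decoding form and may differ from the paper's notation; the function names and toy logits are hypothetical. After the softmax, this combination is proportional to p_context^(1 + alpha) * p_prior^(-alpha), which is one reading of the "power family".

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(prior_logits, context_logits, alpha):
    """Affine logit combination: (1 + alpha) * context - alpha * prior.

    alpha > 0 extrapolates past the context distribution;
    -1 <= alpha <= 0 interpolates between prior (alpha = -1) and context (alpha = 0).
    """
    mixed = [(1 + alpha) * c - alpha * p
             for p, c in zip(prior_logits, context_logits)]
    return softmax(mixed)


# Toy two-token vocabulary (made-up numbers): the prior favours token 0,
# the context favours token 1.
prior = [2.0, 0.0]
context = [0.0, 1.0]

for alpha in (-0.5, 0.0, 1.0, 4.0):
    probs = combine(prior, context, alpha)
    print(alpha, [round(p, 4) for p in probs])
```

As alpha grows, the probability of the context-favoured token keeps rising regardless of whether the context is right, which is the extrapolation risk the abstract describes; negative alpha pulls the output back toward the prior instead.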

81. 【2606.10296】The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

Link: https://arxiv.org/abs/2606.10296

Authors: Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multi-agent debate systems, Multi-agent debate, answer is correct, designed to produce, systems are typically

Comments: 15 pages, 7 figures, 1 table, ACL proceedings

Abstract:Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.

82. 【2606.10287】When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

Link: https://arxiv.org/abs/2606.10287

Authors: Haji Gul, Ajaz Ahmad Bhat

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Knowledge Graph Completion, Evaluating Knowledge Graph, Evaluating Knowledge, Graph Completion, Knowledge Graph

Comments: none

Abstract:Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits@k, and Mean Rank, which often produce conflicting model orderings across datasets. A model that leads on MRR may trail on Hits@1, and strong performance on one dataset may not generalize to another. This fragmentation hinders comparison, enables selective reporting, and obscures real progress. We reframe KGC evaluation as a Multi-Criteria Decision-Making (MCDM) problem and present a meta-analysis of seven aggregators across five tests: consistency, cross-dataset stability, metric independence, robustness under noise, and generalizability. Each test is averaged over leave-one-model-out (LOMO) and leave-one-group-out (LOGO) removals so that reliability reflects aggregator behavior across diverse model subsets. Across tail (h, r, ?) and relation (h, ?, t) prediction, Pareto-optimal analysis identifies Z-score as the most balanced aggregator, which ranks DualE highest for tail prediction and FMS (Flow-Modulated Scoring) highest for relation prediction. A test-sensitivity analysis using the same removals shows that consistency and stability are largely removal-invariant, while generalizability and independence are the most sensitive. The framework resolves evaluation inconsistencies and offers evidence-based guidance for aggregator selection and model benchmarking in KGC.
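To make the metric disagreement concrete, the sketch below computes MRR, Hits@1, and Mean Rank from toy ranks and then averages per-metric z-scores across models. The aggregation shown is a generic z-score average and is only an assumption about what a "Z-score aggregator" might look like; the model names and ranks are invented.

```python
from statistics import mean, pstdev


def rank_metrics(ranks, k=1):
    """MRR, Hits@k and Mean Rank from 1-based ranks of the correct entity."""
    return {
        "mrr": mean(1.0 / r for r in ranks),
        f"hits@{k}": mean(1.0 if r <= k else 0.0 for r in ranks),
        "mean_rank": mean(ranks),
    }


def zscore_aggregate(scores, lower_is_better=("mean_rank",)):
    """Average per-metric z-scores per model, flipping lower-is-better metrics."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for metric in metrics:
        values = [scores[m][metric] for m in models]
        mu, sigma = mean(values), pstdev(values)
        for m in models:
            z = 0.0 if sigma == 0 else (scores[m][metric] - mu) / sigma
            totals[m] += -z if metric in lower_is_better else z
    return {m: totals[m] / len(metrics) for m in models}


# Toy ranks (made up): model_a gets one query exactly right and misses the
# rest badly; model_b is always second.
ranks = {
    "model_a": [1, 10, 10, 10],
    "model_b": [2, 2, 2, 2],
}
scores = {name: rank_metrics(r) for name, r in ranks.items()}
aggregate = zscore_aggregate(scores)

for name in ranks:
    print(name, scores[name], round(aggregate[name], 3))
```

In this toy case model_b leads on MRR and Mean Rank while model_a leads on Hits@1, so any single-metric ranking tells only part of the story; the z-score average is one simple way to fold the three views into one number.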

83. 【2606.10285】OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

Link: https://arxiv.org/abs/2606.10285

Authors: Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen

Categories: Computation and Language (cs.CL)

Keywords: diverse Verilog code, Verilog code, largest fully open-source, Verilog code samples, diverse Verilog

Comments: Accepted by ICLAD'25

Abstract:OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.

84. 【2606.10281】Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

Link: https://arxiv.org/abs/2606.10281

Authors: Aniket Anand, Yiwei Hou, Daniel Fields, Alex Kantchelian, David Tao, Kurt Thomas, Grant Ho

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: paper presents AuditBench, security-related system audit, system audit logs, investigating security-related system, paper presents

Comments: none

Abstract:This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

85. 【2606.10279】Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

Link: https://arxiv.org/abs/2606.10279

Authors: Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Supervised fine-tuning, clinical prediction tasks, widely assumed, tasks by teaching, synthetic rationale data

Comments: none

Abstract:Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

86. 【2606.10254】RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

Link: https://arxiv.org/abs/2606.10254

Authors: Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, students remains under-examined, achieved near-perfect performance, Language Models

Comments: Code available at https://github.com/RicharMd/RealMath-Eval; data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval

Abstract:While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (~2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark "Evaluation Gap": judges are considerably more accurate and consistent on synthetic text (MSE ~1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a "structural collapse" into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

87. 【2606.10199】A Continuous-Time Markov Chain Framework for Insertion Language Models

Link: https://arxiv.org/abs/2606.10199

Authors: Dhruvesh Patel, Benjamin Rozonoyer, Soumitra Das, Tahira Naseem, Tim G.J. Rudner, Andrew McCallum

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Insertion Language Models, offer several advantages, Language Models, masked diffusion models, Models

Comments: Accepted at AISTATS 2026. Code is available at https://github.com/dhruvdcoder/ctmc_dilm

Abstract:Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.

88. 【2606.10159】Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

Link: https://arxiv.org/abs/2606.10159

Authors: Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: support scientific peer, peer review, review, reviewer assistance, manuscript screening

Abstract: AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack-success-rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

89. 【2606.10156】$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Link: https://arxiv.org/abs/2606.10156

Authors: Bharath Sivaram Narasimhan, Karthik R Narasimhan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: multi-turn conversational interfaces, recommender systems transition, paradigms have struggled, agentic recommender systems, recommender systems

Abstract:As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $\tau$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $\tau$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at this https URL.
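
The abstract does not spell out how pass^k is computed. A minimal sketch, assuming the definition popularized by τ-bench: the chance that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials. The function name and the sample counts are hypothetical, and the estimator used in τ-Rec may differ.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k independent trials of a task all succeed.

    successes_per_task: number of successful trials (out of n_trials) for each task.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: 3 tasks, 4 trials each, with 4, 2 and 0 successful trials.
counts = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n_trials=4, k=k), 4))
```

Because a task only counts toward pass^k when k trials in a row succeed, the score can only stay flat or fall as k grows, which is the "reliability cliff" pattern the abstract reports.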

90. 【2606.10147】From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Link: https://arxiv.org/abs/2606.10147

Authors: Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Multimodal Large Language, Large Language Models, Audio-Visual Large Language, Multimodal Large, Large Language

Comments: 40 pages, 29 figures

Abstract:Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

91. 【2606.10126】Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

Link: https://arxiv.org/abs/2606.10126

Authors: Tunazzina Islam

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: Personalized persuasive text, persuasive text generation, introduce unequal framing, relevance and engagement, persuasive text

Abstract:Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.
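
To make the "Pareto-style candidate selection" idea concrete, here is a minimal sketch of non-dominated filtering over two objectives. The candidate names and scores are hypothetical, higher is assumed to be better on both axes, and the paper's actual selection rule and gating are not reproduced here.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: score
        for name, score in candidates.items()
        if not any(
            dominates(other_score, score)
            for other, other_score in candidates.items()
            if other != name
        )
    }


# Hypothetical (fairness, personalization) scores for four alignment strategies.
candidates = {
    "sft": (0.70, 0.90),
    "dpo": (0.85, 0.75),
    "revise": (0.60, 0.80),
    "base": (0.50, 0.95),
}
print(sorted(pareto_front(candidates)))  # ['base', 'dpo', 'sft']
```

Here "revise" drops out because "sft" beats it on both axes, while the remaining three each win on at least one objective, mirroring the finding that no single strategy dominates.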

92. 【2606.10113】Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing

Link: https://arxiv.org/abs/2606.10113

Authors: Antonio Castaldo, Johanna Monti, Sheila Castilho

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM translations exhibit, Margaret Atwood Oryx, translations exhibit identifiable, exhibit identifiable emotional, identifiable emotional profiles

Abstract:This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood's Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science-fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to a limited preservation of an author's voice.

93. 【2606.10087】CodeAlchemy: Synthetic Code Rewriting at Scale

Link: https://arxiv.org/abs/2606.10087

Authors: Ankit Gupta, Aditya Prasad, Rameswar Panda

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Pre-training on raw, real-world task formats, raw code teaches, code teaches syntax, diverse real-world task

Abstract:Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.

94. 【2606.10061】BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

Link: https://arxiv.org/abs/2606.10061

Authors: Kazi Noshin, Sajib Acharjee Dip, Ranat Das Prangon, Fardin Hassan Tamim, Syed Ishtiaque Ahmed, Liqing Zhang, Sharifa Sultana

Categories: Computation and Language (cs.CL)

Keywords: Large language models, sensitive social conversations, Large language, increasingly participate, shift from balanced

Abstract:Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.

95. 【2606.10059】Compiling Rewrite Rules to Finite-State Transducers with the Worsening Trick

Link: https://arxiv.org/abs/2606.10059

Authors: Mans Hulden, Michael Ginn

Categories: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)

Keywords: natural language processing, modeling string rewriting, morphological rewrite rules, Finite-state transducers, essential for modeling

Comments: 17 pages, 6 figures, tool track proceedings at CIAA 2026

Abstract: Finite-state transducers (FSTs) are essential for modeling string rewriting in computational linguistics and natural language processing (NLP), particularly for phonological and morphological rewrite rules. Compiling general rewrite rules of the form $A \to B / L \, \_ \, R$, where $A$, $B$, $L$, and $R$ are arbitrary regular languages, is complex due to overlapping matches and context constraints. Traditional methods, such as those by Kaplan and Kay or Karttunen, rely on intricate transducer compositions with auxiliary markers. This paper presents a compact compilation scheme based on the "worsening trick": generate all legal rewrite candidates, then filter candidates that are worse than another candidate for the same input. Implemented as the built-in rewrite compiler in PyFoma, the construction supports multiple contexts, arbitrary transductions, markup, directed rewriting, weights, and parallel rewriting. The resulting formulas are short and uniform, and where semantics coincide, they reproduce the same rule transducers as earlier approaches while remaining easier to extend. The implementation has been validated against foma on both a substantial collection of rewrite grammars and an automated regression suite covering the major rewrite modalities, with the resulting transducers matching exactly apart from state numbering.

96. 【2606.10029】Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Link: https://arxiv.org/abs/2606.10029

Authors: Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Language models increasingly, single residual stream, Language models, generated speech tokens, speech tokens share

Abstract: Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.

97. 【2606.09937】RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

Link: https://arxiv.org/abs/2606.09937

Authors: Anirudh Sekar

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM reasoning pipelines, multi-branch LLM reasoning, training-free inference framework, multi-branch LLM, LLM reasoning

Comments: Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Abstract:We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at this https URL.
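
The second CGEE mechanism (stop the verification pass once per-layer entropy stabilises) can be illustrated with a toy rule. This is only a sketch under assumptions: the tolerance `tol`, the `patience` window, and the per-layer distributions are all hypothetical, and the abstract does not state the paper's actual stabilisation criterion.

```python
from math import log


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * log(p) for p in probs if p > 0)


def exit_layer(layer_probs, tol=0.05, patience=2):
    """Index of the first layer whose entropy has moved by less than tol
    for `patience` consecutive layers; falls back to the last layer."""
    stable = 0
    prev = None
    for idx, probs in enumerate(layer_probs):
        h = entropy(probs)
        if prev is not None and abs(h - prev) < tol:
            stable += 1
            if stable >= patience:
                return idx
        else:
            stable = 0
        prev = h
    return len(layer_probs) - 1


# Hypothetical next-token distributions read out at six successive layers.
layers = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
]
print(exit_layer(layers))  # 4
```

In a real model the distributions would come from hooks on intermediate layers; the sketch only shows the gating logic that decides when the remaining layers can be skipped.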

98. 【2606.09927】Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

Link: https://arxiv.org/abs/2606.09927

Authors: Patrik Czakó, Gábor Kertész, Sándor Szénási

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, outlier-dominated channels lead, quantization remains difficult

Comments: 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

Abstract:Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.
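
The idea of swapping the activation maximum for a high quantile can be sketched with the SmoothQuant-style migration formula s = act_stat^alpha / weight_max^(1 - alpha). This is an assumption for illustration: the abstract does not give the exact formula used with SmoothRot, and the helper names and sample values below are hypothetical.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def channel_scale(act_abs, weight_abs_max, alpha=0.5, q=1.0):
    """Per-channel smoothing scale; q=1.0 reproduces the max-based policy.

    act_abs: absolute activation values observed for one channel.
    weight_abs_max: largest absolute weight for that channel (must be > 0).
    """
    act_stat = quantile(act_abs, q)
    return act_stat ** alpha / weight_abs_max ** (1 - alpha)


# Hypothetical channel: 99 ordinary activations and a single large outlier.
acts = [1.0] * 99 + [100.0]
print(channel_scale(acts, weight_abs_max=1.0, q=1.0))  # 10.0
print(channel_scale(acts, weight_abs_max=1.0, q=0.9))  # 1.0
```

With the max-based statistic a single outlier drives the scale to 10, pushing most of the quantization difficulty onto the weights; the 0.9 quantile ignores it, which is the over-migration effect the abstract sets out to control.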

99. 【2606.09900】Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Link: https://arxiv.org/abs/2606.09900

Authors: Liuyin Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Long-term memory, sessions they forget, common workaround, distractors accumulate, LLM agents

Comments: 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: [this https URL](https://github.com/ly-wang19/engram)

Abstract: Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
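
A bi-temporal store with "invalidate, never delete" semantics and an as-of filter can be sketched in a few lines. Everything here is hypothetical and simplified: the `Fact` and `FactStore` names, the integer timestamps, and the assumption that each (subject, predicate) pair holds one value at a time and that facts arrive in event-time order. It is not Engram's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int                        # event time: when the fact became true
    recorded_at: int                       # ingestion time: when the store learned it
    invalidated_at: Optional[int] = None   # event time at which it stopped holding
    superseded_by: Optional[int] = None    # index of the fact that replaced it


class FactStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def add(self, fact: Fact) -> int:
        """Append a fact; close (never delete) any live fact it contradicts."""
        new_id = len(self.facts)
        for old in self.facts:
            same_slot = (old.subject, old.predicate) == (fact.subject, fact.predicate)
            if same_slot and old.invalidated_at is None:
                old.invalidated_at = fact.valid_from
                old.superseded_by = new_id
        self.facts.append(fact)
        return new_id

    def as_of(self, t: int, known_at: Optional[int] = None) -> List[Fact]:
        """Facts true at event time t, using only what was recorded by known_at."""
        visible = []
        for f in self.facts:
            if known_at is not None and f.recorded_at > known_at:
                continue
            closed = f.invalidated_at is not None
            if closed and known_at is not None:
                # Ignore an invalidation the store had not yet learned about.
                closed = self.facts[f.superseded_by].recorded_at <= known_at
            if f.valid_from <= t and not (closed and f.invalidated_at <= t):
                visible.append(f)
        return visible


store = FactStore()
store.add(Fact("alice", "lives_in", "Paris", valid_from=1, recorded_at=2))
store.add(Fact("alice", "lives_in", "Berlin", valid_from=5, recorded_at=6))
print([f.obj for f in store.as_of(3)])              # ['Paris']
print([f.obj for f in store.as_of(7)])              # ['Berlin']
print([f.obj for f in store.as_of(7, known_at=4)])  # ['Paris']
```

Keeping the superseded fact with a pointer to its replacement is what preserves provenance: the third query can still answer "what did the agent believe at ingestion time 4?" even though that belief was later corrected.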

100. 【2606.09894】A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

Link: https://arxiv.org/abs/2606.09894

Authors: Sophie Zhao

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: psychological accounts, ranging from reactive, reactive and self-focused, human-interpretable consciousness spectrum, self-focused patterns

Abstract:Across contemplative, philosophical, and psychological accounts, human consciousness is often described along a similar spectrum, ranging from reactive and self-focused patterns to more integrative and coherent ones. Understanding whether language models encode such a structured, human-interpretable consciousness spectrum in representation space is important for model guidance, evaluation and alignment. In this work, we study the geometric structure and dynamics of patterns along this spectrum in transformer embedding spaces. We show that embeddings exhibit a globally organized geometry aligned with this spectrum: sentences associated with similar states cluster into locally coherent regions, forming a structured manifold. In particular, higher-level and lower-level regions exhibit convexity-like stability, while intermediate regions form a transition corridor. Dynamically, both utility-guided and geometry-only greedy trajectories consistently traverse from lower- to higher-level regions, passing through intermediate tiers, indicating that navigability is an intrinsic property of the representation space, guided but not dictated by a global directional signal. These results suggest that embedding spaces encode structured and navigable geometry aligned with a hypothesized consciousness-spectrum taxonomy, broadly inspired by recurring structural descriptions of human consciousness across contemplative traditions, philosophy, and modern psychology, providing a representation-level perspective for analyzing and guiding model behavior.

101. 【2606.09890】PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

Link: https://arxiv.org/abs/2606.09890

Authors: Hainiu Xu, Italo Luis da Silva, Jiangnan Ye, Yuhao Wang, Wei Liu, Linyi Yang, Jonathan Richard Schwarz, Nicola Paoletti, Yulan He, Hanqi Yan

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, autonomous agents capable, Large language, executing multi-step action, increasingly deployed

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

102. 【2606.09887】SocraticPO: Policy Optimization via Interactive Guidance

Link: https://arxiv.org/abs/2606.09887

Authors: Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Reinforcement learning, scalar outcome rewards, binary correctness, Socratic Policy Optimization

Abstract: Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
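
The rollout described above (answer alone first, receive guidance only after a mistake, earn a discounted reward for assisted success) can be sketched as follows. The geometric decay, the `max_rounds` cap, and the stub student, teacher and checker are all assumptions made for illustration; the abstract does not specify the decay schedule.

```python
def rollout(question, student, teacher, is_correct, max_rounds=2, decay=0.5):
    """Run one guided rollout and return (final_answer, reward).

    An unassisted correct answer earns 1.0; each teacher intervention
    multiplies the achievable reward by `decay`; a final miss earns 0.0.
    """
    context = question
    answer = None
    for round_idx in range(max_rounds + 1):
        answer = student(context)
        if is_correct(answer):
            return answer, decay ** round_idx
        if round_idx < max_rounds:
            # The teacher sees the failed attempt and appends text-only guidance.
            context = context + "\n" + teacher(context, answer)
    return answer, 0.0


# Hypothetical stubs: this student only gets it right after seeing a hint.
def needs_hint_student(context):
    return "4" if "hint" in context else "5"


def stub_teacher(context, answer):
    return "hint: recheck the addition"


def check(answer):
    return answer == "4"


print(rollout("What is 2 + 2?", needs_hint_student, stub_teacher, check))
print(rollout("What is 2 + 2?", lambda c: "4", stub_teacher, check))
```

The first call returns a reward of 0.5 and the second 1.0, so a policy trained on these rewards still prefers solving the problem without help.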

103. 【2606.09877】Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

Link: https://arxiv.org/abs/2606.09877

Authors: Juan M. Huerta

Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

Keywords: information landscape evolves, underlying information landscape, LLM wiki systems, LLM wiki, wiki systems compile

Abstract:LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus -- an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $\phi_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$, acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an $O(\sqrt{T\log K})$ regret bound where $\varepsilon=\mathbb{E}[|\phi_t-\hat\phi_t|]$ is the only domain-specific quantity. We instantiate in two domains: finance, where $\phi_t$ is abnormal stock volatility predicted by frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles); and Wikipedia, where $\phi_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity -- showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis -- not absolute QA scores -- is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training -- richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) -- and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.

104. 【2606.09856】Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Link: https://arxiv.org/abs/2606.09856

Authors: Liyi Zhang, Akshay K. Jagadish, Brenden M. Lake, Thomas L. Griffiths

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: Large Language Models, Post-training Large Language, Large Language, reasoning typically focuses, Language Models

Comments: 20 pages, 5 figures

Abstract:Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.

105. 【2606.09854】Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

Link: https://arxiv.org/abs/2606.09854

Authors: Juergen Dietrich

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: identity-dependent scoring distortions, show identity-dependent scoring, Multi-agent large language, large language model, protect peer models

Comments: 24 pages, 3 figures

Abstract:Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs - raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches - LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model - on a five-class attribution task covering four commercial LLM families and an open-world 'unknown' class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements - robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p < 0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.
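The statement-disjoint idea in this abstract can be illustrated with a short sketch. It assumes every text carries the ID of the statement it analyses; the function name, the round-robin fold assignment, and the toy data are hypothetical and are not the paper's Section 3.5 definition.

```python
from collections import defaultdict


def statement_disjoint_folds(statement_ids, n_folds=5):
    """Yield (train_idx, val_idx) so that no statement appears on both sides.

    statement_ids[i] is the (hypothetical) ID of the statement that text i analyses.
    """
    by_statement = defaultdict(list)
    for idx, sid in enumerate(statement_ids):
        by_statement[sid].append(idx)

    # Assign whole statements to folds round-robin (an illustrative choice).
    fold_of = {sid: k % n_folds for k, sid in enumerate(sorted(by_statement))}

    for fold in range(n_folds):
        val = [i for sid, idxs in by_statement.items() if fold_of[sid] == fold for i in idxs]
        train = [i for sid, idxs in by_statement.items() if fold_of[sid] != fold for i in idxs]
        yield sorted(train), sorted(val)


# Toy data: 6 statements, each analysed 4 times (e.g. once per model family).
ids = [f"s{n}" for n in range(6) for _ in range(4)]
folds = list(statement_disjoint_folds(ids, n_folds=3))
```

Splitting by statement rather than by individual text is what prevents a classifier from scoring well merely by recognising content it has already seen.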

106. 【2606.09852】LLM-Based Code Documentation Generation and Multi-Judge Evaluation

Link: https://arxiv.org/abs/2606.09852

Authors: Ikbel Ghrab, Mohamed Dhieb, Ismail Khenissi, Ines Abdeljaoued-Tej

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)

Keywords: High-quality source code, High-quality source, source code documentation, Large Language Models, art Large Language

Comments: ICAHS, © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI-powered framework that automates documentation generation from code and repositories using eight state-of-the-art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context-aware documentation. To ensure quality and guide model selection, we introduced a MultiLLMasJudges evaluation framework, where four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library showed a 42% performance gap between top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety-critical healthcare software.

107. 【2606.09850】Mechanistic Analysis of Alignment Algorithms in Language Models

Link: https://arxiv.org/abs/2606.09850

Authors: Aarush Sinha, Ishan Garg, Veeraraju Elluru, Arth Singh, Kushal Garg

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: reshape language models', Post-training alignment algorithms, models' internal computations, language models' internal, black boxes

Comments: Work in Progress

Abstract:Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization methods: PPO, DPO, SimPO, ORPO, GRPO, and KTO across three open-weight model families. By integrating layer-wise linear probing, Sparse Autoencoders, and crosscoders, we localize preference representations and quantify alignment-induced geometric transformations in latent space. We find that preference signals consistently concentrate in early--mid or mid--late layers, but different objectives induce qualitatively distinct representational shifts. KTO and GRPO enhance linear separability through constructive feature sharing and sparse, high-salience recruitment. In contrast, DPO and ORPO degrade separability via non-constructive geometric rotation and feature attenuation, while PPO and SimPO largely preserve baseline geometry. These transformations exhibit architecture-dependent variability, demonstrating that behavioral alignment does not imply uniform internal restructuring. Our findings establish alignment as a heterogeneous intervention, motivate standardized feature-level auditing for safety and interpretability, and highlight the need for mechanism-aware optimization objectives.

108. 【2606.09846】CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

Link: https://arxiv.org/abs/2606.09846

Authors: Vignesh Nagarajan

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Visual art remains, remains largely inaccessible, art remains largely, Visual art, blind and low-vision

Comments: 22 pages, 16 figures, 3 tables, 21 references

Abstract:Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated workflow that generates multi-sensory art descriptions and synchronized audio narration using large language models and text-to-speech services. The system, orchestrated through Zapier, converts uploaded images into rich narrative captions without human intervention, enabling rapid, scalable production of accessible media. Quantitative evaluation across 50 artworks shows that AI-generated descriptions contain significantly higher lexical diversity, adjective density, and narrative detail than baseline captions, while maintaining comparable readability levels. Statistical tests (t-tests, ANOVA) confirm meaningful differences in richness and length, and the full pipeline produces text-plus-audio outputs in under 20 seconds per image at a cost below $0.05. Findings demonstrate that automated captioning can bridge gaps in museum and digital-collection accessibility, with implications for broader public engagement. Future work can incorporate user studies with BLV participants to assess comprehension, preference, and optimal levels of interpretive language.

109. 【2606.09843】An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

Link: https://arxiv.org/abs/2606.09843

Authors: Juan Manuel Contreras

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, predict observed behavior, Large language, produce stable self-reports, produce stable

Comments: none

Abstract:Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $\phi \geq .957$) and internal consistency (all $\alpha \geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.

110. 【2606.09830】Automated Scoring of Arabic Text Using Large Language Models: A Literature Review

Link: https://arxiv.org/abs/2606.09830

Authors: Khaoula Dahimi, Hadda Cherroun, Amel Belabbaci

Categories: Computation and Language (cs.CL)

Keywords: Automatic Text Scoring, modern educational systems, Automatic Text, plays a central, human intervention

Comments: Accepted at NCMAI 2026

Abstract:In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and Arabic-specific datasets has sparked renewed interest in this area. In this work, we investigate LLM-Based approaches for the automated evaluation of Arabic texts, focusing on both short answer grading (ASAG) and essay scoring (AES). We further introduce a structured taxonomy comprising five dimensions: application domain, feedback generation capability, LLM architecture deployed, alignment with competency referential frameworks, and prompt engineering strategy. By applying this taxonomy, we conduct a comparative analysis of existing studies, examining their methodological approaches, datasets, evaluation metrics, and reported performance. The findings highlight the need for sustained and pedagogically grounded research efforts in Arabic ATS, given its significance for improving educational quality across Arabic-speaking communities.

111. 【2606.09635】Gradient-Guided Reward Optimization for Inference-time Alignment

Link: https://arxiv.org/abs/2606.09635

Authors: Hankun Lin, Ruqi Zhang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Ensuring the reliability, reliability of Large, requires inference-time adaptation

Comments: Accepted to UAI 2026

Abstract:Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at this https URL.
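The entropy-monitoring step mentioned in this abstract can be sketched as follows. This is only an illustration of flagging high-uncertainty decoding steps; the threshold value and function names are hypothetical, and the gradient-guided generation of nudging tokens is not shown.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(step_probs, threshold):
    """Return decoding steps whose entropy exceeds `threshold` (a hypothetical knob)."""
    return [t for t, probs in enumerate(step_probs) if token_entropy(probs) > threshold]


# Toy next-token distributions over a 4-token vocabulary.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.20, 0.05, 0.05],  # moderately confident step
]
flagged = high_uncertainty_steps(steps, threshold=1.0)
```

Under these toy numbers only the uniform step crosses the threshold, so an intervention would be triggered at that position alone rather than at every token.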

112. 【2606.09553】OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

Link: https://arxiv.org/abs/2606.09553

Authors: David Guzmán, Luel Hagos Beyene, Jesujoba Oluwadara Alabi, Yejin Jeon, Dietrich Klakow, David Ifeoluwa Adelani

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: substantially improved synthetic, Recent advances, improved synthetic speech, gains remain unevenly, remain unevenly distributed

Comments: none

Abstract:Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

113. 【2606.06037】SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

Link: https://arxiv.org/abs/2606.06037

Authors: Virginia Ceccatelli, Yejin Jeon, David Ifeoluwa Adelani

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Large audio language, text-based harmful prompts, real-world applications, Large audio, increasingly deployed

Comments: none

Abstract:Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

114. 【2606.10781】Recovering the Zipfian Distribution in Unsupervised Term Discovery

Link: https://arxiv.org/abs/2606.10781

Authors: Danel Slabbert, Simon Malan, Herman Kamper

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: involves segmenting unlabelled, segmenting unlabelled speech, Unsupervised term discovery, discovery involves segmenting, Unsupervised term

Comments: none

Abstract:Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.

115. 【2606.10381】Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

Link: https://arxiv.org/abs/2606.10381

Authors: Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

Categories: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Instrumentation and Detectors (physics.ins-det)

Keywords: spans accelerator physics, research spans accelerator, relevant evidence scattered, agentic hybrid RAG, scientific question answering

Comments: 22 pages, 5 figures, and 6 tables

Abstract:Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.
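A hybrid retriever of the kind described here has to merge a sparse lexical ranking with a dense semantic one. The abstract does not say how the two are fused, so the sketch below uses reciprocal rank fusion (RRF), a common generic choice, purely as an assumption; the document IDs and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse = ["d1", "d2", "d3"]  # hypothetical lexical (BM25-style) ranking
dense = ["d3", "d1", "d4"]   # hypothetical dense-embedding ranking
fused = reciprocal_rank_fusion([sparse, dense])
```

Documents ranked well by both retrievers rise to the top, while documents found by only one of them are still retained further down the fused list.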

Information Retrieval

1. 【2606.11023】Generative Archetype-Grounded Item Representations for Sequential Recommendation

Link: https://arxiv.org/abs/2606.11023

Authors: Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen, Yankai Chen, Wenhao Yu, Jianting Chen, Irwin King

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: aims to predict, predict users', analyzing their historical, Sequential recommendation aims, Archetype-grounded Item Representations

Comments: Accepted by WWW 2026 (Oral)

Abstract:Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models (LLMs) can provide rich semantic representations, existing approaches only rely on static encoding of fixed attributes, overlooking the crucial role of target audiences in defining item identity. Moreover, the semantic space struggles to reflect actual user behavior, resulting in a significant gap between semantic representations and behavioral patterns. To address these limitations, we propose GenAIR, a general framework that empowers sequential recommendation with Generative Archetype-grounded Item Representations. Specifically, we first leverage an LLM to analyze item metadata and infer textual description of the Archetype, which represents the conceptual profile of the item's ideal target audience. We then extract the corresponding embeddings in a single forward pass. Further, to ground these generative archetypes in real-world behavior, we introduce a behavioral calibration objective, which explicitly incorporates behavioral signals from actual interactions. This objective adjusts the structure of the embedding space to reflect empirical patterns. GenAIR enables seamless integration with most existing models while maintaining high efficiency. Comprehensive experiments conducted on three real-world datasets demonstrate that GenAIR significantly improves the performance of various sequential recommendation models and consistently outperforms state-of-the-art baseline approaches. Implementation codes are available at this https URL.

2. 【2606.10907】From Prompt to Purchase: How AI Brand Recommendations Move Consumers on the Open Web

Link: https://arxiv.org/abs/2606.10907

Authors: Michael Iannelli, Alan Ai

Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)

Keywords: Google search rises, same-name Google search, user same-name Google, recent observed engagement, matched backward placebos

Comments: 10 pages, 4 figures, 9 tables

Abstract:When a conversational assistant recommends a brand to a user with no recent observed engagement, that user's same-name Google search rises +4.3 percentage points (pp) [3.1, 5.5], visits to the brand's own site +2.4 pp [1.4, 3.5], and brand-specific retailer-page visits +1.0 pp [0.3, 1.7] over matched backward placebos. Recovering that estimate is the work. The mention creates a brand exposure no web log attributes to the assistant, and the naive all-mention funnel that seems to measure it is confounded: many mentions are incidental references to brands the user already uses ("your Netflix download"), whose downstream visits are that existing customer's own behavior and surface as a brand-specific pre-trend. We measure off-platform response on a panel that joins opt-in clickstream to the same users' ChatGPT, Claude, and Gemini conversations, and isolate the effect with a pre-trend event study, a stance classifier, non-customer conditioning, and a within-response same-category control: incidental name-drops then move behavior far less (+1.8/+1.1/+0.3), and the named brand moves far more than unnamed same-category brands in the same response. The downstream path is mostly search-mediated and reaches both own sites and retailer pages, with a destination mix that tracks baseline brand-directed behavior rather than redirecting toward either. The design is observational and we do not observe transactions, so retail is purchase-adjacent. Standard referrer-based and last-click measurement miss this upstream exposure: assistants move observably-unengaged users into open-web brand navigation along a path attributed elsewhere.

3. 【2606.10896】Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Link: https://arxiv.org/abs/2606.10896

Authors: Gal Bloch, Ariel Gera, Matan Orbach, Ohad Eytan, Assaf Toledo

Categories: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR); Performance (cs.PF)

Keywords: Gaussian Mixture Models, Mixture Models, single GPU pass, Gaussian Mixture, fused Triton kernel

Comments: none

Abstract:We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20$\times$ speedup over existing implementations and enables training on datasets more than 100$\times$ larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for $k$-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to $1.7\times$ fewer distance computations, or equivalently, yields $+2$--$12$ recall@10 at matched computational cost. We release the kernel as an open-source project.
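The idea of using GMM responsibilities to place border vectors in more than one cluster can be sketched in plain Python. This is not the Triton kernel: it assumes isotropic Gaussians, processes one point at a time so that only a single row of responsibilities is held in memory, and uses a hypothetical cut-off `min_resp`.

```python
import math


def responsibilities(x, means, variances, weights):
    """Posterior p(cluster | x) for an isotropic GMM, via a stable log-sum-exp."""
    d = len(x)
    logs = []
    for mu, var, w in zip(means, variances, weights):
        sq = sum((a - b) ** 2 for a, b in zip(x, mu))
        logs.append(math.log(w) - 0.5 * d * math.log(2 * math.pi * var) - 0.5 * sq / var)
    m = max(logs)
    norm = m + math.log(sum(math.exp(l - m) for l in logs))
    return [math.exp(l - norm) for l in logs]


def soft_assign(points, means, variances, weights, min_resp=0.2):
    """Stream over points; list each point under every cluster above `min_resp`.

    Only one row of responsibilities exists at a time, never the full N x K matrix.
    """
    lists = [[] for _ in means]
    for i, x in enumerate(points):
        for k, r in enumerate(responsibilities(x, means, variances, weights)):
            if r >= min_resp:
                lists[k].append(i)
    return lists


means = [(0.0, 0.0), (4.0, 0.0)]
points = [(0.1, 0.0), (2.0, 0.0), (3.9, 0.1)]  # the middle point sits on the border
lists = soft_assign(points, means, variances=[1.0, 1.0], weights=[0.5, 0.5])
```

In this toy setup the middle point is equidistant from both centres, so it lands in both inverted lists, while the other two points are assigned to a single cluster each.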

4. 【2606.10842】ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

Link: https://arxiv.org/abs/2606.10842

Authors: Taiheng Pan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: opt-in token-evidence reranker, describe ConvMemory, lightweight ConvMemory, candidate set, token-evidence reranker

Comments: 19 pages, 3 figures. Single-author technical report. Extends [arXiv:2605.28062](https://arxiv.org/abs/2605.28062) (ConvMemory v1). Code and checkpoint: [this http URL](http://github.com/pth2002/ConvMemory)

Abstract:We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, n = 4955 test rows), v2 raises FULL MRR from v1's 0.5824 to 0.6560 (paired bootstrap +0.0734, 95% CI [+0.0645, +0.0827]) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai_top500, but on two raw-dense-hard slices (where v1's protected top-10 has higher recall than mxbai's own top-10) v2 exceeds mxbai_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report (arXiv:2605.28062).
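The "identical by construction" claim follows directly from reordering only within a fixed top-10 set. A minimal sketch, with a toy scoring function standing in for the cross-encoder and hypothetical memory IDs:

```python
def rerank_top_k(first_stage, score, k=10):
    """Reorder only the first k items of `first_stage`; the tail is left untouched."""
    head, tail = first_stage[:k], first_stage[k:]
    return sorted(head, key=score, reverse=True) + tail


def hit_at_k(ranking, relevant, k=10):
    """True if any relevant item appears in the top k."""
    return any(doc in relevant for doc in ranking[:k])


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is present."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


v1 = [f"m{i}" for i in range(15)]  # hypothetical first-stage ranking
relevant = {"m7"}
toy_score = lambda doc: 1.0 if doc == "m7" else 0.0  # stand-in for a cross-encoder
v2 = rerank_top_k(v1, toy_score)
```

Because the set of ten returned items never changes, Hit@10 and Recall@10 cannot move; only rank-sensitive metrics such as MRR and H@1 can improve or degrade.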

5. 【2606.10759】miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

Link: https://arxiv.org/abs/2606.10759

Authors: Yingqi Fan, Xuan Lu, Anhao Zhao, Junlong Tong, Ping Nie, Kai Zou, Yunpu Ma, Wei Zhang, Xiaoyu Shen

Categories: Information Retrieval (cs.IR)

Keywords: Multimodal large language, recently shown strong, shown strong potential, directly modeling query, Multimodal large

Comments: none

Abstract:Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query--document relevance through next-token prediction. However, point-wise reranking suffers from substantial repeated computation across query--document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. To address the misalignment of existing query-first and document-first formats with both VQA-style prompting and computation-aware reuse, we propose a vision-first formulation that improves both cache reuse efficiency and reranking performance. However, the remaining cost is still considerable and stems from three main sources: (1) model depth, for which we reduce active parameters via early exit; (2) cross-segment attention, which we restrict to a narrow interaction band across a few layers; and (3) visual tokens, where we reduce the number of tokens via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime to 1% of the dense implementation under high-reuse settings for a single query, while preserving 96% of the dense model performance.

6. 【2606.10709】Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training

Link: https://arxiv.org/abs/2606.10709

Authors: João Coelho, João Magalhães, Bruno Martins, Chenyan Xiong

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: LLM search agents, training LLM search, LLM search, outcome-only rewards, standard strategy

Comments: none

Abstract:The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-incorrect (too-hard) groups are zero-variance and waste rollout cost. Existing approaches treat zero-variance as a static property and either discard or pre-filter such groups. We hypothesize and empirically validate that queries flip between zero-variance and signal-bearing states as the policy evolves during training. Building on this intuition, we propose query recycling, which returns zero-variance groups to a mutable pool for future resampling, so that the effective training distribution co-evolves with the policy. With the proposed technique, a 1.7B parameter model trained on synthetic data can reach 66.0 average Pass@1 across seven multi-hop QA benchmarks, matching or surpassing systems with up to 7B parameters trained on benchmark-derived supervision. Analysis of recycling patterns shows that recycled queries supply roughly three quarters of the effective batch by the end of training, with contributions split between recovery from policy improvement and policy drift.
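Why zero-variance groups carry no signal, and what recycling them might look like, can be sketched as below. The advantage is simplified to reward minus group mean (no standard-deviation normalisation), and the pool handling, function names, and toy rollouts are hypothetical rather than the paper's procedure.

```python
import random


def group_advantages(rewards):
    """Simplified group-relative advantage: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def is_zero_variance(rewards):
    """All rollouts in the group got the same outcome reward."""
    return len(set(rewards)) == 1


def training_step(pool, rollout, batch_size, rng):
    """Sample queries, keep signal-bearing groups, recycle zero-variance ones."""
    batch = rng.sample(pool, min(batch_size, len(pool)))
    for q in batch:
        pool.remove(q)
    usable = []
    for q in batch:
        rewards = rollout(q)
        if is_zero_variance(rewards):
            pool.append(q)  # recycle: the query may become informative later
        else:
            usable.append((q, group_advantages(rewards)))
    return usable


def toy_rollout(query):
    # Hypothetical binary outcome rewards for a group of 4 rollouts per query.
    return {"easy": [1, 1, 1, 1], "hard": [0, 0, 0, 0], "mixed": [1, 0, 0, 1]}[query]


pool = ["easy", "hard", "mixed"]
usable = training_step(pool, toy_rollout, batch_size=3, rng=random.Random(0))
```

For the all-correct and all-incorrect groups every advantage would be exactly zero, so they contribute nothing to the update; returning them to the pool lets them be resampled once the policy has changed.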

7. 【2606.10697】Beyond Patches: Superpixel Token-based Transformers for Attribute-Specific Fashion Retrieval

Link: https://arxiv.org/abs/2606.10697

Authors: Shuili Zhang, Hongzhang Mu, Wenyuan Zhang, Duohe Ma, Tingwen Liu

Categories: Information Retrieval (cs.IR)

Keywords: Attribute-Specific Fashion Retrieval, Attribute-Specific Fashion, improve fine-grained image, Fashion Retrieval, aims to improve

Comments: 9 pages, 5 figures. Published in the Proceedings of the ACM Web Conference 2026 (WWW '26). Author version with minor corrections; results and conclusions unchanged

Abstract:Attribute-Specific Fashion Retrieval (ASFR) aims to improve fine-grained image retrieval by focusing on specific attributes. However, existing patch-based attention and Transformer methods often misalign with irregular attribute regions and are prone to background noise, limiting their ability to capture subtle, pixel-level microstructures. To tackle these challenges, we propose SuperFashion, the first ASFR framework that adopts superpixel tokens within a Transformer architecture. SuperFashion initially employs an attribute-guided attention mechanism to extract attribute-related features, which in turn guide the cropping of semantically meaningful image regions. Superpixel segmentation is then leveraged on these regions to generate compact, semantically coherent superpixel tokens. By incorporating modality-specific embeddings for both attribute and superpixel tokens, the superpixel token-based Transformer facilitates adaptive interaction and fusion, thereby enhancing attribute localization and discrimination. Extensive experiments on FashionAI, DARN, and DeepFashion demonstrate relative overall MAP improvements of 1.84%, 9.27%, and 9.35% over prior SOTA. SuperFashion offers a new solution for web-based image retrieval.

8. 【2606.10621】STORM: Stepwise Token Optimization with Reward-Guided Beam Search

Link: https://arxiv.org/abs/2606.10621

Authors: Arthur Satouf, Giulio D'Erasmo, Yuxuan Zong, Habiboulaye Amadou Boubacar, Pablo Piantanida, Benjamin Piwowarski

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Modern retrieval increasingly, retrieval increasingly relies, Modern retrieval, increasingly relies, effective but require

Comments:

Abstract: Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.

9. 【2606.10398】Selection, Not Salience: The Shape and Limits of Personalization in Social Highlighting

Link: https://arxiv.org/abs/2606.10398

Authors: Kazuki Nakayashiki, Keisuke Watanabe

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)

Keywords: selection signal, document, person history identifies, reader sees pay, co-readership identity control

Comments: 9 pages, 1 figure, 3 tables

Abstract:Does personalizing what a reader sees pay off, and where does it stop? Using a social web highlighter and a co-readership identity control (the same document highlighted by many users, which holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does), we map the shape and limits of personalization across reading altitudes. At the document altitude we give the clean, leakage-free, identity-controlled measurement that prior next-document evaluations could only upper-bound: a person's history identifies which documents in a co-reading neighborhood are theirs, with an own-versus-other gap of +0.169 against community negatives and +0.119 against topic-matched hard negatives (both highly significant); a content-based arm suggests the signal is not purely title-driven but is largely thematic. This is comparable to the span-level selection signal (+0.14) from our prior work: the selection signal is of comparable magnitude across altitudes (+0.12 to +0.17), most of it stable topic preference. At the sentence altitude, a two-stage personalized auto-highlight (an impersonal model proposes candidates, a personal model re-ranks them) does not improve on its impersonal baseline: two off-the-shelf zero-shot LLMs, including a frontier model, predict highlight locations worse than a lead baseline, and personal re-ranking is beaten by the salience order even on the highest-recall candidate pool, so the null is not merely a Stage-1 ceiling artifact. Measurable personalization appears primarily at the selection layer: modest (~+0.13), topic-dominated, with no reliable gain at the salience layer. We also surface a control-in-negatives bias that inflated our document gap to a spurious +0.227 until audited. Going beyond the shared salience layer may be better approached by aggregating individuals than by personalizing them harder.

10. 【2606.10388】SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

Link: https://arxiv.org/abs/2606.10388

Authors: Jiandong Ding

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Agent skill libraries, routable software assets, Agent skill, software assets, contribute instructions

Comments: Preprint

Abstract:Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.
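The abstract describes HSR@K as the top-K exposure of the risky sibling, reported next to helpful ranking. A minimal sketch of how the two numbers could be computed side by side is shown below; the function names, the dictionary layout, and the toy rankings are hypothetical illustrations, not the benchmark's released code.

```python
def hsr_at_k(rankings, risky, k):
    """Fraction of queries whose risky sibling appears in the top-k list."""
    hits = sum(1 for q, ranked in rankings.items() if risky[q] in ranked[:k])
    return hits / len(rankings)


def recall_at_k(rankings, helpful, k):
    """Fraction of queries whose helpful skill appears in the top-k list."""
    hits = sum(1 for q, ranked in rankings.items() if helpful[q] in ranked[:k])
    return hits / len(rankings)


# Hypothetical ranked skill lists for two queries.
rankings = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["x", "y", "z", "w"],
}
helpful = {"q1": "a", "q2": "w"}  # one helpful skill per query
risky = {"q1": "d", "q2": "y"}    # one query-specific risky sibling per query

hsr = hsr_at_k(rankings, risky, k=3)
recall = recall_at_k(rankings, helpful, k=3)
```

Under this reading, a retriever can score well on recall while still exposing the risky sibling, which is why the two metrics are reported together.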

11. 【2606.10375】SIDInspector: A Mapping-First Diagnostic Resource for Semantic-ID Tokenizers

Link: https://arxiv.org/abs/2606.10375

Authors: Jiandong Ding, Heng Chang, Huijie Qin, Tianying Liu

Categories: Information Retrieval (cs.IR)

Keywords: generative recommendation, increasingly reused, reused as standalone, address space, sid tokenizer artifacts

Comments: Submitted to CIKM 2026 Resource Track

Abstract: Semantic-ID (SID) tokenizers are increasingly reused as standalone artifacts in generative recommendation: an exported item-to-code mapping becomes the address space that a later sequence generator must use. These mappings rarely come with a common inspection interface, so coverage gaps, full-code aliasing, behaviorally weak prefixes, tail compression, and prefix fan-out are often found only after downstream training. We present SIDInspector, a mapping-first diagnostic resource for SID tokenizer artifacts. SIDInspector defines a small adapter contract over item mappings, metadata, interactions, and optional generator traces; validates the contract; and reports mapping-level probes for utilization, aliasing, neighborhood alignment, popularity allocation, and structural cost, with hooks for temporal churn and generator traces. SIDInspector reports inspectable artifact profiles before downstream leaderboard scores. The released resource covers four tokenizer artifact lines: a same-item GRID/RQ-KMeans-style and ReSID/GAOQ contrast on 23,742 Musical items, plus released LETTER and LC-Rec item-index artifacts. In the Musical contrast, the GRID-style feature-text export has 3,749 unique full codes and a 0.977 full-code aliasing rate, while ReSID/GAOQ is aliasing-free in its exported mapping. Yet the strongest prefix–co-occurrence alignment comes from a deterministic category-prefix control, not from either learned export row (0.447 versus 0.154 and 0.055–0.080), showing that addressability and behaviorally meaningful prefixes should be inspected separately. Cross-domain, fixed-reranker, and mechanism-probe checks support the same diagnostic direction: prefix alignment is a candidate-exposure signal, while final ranking quality remains a downstream model question.
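The abstract does not spell out how the full-code aliasing rate is computed. Since 1 − 3,749/23,742 ≈ 0.84, the reported 0.977 is presumably not a simple unique-code ratio. One plausible item-level reading, assumed here purely for illustration, is the share of items whose full code is also assigned to at least one other item. The function name and the toy mapping below are hypothetical.

```python
from collections import Counter


def aliasing_stats(item_to_code):
    """Return (number of unique full codes, share of items with a shared code).

    Assumed definition: an item is aliased when at least one other item
    maps to exactly the same full code.
    """
    counts = Counter(item_to_code.values())
    aliased_items = sum(n for n in counts.values() if n > 1)
    return len(counts), aliased_items / len(item_to_code)


# Hypothetical exported item-to-code mapping with three-level codes.
mapping = {
    "item1": (3, 1, 4),
    "item2": (3, 1, 4),  # collides with item1 on the full code
    "item3": (2, 7, 1),
    "item4": (3, 1, 5),  # shares a prefix with item1 but not the full code
}

unique_codes, aliasing_rate = aliasing_stats(mapping)
```

A probe like this only inspects the exported mapping, which matches the mapping-first framing: it can run before any generator is trained.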

12. 【2606.10357】Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations

Link: https://arxiv.org/abs/2606.10357

Authors: Zhuohang Jiang, Yuxin Chen, Shijie Wang, Haohao Qu, Zhou Jindong, Wenqi Fan, Li Qing, Dongxu Liang, Jun Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Cross-domain recommendation, recommendation, Cross-domain, Atomic Intent Reasoning, core problem

Comments:

Abstract: Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with content to infer potential purchasing intent on the e-commerce side, thereby enhancing conversion rates and commercial value. However, in real industrial scenarios, cross-domain recommendation faces multiple challenges: significant semantic gaps exist between different domains, and user cross-domain behavior sequences are often massive in scale and rich in noise. Although large language models (LLMs) possess powerful semantic understanding and reasoning capabilities, their millisecond-level inference latency makes direct application in online recommendation systems difficult. To address these issues, this paper introduces AIR (Atomic Intent Reasoning), an LLM-driven cross-domain recommendation framework designed for industrial-grade deployment. By migrating LLM inference to the offline phase and dynamically constructing user intent representations through efficient retrieval and composition during online operations, it achieves approximately 400x inference acceleration while maintaining semantic consistency. Experimental results across multiple public datasets demonstrate that our method achieves state-of-the-art performance in cross-domain recommendation tasks. Furthermore, large-scale online A/B testing conducted in Kuaishou E-commerce's real-world business scenarios shows that our approach delivers stable and significant improvements across multiple core business metrics, including a +3.446% increase in GMV, fully validating its effectiveness and practical value in industrial-scale recommendation systems.

13. 【2606.10156】$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Link: https://arxiv.org/abs/2606.10156

Authors: Bharath Sivaram Narasimhan, Karthik R Narasimhan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: multi-turn conversational interfaces, recommender systems transition, paradigms have struggled, agentic recommender systems, recommender systems

Comments:

Abstract:As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $\tau$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $\tau$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at this https URL.
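The abstract reports pass^1 and pass^4 without defining the metric. The sketch below assumes the estimator commonly used for pass^k in earlier τ-style agent benchmarks: for a task with n independent trials and c successes, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. Whether $\tau$-Rec uses exactly this estimator is an assumption, and the function names and toy results are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the probability that k sampled trials of one task all succeed."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: three tasks, four trials each.
results = [(4, 4), (4, 2), (4, 0)]

p1 = benchmark_pass_hat_k(results, k=1)  # plain success rate
p4 = benchmark_pass_hat_k(results, k=4)  # only fully consistent tasks count
```

Because only tasks solved in every sampled trial contribute at larger k, the metric can only stay flat or fall as k grows, which is the kind of reliability drop the abstract describes.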

14. 【2606.10120】MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention

Link: https://arxiv.org/abs/2606.10120

Authors: Asiful Arefeen, Carol Johnston, Hassan Ghasemzadeh

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: key risk factor, metabolic disorders, existing dietary guidance, key risk, risk factor

Comments:

Abstract: Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from $25$ individuals to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition by modifying macronutrient amounts to maintain glucose levels within a target range ($\leq 140$ mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.

15. 【2606.10078】Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems

Link: https://arxiv.org/abs/2606.10078

Authors: Yaochen Zhu, Harald Steck, James McInerney, Aditya Sinha, Yinhan He, Nathan Kallus, Jundong Li

Categories: Information Retrieval (cs.IR)

Keywords: large language models, effective alignment strategy, Direct preference optimization, simple and effective, strategy for large

Comments:

Abstract:Direct preference optimization (DPO) is a simple and effective alignment strategy for large language models (LLMs) based on pairwise preferences. In recommender systems, however, user feedback is rarely pairwise. For a given context, e.g., a user, a session, or a conversation, we typically observe set-wise preferences with multiple positive items, where every positive item should outrank every unobserved or explicitly negative item, with no prescribed order among the positives or the negatives themselves. A natural generalization is to use the Plackett-Luce (PL) reward model, which extends the Bradley-Terry reward model underlying vanilla DPO from pairwise preferences to full rankings of candidates. However, we show that adapting the PL model to set-wise preferences requires marginalizing over all positive orderings, where the resulting expression is combinatorial in complexity. To address this fundamental challenge, we propose Mult-DPO, a novel DPO objective with a tractable multinomial surrogate likelihood over set-wise preference events for the user-preference alignment of LLM-based recommender systems. The multinomial construction is not itself a ranking distribution, but it is defined on the same reward-induced weight space and admits a closed-form DPO-style objective, enabling direct alignment of LLMs with multiple candidates through a classification-style objective. In addition, we prove that the multinomial DPO loss is a tractable upper bound on the marginalized PL DPO loss when optimizing against the set-wise preference data. We further characterize the tightness of this bound in terms of the relative total weight of positives versus negatives, which provides insights into tightening the bound with richer or harder negatives. Finally, we extend Mult-DPO to the alignment of LLMs with multiple preference levels. Code is available at this https URL

16. 【2606.10053】Stability in Competitive Search with Results Diversification

Link: https://arxiv.org/abs/2606.10053

Authors: Itamar Reinman, Omer Madmon, Moshe Tennenholtz, Oren Kurland

Categories: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)

Keywords: competitive search setting, publishers strategically modify, competitive search, search setting, strategically modify

Comments: Accepted to ICTIR 2026

Abstract: In a competitive search setting, publishers strategically modify their documents in response to induced rankings so as to improve their future ranking. We present a novel game-theoretic analysis of a competitive search setting where search-results diversification is applied. Our analysis reveals an inherent tradeoff between corpus diversity and corpus stability, where the latter corresponds to an equilibrium in a game. We analyze two representative diversification methods and show that stability is not necessarily reached, leaving the corpus subject to rapid changes due to ranking-incentivized modifications by publishers. We then present a novel approach to devise diversification-based ranking functions that are guaranteed to lead to corpus stability.

17. 【2606.09900】Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Link: https://arxiv.org/abs/2606.09900

Authors: Liuyin Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Long-term memory, sessions they forget, common workaround, distractors accumulate, LLM agents

Comments: 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: [this https URL](https://github.com/ly-wang19/engram)

Abstract: Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
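The "invalidate, never delete" rule and the point-in-time filter can be illustrated with a small fact store. This is a deliberately simplified sketch, not Engram's implementation: it collapses the two time axes of a bi-temporal model into a single integer timeline, and the `Fact` and `FactStore` names, their fields, and the example facts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int                       # hypothetical integer timestamp
    invalidated_at: Optional[int] = None  # set when a newer fact contradicts it
    superseded_by: Optional["Fact"] = None


class FactStore:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, predicate, obj, at):
        """Add a fact; contradicted facts are invalidated but kept."""
        new = Fact(subject, predicate, obj, valid_from=at)
        for old in self.facts:
            same_slot = (old.subject, old.predicate) == (subject, predicate)
            if same_slot and old.invalidated_at is None and old.obj != obj:
                old.invalidated_at = at
                old.superseded_by = new  # supersession chain for provenance
        self.facts.append(new)
        return new

    def as_of(self, subject, predicate, at):
        """Return the objects that were current at time `at`."""
        return [
            f.obj
            for f in self.facts
            if f.subject == subject
            and f.predicate == predicate
            and f.valid_from <= at
            and (f.invalidated_at is None or at < f.invalidated_at)
        ]


store = FactStore()
store.assert_fact("alice", "lives_in", "Paris", at=1)
store.assert_fact("alice", "lives_in", "Berlin", at=5)

before = store.as_of("alice", "lives_in", at=3)
after = store.as_of("alice", "lives_in", at=7)
```

Both facts remain in the store after the contradiction, so a query for an earlier point in time still resolves to the older value.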

18. 【2606.09891】Representation Curriculum: Stagewise Training for Robust Ranking and Allocation

Link: https://arxiv.org/abs/2606.09891

Authors: Ehsan Ebrahimzadeh, Sina Baharlouei, Abraham Bagherjeiran

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: displayed items shape, dynamic exposure-allocation mechanism, future allocation policies, items shape discovery, shape discovery trajectories

Comments: 12 pages, 5 figures

Abstract:Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g. popularity estimates, CTR/CVR aggregates, and ID-based representation), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.

19. 【2606.09865】LLM-as-a-Discriminator: When Synthetic Tables Still Look Real

Link: https://arxiv.org/abs/2606.09865

Authors: Manel Slokom, Malek Slokom, Thierno Kante

Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: data, Privacy, data sharing, synthetic, LLM

Comments:

Abstract:Privacy and data sharing are often in tension. Many organizations use synthetic data to reduce privacy risk and still share useful data. For tabular data, auditing privacy remains hard. In many cases, even humans cannot easily tell if a table is real or synthetic. In this paper, we propose a method based on LLM discrimination. We ask an LLM to classify each table sample as REAL or SYNTHETIC. We test two settings: C1 with table only, and C2 with table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In our experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. We collect 451 valid trials. Our results show clear differences between models. On Adult, LLaMA reaches DRS=0% in reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare with a classifier two-sample test (C2ST) and record linkage as distributional baselines, and with a human pilot of 2 annotators and 240 trials. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at this https URL.

20. 【2606.10381】Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

Link: https://arxiv.org/abs/2606.10381

Authors: Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

Categories: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Instrumentation and Detectors (physics.ins-det)

Keywords: spans accelerator physics, research spans accelerator, relevant evidence scattered, agentic hybrid RAG, scientific question answering

Comments: 22 pages, 5 figures, and 6 tables

Abstract:Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.

Computer Vision

1. 【2606.11188】ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Link: https://arxiv.org/abs/2606.11188

Authors: Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: next-token prediction framework, paper introduces ARM, unifies image understanding, discrete representation-based AutoRegressive, representation-based AutoRegressive Model

Comments: technical report

Abstract:This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: this https URL.

2. 【2606.11187】Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Link: https://arxiv.org/abs/2606.11187

Authors: Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: World Action Models, World Action, Autoregressive video generation, MCP modules, Action Models

Comments: Project page: [this https URL](https://gangweix.github.io/next-forcing/)

Abstract:Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.

3. 【2606.11186】AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

Link: https://arxiv.org/abs/2606.11186

Authors: Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu, Wenqi Shao, Ying Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-light video enhancement, challenging task due, severe information degradation, Low-light video, remains a challenging

Comments: Accepted at ICML 2026; Project page and code: [this https URL](https://lhfgghc.github.io/LLVE-AMNet)

Abstract: Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, we propose AMNet, a unified multimodal framework for LLVE that supports flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet can handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality-absence conditions. Code and models are available on the project page.

4. 【2606.11180】Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Link: https://arxiv.org/abs/2606.11180

Authors: Paul Hyunbin Cho (1), Jinhyuk Jang (1), SeokYoung Lee (1), Joungbin Lee (1), Siyoon Jin (1), Heeseong Shin (1), Jung Yi (1), Yunjin Park (2), Chulmin Park (2), Seungryong Kim (1) ((1) KAIST AI, (2) AIPARK)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion-based lip synchronization, achieve strong visual, strong visual quality, Diffusion-based lip, full-sequence bidirectional attention

Notes: Project Page: [this https URL](https://cvlab-kaist.github.io/LipForcing/)

Abstract:Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, $17.6\times$ faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs $39.8\times$ faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.

5. 【2606.11176】Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Link: https://arxiv.org/abs/2606.11176

Authors: Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu, Philip Torr, James Zou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: turn raw information, data journalist job, shape society, non-experts can trust, turn raw

Notes: Project page: [this https URL](https://data2story.github.io) Github: [this https URL](https://github.com/QinghongLin/data2story-skill)

Abstract:Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at this https URL.

6. 【2606.11155】Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models

Link: https://arxiv.org/abs/2606.11155

Authors: An Zhao, Shengyuan Zhang, Zhongjian Sun, Yixiang Zhou, Zejian Li, Ling Yang, Tianrun Chen, Lingyun Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong performance, Flow Matching models, Flow Matching, demonstrated strong, wide range

Abstract:Flow Matching models have demonstrated strong performance across a wide range of generative tasks. However, their reliance on ODE-based iterative sampling incurs substantial computational overhead in inference, which limits their applicability in real-time scenes. While distillation is a promising solution, existing approaches largely borrow from diffusion-based score matching, often failing to exploit the intrinsic geometric structure of flows and suffering from training instability, high variance, and degraded generation quality. In this paper, we propose Mean Flow Distillation (MFD), a novel distillation framework tailored for flow matching models. We theoretically demonstrate that MFD acts as a temporal low-pass filter, effectively suppressing the high-frequency optimization noise inherent in variational score distillation (VSD) while ensuring global trajectory consistency. We further prove the Mean Flow Matching Theorem, establishing that matching expected average velocities is sufficient for strict distribution alignment. Empirically, on challenging tasks of high-dimensional manifolds including 4D occupancy forecasting and text-to-image generation, MFD achieves state-of-the-art performance, enabling high-fidelity single-step generation.

7. 【2606.11152】P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

Link: https://arxiv.org/abs/2606.11152

Authors: Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou, Jingxi Xu, Feihu Zhang, Jiaheng Liu, Yao Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, produce complex programs, large language models, world knowledge

Notes: Project page: [this https URL](https://lucasqaq.github.io/p3d/)

Abstract:Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our extensive evaluation yields three findings. First, assemblies are the hardest setting, where models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.

8. 【2606.11148】MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

Link: https://arxiv.org/abs/2606.11148

Authors: Xiaoyu Han, Chenyang Wang, Jing Wang, Shunyuan Zheng, Quanling Meng, Shengping Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Virtual try-on, in-shop clothing image, Virtual try-on aims, virtual try-on method, aims to fit

Notes: Accepted to CVPR 2026 (Highlight)

Abstract:Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.

9. 【2606.11131】UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors

Link: https://arxiv.org/abs/2606.11131

Authors: Zhiwen Yang, Yang Zhou, Haowei Chen, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: PET image denoising, PET image, universal PET image, learning-based PET image, deep learning-based PET

Abstract: Most existing deep learning-based PET image denoising methods assume a fixed and known dose reduction factor (DRF) for low-dose PET images. However, these methods encounter significant performance degradation when the DRF varies beyond the assumed one in practical applications. To address the challenge posed by varied DRFs, several preliminary studies focus on the task of universal PET image denoising, aiming to train a universal model over low-dose data across DRFs. Nonetheless, these vanilla universal models often struggle with misaligned styles present in different DRF data, leading to the "style elimination issue" with a significant over-smoothing effect. To deal with this issue, we innovatively introduce domain generalization to PET image denoising and propose a universal PET image denoising network (UniPET) to achieve high-quality PET image denoising across diverse DRFs. UniPET comprises two primary innovations: a style alignment network (SAN) and a region-aware learning strategy (RALS). Specifically, SAN utilizes style alignment techniques derived from domain generalization to align and recover styles across different DRFs, ensuring the model's generalizability across various DRFs while effectively preserving styles. Furthermore, to enhance style recovery, RALS distinguishes between flat and stylized regions, exclusively conducting adversarial learning on the latter, thereby more effectively guiding the model's focus towards learning stylized regions. It is demonstrated that our proposed UniPET can adaptively recover different DRF styles and achieve high-quality PET image denoising across DRFs. Comprehensive experiments show that UniPET exhibits comparable performance to individual DRF-specific models at specific DRFs and realizes state-of-the-art performance in universal PET image denoising quantitatively, perceptually, and clinically.

10. 【2606.11129】WorldOlympiad: Can Your World Model Survive a Triathlon?

Link: https://arxiv.org/abs/2606.11129

Authors: Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diagnosing video-based world, diagnosing video-based, generated videos, video-based world models, physical faithfulness

Notes: Project Page: [this https URL](https://alibaba-damo-academy.github.io/WorldOlympiad/), Code: [this https URL](https://github.com/alibaba-damo-academy/WorldOlympiad)

Abstract:We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

11. 【2606.11120】Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Link: https://arxiv.org/abs/2606.11120

Authors: Andrew Kang, Priya Narasimhan

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Carlo Tree Search, Monte Carlo Tree, Carlo Pass Search, Monte Carlo Pass, introduce Monte Carlo

Notes: CVPR 2026, CVSports Workshop

Abstract:We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.
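
The sample-rollout-score loop described in the abstract can be sketched in a few lines. This is a toy illustration only: `rollout_value` is a hypothetical stand-in for the paper's world model and value model, the noise scales are made up, and the two score definitions (difference from the variant mean, and fraction of variants matched or beaten) are assumptions about what "mean-based" and "percentile-based" could look like, not the authors' formulas.

```python
import random
import statistics

def rollout_value(speed, angle, rng):
    """Hypothetical stand-in for world-model rollout plus value model.

    Returns a toy possession value that peaks at speed 15 m/s and angle 0 rad,
    with a little rollout noise.
    """
    return 1.0 - 0.01 * (speed - 15.0) ** 2 - 0.5 * angle ** 2 + rng.gauss(0.0, 0.02)

def execution_surplus(observed, n_samples=200, seed=0):
    """Compare an observed pass with noisy execution variants of the same pass."""
    rng = random.Random(seed)
    speed, angle = observed
    observed_value = rollout_value(speed, angle, rng)
    # Sample execution variants by perturbing the inferred kick parameters.
    variant_values = [
        rollout_value(speed + rng.gauss(0.0, 1.0), angle + rng.gauss(0.0, 0.1), rng)
        for _ in range(n_samples)
    ]
    # Mean-based score: value gained relative to the average variant.
    mean_score = observed_value - statistics.mean(variant_values)
    # Percentile-based score: fraction of variants the observed pass matched or beat.
    percentile_score = sum(v <= observed_value for v in variant_values) / n_samples
    return mean_score, percentile_score

mean_score, percentile_score = execution_surplus((15.0, 0.0))
print(f"mean-based: {mean_score:+.3f}, percentile-based: {percentile_score:.2f}")
```

Keeping the full list of variant values, rather than a single expected value, is what makes both kinds of score available from the same set of rollouts.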

12. 【2606.11106】FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

Link: https://arxiv.org/abs/2606.11106

Authors: Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pregnant women receive, trained sonographers limits, sonographers limits prenatal, limits prenatal ultrasound, prenatal ultrasound screening

Abstract:A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using this http URL with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at this https URL.
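
For readers unfamiliar with the segmentation metric quoted above, here is a minimal sketch of how a mean Dice score is typically computed over binary masks. The flat-list mask format and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ.

```python
def dice(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def mean_dice(pairs):
    """Average Dice over (prediction, ground truth) mask pairs."""
    scores = [dice(p, t) for p, t in pairs]
    return sum(scores) / len(scores)

pairs = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect overlap
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # one extra predicted pixel
]
print(round(mean_dice(pairs), 4))
```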

13. 【2606.11096】IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Link: https://arxiv.org/abs/2606.11096

Authors: Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong, Yixuan Ren, Bo He, Yu-Gang Jiang, Zuxuan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision foundation models, pretrained vision foundation, Built on pretrained, constructing semantically rich, semantically rich latent

Notes: Code is available at [this https URL](https://github.com/Row11n/IDEAL)

Abstract:Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

14. 【2606.11078】A History-Aware Visually Grounded Critic for Computer Use Agents

Link: https://arxiv.org/abs/2606.11078

Authors: Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, complex Graphical User, Computer Use Agents, User Interface, Graphical User

Notes: Code: [this https URL](https://github.com/G-JWLee/HiViG)

Abstract:Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.
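
The idea of checking raw click coordinates before execution can be illustrated with a much simpler rule-based stand-in. In the paper the check is performed by a trained multimodal critic looking at the screenshot; the sketch below instead assumes a hypothetical list of already-detected UI elements with bounding boxes, and the `Element` and `critique_click` names are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """Hypothetical detected UI element with a pixel bounding box."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int

def critique_click(x, y, intended, elements):
    """Return (ok, message) for a proposed click before it is executed."""
    hit = next(
        (e for e in elements if e.x0 <= x < e.x1 and e.y0 <= y < e.y1),
        None,
    )
    if hit is None:
        return False, f"click ({x}, {y}) lands on no known element"
    if hit.name != intended:
        return False, f"click lands on '{hit.name}', not '{intended}'"
    return True, "ok"

screen = [Element("Search", 0, 0, 100, 40), Element("Cancel", 120, 0, 200, 40)]
print(critique_click(150, 20, "Search", screen))
```

A rejected action would be returned to the policy with the message, so the error is intercepted before it changes the GUI state.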

15. 【2606.11032】U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

Link: https://arxiv.org/abs/2606.11032

Authors: Zhiwen Yang, Jiayin Li, Hao Lu, Hui Zhang, Zihua Wang, Bingzheng Wei, Yan Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Positron Emission Tomography, Existing deep learning, Emission Tomography, Positron Emission, Existing deep

Abstract:Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at this https URL.

16. 【2606.11012】An Uncertainty Estimation Framework for Dose Accumulation in Adaptive Radiotherapy: Application to CBCT-Guided Radiotherapy for Cervical Cancer

Link: https://arxiv.org/abs/2606.11012

Authors: Cedric Hemon, Delphine Lebret, Jean-Claude Nunes, Valentin Boussot, Karine Peignaux, Nathalie Mesgouez-Nebout, Chantal Hanzen, Antoine Simon, Anaïs Barateau, Renaud de Crevoisier, Caroline Lafond

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: estimation remains limited, Background and purpose, enables daily plan, daily plan adaptation, dose estimation remains

Notes: Under revision

Abstract:Background and purpose: oART enables daily plan adaptation to interfraction anatomical variations, but cumulative dose estimation remains limited by DIR, segmentation, and anatomical uncertainties. We introduce IMPACT-DoseAcc, an uncertainty-aware dose accumulation framework, within IMPACT for semantic feature-driven image analysis. The framework is modality- and disease-agnostic and is applied to CBCT-guided oART for cervical cancer (LACC). Material and Methods: Nine LACC patients were retrospectively analyzed using daily CBCT-derived virtual CTs for dose recalculation. IMPACT-DoseAcc focuses on uncertainty from DIR, without modeling vCT-generation uncertainty. Two DIR uncertainty strategies were tested within IMPACT-Reg: a Bayesian segmentation-guided approach using one probabilistic model to quantify anatomical uncertainty, and an ensemble of segmentation models targeting structures to capture epistemic variability. Voxel-wise uncertainty maps were propagated through dose warping and accumulation to generate probabilistic dose-volume histograms. Ensemble uncertainty was quantified from voxel-wise standard deviation across deformation fields, and geometric error was assessed using surface distance between warped and validated contours. Anatomical-variability weighting refined aggregation. Results: Ensemble DIR uncertainty correlated with geometric error, with Pearson coefficients of 0.63 for CTVt and 0.66 for bladder. For CTVt, pDVHs achieved 96.3 +/- 3.9% coverage, showing calibration of propagated uncertainty. Weighting stabilized estimates across fractions and organs. Conclusions: IMPACT-DoseAcc propagates registration-driven uncertainty to cumulative dose metrics, improving interpretation of accumulated dose under anatomical variations. Its 3DSlicer integration supports reproducible, uncertainty-informed ART workflows.
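
The two quantities named in the methods and results, voxel-wise standard deviation across an ensemble of deformation fields and its Pearson correlation with geometric error, can be sketched as follows. This is a simplified illustration: real deformation fields are 3D vector fields, whereas the sketch assumes each field is already reduced to a flat list of per-voxel displacement magnitudes, and all numbers are invented.

```python
import math
import statistics

def voxelwise_std(fields):
    """Per-voxel population standard deviation across K ensemble fields.

    fields: list of K lists, each holding one displacement magnitude per voxel.
    """
    return [statistics.pstdev(voxel_values) for voxel_values in zip(*fields)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Three ensemble members, four voxels each (made-up magnitudes in mm).
ensemble = [
    [1.0, 2.0, 0.5, 3.0],
    [1.1, 2.6, 0.5, 4.0],
    [0.9, 1.4, 0.5, 2.0],
]
uncertainty = voxelwise_std(ensemble)
surface_error = [0.2, 0.9, 0.1, 1.5]  # made-up geometric error per voxel
print(uncertainty, pearson(uncertainty, surface_error))
```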

17. 【2606.11001】IPSM-Bench: A New Intermediate Phase Segmentation Benchmark in Microstructure Images of Zinc-Based Absorbable Biomaterials

Link: https://arxiv.org/abs/2606.11001

Authors: Jinglin Xu, Shangyan Zhao, Jiabo Wang, Xinghong Mu, Yulong Lei, Jiacheng Zhang, Hongbo Sun, Yageng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: absorbable metallic biomaterials, indispensable emerging absorbable, emerging absorbable metallic, intermediate phase segmentation, Zinc-based alloys

Notes: Accepted by IJCAI 2026

Abstract: Zinc-based alloys are indispensable emerging absorbable metallic biomaterials, and their macroscopic performance is governed by microstructural characteristics. Intermediate phases, which are key microstructural constituents, are pivotal in regulating mechanical and functional properties. However, intermediate phase segmentation in zinc alloy microstructures faces formidable challenges: scarce annotated datasets, low contrast, difficulty detecting small targets, and heterogeneous morphologies. To this end, we construct IPSM-Bench, the largest high-quality dataset for zinc-alloy intermediate phase segmentation. Furthermore, we propose SCoP-SAM, a new Spatial Context Prior-guided SAM method that leverages the gradient structure and grayscale properties of intermediate phases to capture spatial context priors and incorporates them into the entire SAM encoding-decoding process, improving segmentation performance. Based on the proposed IPSM-Bench, we establish a new benchmark for intermediate phase segmentation to systematically evaluate state-of-the-art (SOTA) methods and advance research on zinc alloy microstructure analysis. Extensive experiments on IPSM-Bench and additional public alloy benchmarks demonstrate that our SCoP-SAM not only achieves SOTA performance for zinc-alloy intermediate phase segmentation but also generalizes remarkably well to other alloy scenarios.

18. 【2606.10988】AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects

Link: https://arxiv.org/abs/2606.10988

Authors: Yiming Zhao, Haoyu Sun, Aoyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: model creation workflows, substantially accelerated static, category-agnostic animation generation, asset production, creation workflows

Abstract:While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for category-agnostic animation generation exhibit critical limitations in inference speed, motion quality, and adherence to textual prompts, thereby leaving the process dependent on labor-intensive manual artistry. To address these challenges, this paper introduces AnimaSpark, a novel pipeline for category-agnostic 3D animation generation. Our approach is motivated by the key insight that for many fundamental motions in the 3D world, the corresponding joint transformations can often be effectively modeled within a two-dimensional subspace. The pipeline begins by rendering a rigged static 3D model into multi-layered image representations of its mesh and skeleton, which are subsequently fed into a video generation model. We then employ a keypoint tracking algorithm on the generated video to capture the motion of the skeletal joints projected onto the camera's viewing plane. In the final stage, we distill the planar translations and rotations from these tracked keypoints and lift them from the 2D domain into 3D space to animate the character. Comprehensive evaluations reveal that our method achieves superior performance over existing state-of-the-art techniques across key metrics, including text-motion alignment, quality of motion, and computational efficiency.

19. 【2606.10967】Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

Link: https://arxiv.org/abs/2606.10967

Authors: Pradnya Halady,Jiale Wei,Zdravko Marinov,Alexander Jaus,Simon Reiß

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate predictions based, Visual in-context, Visual in-context learning, pathway towards dynamic, generate predictions

Notes:

Abstract:Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby adapt to new vision tasks at test time. Yet the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress test six models on $14$ datasets and $12$ tasks (in total, we explore $106$ dataset-task combinations) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

20. 【2606.10953】Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Link: https://arxiv.org/abs/2606.10953

Authors: Fedor Rodionov,Aleksandar Cvejic,Michael Birsak,John Femiani,Peter Wonka

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real estate visualization, interior design, estate visualization, real estate, floor plans

Notes: 17 pages, 10 figures

Abstract:Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.

21. 【2606.10940】Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

Link: https://arxiv.org/abs/2606.10940

Authors: Paul Fergus,Philip Stephens,Russell A. Hill,Lee Oliver,Katie Appleby,Sarah Beatham,Naomi Davies Walsh,Stuart Nixon,Naomi Matthews,Chris Sutherland,Kelly Hitchcock

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: British Isles, turns vast quantities, usable ecological data, biodiversity monitoring, cornerstone of biodiversity

Notes: 15 Pages, 4 Figures

Abstract:Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images into usable ecological data is often locked behind commercial platforms or trained on fauna that does not match that of the British Isles. In an attempt to remove barriers and increase uptake, we release an open-source object detection model for 31 classes: 28 common UK mammal and bird species plus utility classes for humans, calibration poles, and vehicles, drawn from a curated dataset of 48,165 labelled instances assembled from multiple sites over a decade of operational deployment through Conservation AI and its successor, Trap Tracker. The model, a YOLO26x detector trained and tested on an 80/10/10 class-stratified split, achieves a mean Average Precision of 0.984 at Intersection over Union (IoU) of 0.5 (0.956 at IoU 0.5-0.95) on the held-out validation set, with precision 0.988 and recall 0.965. On an unseen held-out test split, mean per-species confidence ranged from 0.96 to 0.99 across the 31 classes, with a 0.17% false-negative rate concentrated in difficult night-time, distant, or occluded images. These metrics come from data drawn from the same pool of sites and cameras as training, so performance at entirely new sites is left to future work. We release the trained weights in ONNX format under a non-commercial licence, with local desktop and real-time camera support, aimed explicitly at ecologists with no machine-learning experience. This release is a deliberate counterweight to the multiple paid-for models that have developed over the last decade.
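
The reported mean Average Precision depends on an IoU threshold that decides whether a predicted box counts as a match. As a minimal sketch of that building block (not the paper's evaluation code), the snippet below computes IoU for two axis-aligned boxes; the `(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

The "0.5-0.95" figure in the abstract conventionally averages the metric over a range of such thresholds rather than a single one.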

22. 【2606.10939】PENet+: A Lightweight Residual Transformer Framework for Efficient Image Steganalysis

Link: https://arxiv.org/abs/2606.10939

Authors: Jincheol AN,Dongsu Kim,Haneol Jang,YoungJoon Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hidden information embedded, digital forensics, hidden information, information embedded, core component

Notes: IEEE ACCESS

Abstract:Image steganalysis, the detection of hidden information embedded in digital images, is a core component of modern cybersecurity and digital forensics. Recent residual Transformer architectures, such as the Pixel-Difference-Convolution and Enhanced-Transformer-Network (PENet) [1], achieve strong detection accuracy, but their computational and memory demands hinder deployment in resource-constrained settings. We present PENet+, a lightweight steganalysis framework that preserves PENet's discriminative structure while substantially improving efficiency. Rather than redesigning or compressing the attention blocks, we retain PENet's self-attention topology for reproducibility and add a classifier-streamlining stage that progressively narrows the SPP-to-FC1 input channels (SPP: spatial pyramid pooling; FC1: first fully connected layer), yielding large reductions in parameters and FLOPs with negligible accuracy loss. We further refine the high-pass-filter (HPF) stem with an activation-aware mechanism that aggregates HPF responses early and selects a balanced SRM-Gabor top-K subset, and we replace PENet's backbone with a MobileNetV2-style inverted residual network. A balanced configuration with K=31 filters (16 Gabor + 15 SRM) matches or surpasses heavier settings at lower compute. Finally, we motivate PReLU from a steganalysis standpoint, arguing that preserving negative responses helps capture weak stego cues that ReLU suppresses. On a disjoint ALASKA2 JPEG QF90 protocol at 512x512 resolution (5,000 cover images for training, validation, and internal testing; a separate 19,000-cover evaluation set), PENet+ achieves up to 45.5% fewer parameters and about 97% fewer FLOPs than the re-evaluated PENet baseline, offering a computationally efficient direction for resource-constrained steganalysis. Device-level latency and power measurements remain future work.
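
The argument for PReLU over ReLU can be made concrete with a tiny numeric sketch. The residual values and the slope `alpha=0.25` below are illustrative assumptions, not values from the paper; in a real network the slope is a learned parameter.

```python
def relu(x):
    """Standard ReLU: negative inputs are zeroed out."""
    return max(0.0, x)

def prelu(x, alpha=0.25):
    """PReLU: negative inputs are scaled by alpha instead of being discarded."""
    return x if x >= 0 else alpha * x

# Hypothetical high-pass filter responses; weak stego cues may be negative.
residuals = [-0.8, -0.1, 0.0, 0.3]

relu_out = [relu(r) for r in residuals]
prelu_out = [prelu(r) for r in residuals]
```

After ReLU the two negative responses become indistinguishable zeros, while PReLU keeps their sign and relative magnitude, which is the property the abstract argues matters for weak embedding signals.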

23. 【2606.10905】Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

Link: https://arxiv.org/abs/2606.10905

Authors: Sunil Khatri,Steven Landgraf,Markus Ulrich,Simon Reiß

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual in-Context Learning, in-Context Learning, aims at making, making progress, VICL

Notes:

Abstract:Visual in-Context Learning (VICL) aims at making progress towards adaptive vision models that can, based on a few examples, adapt to a new task at test time. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely $1$ million parameters and a modest amount of $70,000$ images. We compare the results of this severely capacity-capped tiny model to $7,000\times$ larger VICL models in different adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments expose shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.

24. 【2606.10902】Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

Link: https://arxiv.org/abs/2606.10902

Authors: Xuan Han,Yihao Zhao,Mingyu You

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: modern image generation, Subject Customization, foundational task, task in modern, image generation

Notes:

Abstract:Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often exhibit inaccurate poses or inconsistent cross-pose appearances. These limitations suggest that understanding objects in a volumetric manner remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to directly adapt to new subjects through multiple paired image-pose references. Its core mechanism, Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure its seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.

25. 【2606.10894】The 1st PortraitCraft Challenge: A CVPR 2026 Workshop Competition on Portrait Composition Understanding and Generation

Link: https://arxiv.org/abs/2606.10894

Authors: Zijie Lou,Youyun Tang,Xiaochao Qu,Haoxiang Li,Ting Liu,Luoqi Liu,Xun Zhu,Zheng Zhang,Xi Chen,Miao Li,Ji Wu,Dizhe Zhang,Xian Ge,Sujia Wang,Ruiyang Zhang,Jiaming Wang,Xianshun Wang,Lu Qi,Boao Kang,Wei Zhou,Jinghui Sun,Zhenyu Yan,Jiliang Zhao,Rui Yang,Yipo Huang,Boyuan Liu,Shanglin Li,Zifan Xie,Yichen Zhang,Anlan Wang,Wenfeng Lin,Mingyu Guo,Dong Li,Xinghao Wang,Yanting Li,Shanzhao Tong,Shuai He,Qiu Zhou,Yongqi Yang,Taoyang Mu,Dianqiao Lei,Anlong Ming,Huadong Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: competitions at CVPR, portrait composition understanding, inaugural PortraitCraft Challenge, paper presents, presents an overview

Notes:

Abstract:This paper presents an overview of the inaugural PortraitCraft Challenge, held as one of the official competitions at CVPR 2026. The challenge focuses on portrait composition understanding and generation, aiming to advance AI research in portrait aesthetics analysis and controllable image synthesis. Unlike existing datasets and tasks that primarily focus on global aesthetic scoring, PortraitCraft introduces a unified evaluation framework comprising two complementary tracks. Track 1 requires models to perform structured portrait composition understanding, and Track 2 requires models to generate portrait images from structured composition descriptions under explicit compositional constraints. To support the challenge, we constructed and publicly released a large-scale portrait composition dataset consisting of approximately 50,000 curated real portrait images, providing multi-level supervision. This report describes the challenge setup, evaluation protocols, dataset composition, and final results, along with an analysis of the technical characteristics of the submitted solutions. The PortraitCraft Challenge provides a standardized and reproducible platform for research on portrait composition understanding and generation, and is expected to foster further progress in the fields of portrait aesthetics and controllable image generation.

26. 【2606.10892】Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

Link: https://arxiv.org/abs/2606.10892

Authors: Yihao Zhao,Xuan Han,Bin He,Mingyu You

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: creating high-quality display, incur substantial costs, substantial costs creating, costs creating high-quality, high-quality display images

Notes:

Abstract:To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object's prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module's optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

27. 【2606.10887】Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio

Link: https://arxiv.org/abs/2606.10887

Authors: Avi Gupta,Nilotpal Sinha,Vishnu Raj,Sambuddha Saha,Pratik Joshi,Koteswar Rao Jerripothula,Tammam Tillo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: previously acquired knowledge, Class-Incremental Learning, forgetting previously acquired, aims to continuously, acquired knowledge

Notes:

Abstract:Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio's audio-visual priors into the CIL setting. Specifically, we leverage its dense audio and visual representations and employ a novel guided attention strategy where the audio features contextually guide the visual representations. To further mitigate catastrophic forgetting, we introduce dual-level distillation objectives at both the feature and logit levels. Extensive evaluations on audio-visual CIL benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods.
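
The abstract does not spell out its distillation objectives. As a rough illustration of what a logit-level term commonly looks like in class-incremental learning, the sketch below uses the generic temperature-softened KL divergence between the previous model's logits and the current model's logits on the old classes. The function names, the temperature value, and the example logits are assumptions, and the paper's actual loss may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL(p_old || p_new) restricted to the classes the old model knows.

    new_logits may contain extra entries for newly added classes; only the
    first len(old_logits) entries are compared.
    """
    n_old = len(old_logits)
    p = softmax(old_logits, temperature)
    q = softmax(new_logits[:n_old], temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Old model knows 3 classes; the new model has 2 additional class logits.
old = [2.0, 0.5, -1.0]
unchanged = logit_distillation_loss(old, [2.0, 0.5, -1.0, 0.3, 0.1])
drifted = logit_distillation_loss(old, [0.2, 1.5, -1.0, 0.3, 0.1])
```

The loss is zero when the new model reproduces the old predictions on old classes and grows as they drift, which is how such a term discourages forgetting.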

28. 【2606.10877】XtrAIn: Training-Guided Occlusion for Feature Attribution

Link: https://arxiv.org/abs/2606.10877

Authors: Thodoris Lymperopoulos,Ioannis Kakogeorgiou,Denia Kanellopoulou

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Occlusion-based attribution methods, estimate feature importance, Occlusion-based attribution, importance by perturbing, measuring the resulting

Notes: 12 pages, 7 figures, 1 table

Abstract:Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability is strongly affected by how feature removal is implemented: externally selected baselines can introduce bias, out-of-distribution samples, and unstable explanations, while in nonlinear models the occlusion of a set of features can also alter the contribution of non-occluded features. We refer to this effect as attribution shift, as the attribution scores of the non-occluded features drift from their initial values. To address these issues, which render explanations unstable, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model's training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation for reducing computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn provides a training-aware perspective on feature attribution and offers a useful diagnostic tool for studying how feature-level evidence is formed during training.
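
The attribution shift described above can be reproduced with standard input-space occlusion on a toy nonlinear model. The sketch below illustrates the problem only, not XtrAIn itself; the product model, the all-zero baseline, and the function names are assumptions chosen to make the effect easy to see.

```python
def occlusion_attribution(model, x, baseline, index):
    """Standard occlusion: output drop when feature `index` is set to its baseline."""
    occluded = list(x)
    occluded[index] = baseline[index]
    return model(x) - model(occluded)

def toy_model(x):
    """A minimal nonlinear model in which the two features interact."""
    return x[0] * x[1]

x = [2.0, 3.0]
baseline = [0.0, 0.0]

# Attribution of feature 1 on the original input.
before = occlusion_attribution(toy_model, x, baseline, index=1)

# Occlude feature 0 first, then measure feature 1 again.
x_without_f0 = [baseline[0], x[1]]
after = occlusion_attribution(toy_model, x_without_f0, baseline, index=1)
```

Feature 1 itself never changed, yet its score moves from 6.0 to 0.0 once feature 0 is occluded. In a purely additive model the two scores would be identical, so the drift comes entirely from the interaction.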

29. 【2606.10876】Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

Link: https://arxiv.org/abs/2606.10876

Authors: Rosalie C. Mendoza,Vivian C. Daracan,Arlene D. Romano,Ronniel D. Manalo,Xin Jie Tang,Yi Hong Wong,Yong Haur Tay

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: timber trade continue, pose significant challenges, Illegal logging, Acacia mangium Willd., accurate wood species

Notes:

Abstract:Illegal logging and timber trade continue to pose significant challenges in the Philippines, where accurate wood species identification is essential for enforcement but limited by the need for specialised equipment and expertise. This study aims to evaluate whether AI models for macroscopic wood identification can be developed and deployed by wood scientists without programming expertise using the Xylorix platform, focusing on five Philippine hardwood species: Mangium (Acacia mangium Willd.), Rain Tree [Samanea saman (Jacq.) Merr.], Banuyo (Wallaceodendron celebicum Koord.), Tindalo [Afzelia rhomboidea (Blanco) Vidal], and Ipil [Intsia bijuga (Colebr.) O. Kuntze]. Binary classifiers were trained on 10,663 verified cross-section images from 260 specimens and evaluated using specimen-level mean scoring to mirror operational field conditions. Area Under the ROC Curve (AUC) values ranged from 0.969 (Ipil) to 1.000 (Mangium), and Average Precision (AP) values ranged from 0.589 (Samanea) to 1.000 (Mangium). Four of five species achieved AA grade (AUC and AP both ≥ 0.90); Rain Tree received AE (AUC ≥ 0.90, AP 0.60) due to AP compression from its small positive test set (3 specimens). All five classifiers rank their target specimens above non-target specimens with near-perfect fidelity. Specimen-level error analysis revealed 9 false negatives for Ipil, primarily stemming from localized image artifacts, as well as 3 false positives for Rain Tree and 1 false positive for Tindalo caused by shared tribal-level anatomical traits. These findings demonstrate that non-programmers can leverage the Xylorix platform to construct operationally reliable wood identification models suitable for field deployment at supply chain checkpoints.
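
Specimen-level mean scoring, as described above, averages the per-image classifier scores of each specimen before any metric is computed. The sketch below shows that aggregation followed by a rank-based AUC. The specimen IDs, scores, and labels are made-up illustrative data, and the function names are assumptions; none of this reflects the Xylorix platform's internals.

```python
from collections import defaultdict

def specimen_mean_scores(image_scores):
    """Average per-image scores into one score per specimen.

    image_scores: iterable of (specimen_id, score) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for specimen_id, score in image_scores:
        sums[specimen_id] += score
        counts[specimen_id] += 1
    return {sid: sums[sid] / counts[sid] for sid in sums}

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data: two images per specimen; A and C are the target species.
images = [("A", 0.9), ("A", 0.7), ("B", 0.2), ("B", 0.4),
          ("C", 0.2), ("C", 0.3), ("D", 0.1), ("D", 0.1)]
target = {"A": 1, "B": 0, "C": 1, "D": 0}

means = specimen_mean_scores(images)
ids = sorted(means)
specimen_auc = auc([means[i] for i in ids], [target[i] for i in ids])
```

With only a handful of positive specimens, a single misranked specimen moves the metric a lot, which is consistent with the abstract's remark about a test set of 3 positive specimens.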

30. 【2606.10874】Schmidt Decomposition-Based Methods for Efficient Quantum Image Encoding

Link: https://arxiv.org/abs/2606.10874

Authors: Ana-Maria Pangeva,Yassine Ferhi,Alexander Geng,Andreas Weinmann,Desislava Ivanova,Ali Moghiseh

Categories: Computer Vision and Pattern Recognition (cs.CV); Quantum Algebra (math.QA); Quantum Physics (quant-ph)

Keywords: Enhanced Quantum Representation, classical image data, quantum, Quantum Probability Image, encoding classical image

Notes:

Abstract:In quantum image processing, a fundamental step is encoding classical image data into quantum states. This can be achieved using methods such as Flexible Representation of Quantum Images (FRQI), Quantum Probability Image Encoding (QPIE), and Novel Enhanced Quantum Representation (NEQR). However, on real quantum hardware, these encodings can quickly lead to circuits with many gates, large circuit depth, and high qubit usage, which is a problem for Noisy Intermediate-Scale Quantum (NISQ) devices. In this work, we investigate whether low-rank state approximation, formulated via Schmidt decomposition, can help reduce this complexity. The method keeps only the most significant parts of a quantum state's entanglement structure, making state preparation more efficient while preserving most of the image information. We compare the three encoding techniques in their original form and with low-rank approximation, evaluating metrics such as circuit depth, CNOT count, MSE, and visual quality of reconstructed images. The results reveal meaningful trade-offs between accuracy and resource efficiency, with the FRQI model achieving a 97 percent reduction in circuit depth while maintaining a near-perfect reconstruction (MSE of about 0.27). This demonstrates the potential of low-rank techniques for advancing practical quantum image processing on near-term hardware.
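
For readers unfamiliar with the technique, low-rank approximation via Schmidt decomposition can be written out in its standard textbook form. The bipartition into subsystems $A$ and $B$, the rank symbols $R$ and $r$, and the notation below are assumptions, since the abstract does not give the paper's own formulation.

```latex
% Schmidt decomposition of a bipartite pure state, coefficients sorted in
% decreasing order and normalized:
|\psi\rangle = \sum_{k=1}^{R} \lambda_k \, |u_k\rangle_A \otimes |v_k\rangle_B,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_R > 0,
\qquad \sum_{k=1}^{R} \lambda_k^2 = 1 .

% Rank-r truncation (r < R), renormalized to unit length:
|\psi_r\rangle = \frac{1}{\sqrt{\sum_{k=1}^{r} \lambda_k^2}}
  \sum_{k=1}^{r} \lambda_k \, |u_k\rangle_A \otimes |v_k\rangle_B .

% Fidelity between the original and the truncated state:
|\langle \psi | \psi_r \rangle|^2 = \sum_{k=1}^{r} \lambda_k^2 .
```

The fidelity expression shows why discarding small Schmidt coefficients loses little image information: the lost fidelity equals the total squared weight of the dropped terms.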

31. 【2606.10862】LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

Link: https://arxiv.org/abs/2606.10862

Authors: Taishan Li,Jiwen Zhang,Siyuan Wang,Xuanjing Huang,Zhongyu Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: models achieve strong, achieve strong performance, fully visible, achieve strong, evaluations assume

Notes: 14 pages, 7 figures

Abstract:Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce LIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose Viewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: this https URL.

32. 【2606.10839】HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Link: https://arxiv.org/abs/2606.10839

Authors: Cong Wang,Zhentao Yu,Hongmei Wang,Weicong Liang,Zixiang Zhou,Zilin Yang,Jiarong Ou,Rui Chen,Yuan Zhou,Qinglin Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current identity-consistent video, generation methods struggle, identity-consistent video generation, Current identity-consistent, video generation methods

Notes: Project Page: [this https URL](https://conallwang.github.io/HarmoView_Pages)

Abstract:Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection (MFI) to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually-curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

33. 【2606.10819】Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Link: https://arxiv.org/abs/2606.10819

Authors: Miaoxin Cai,Guanqun Wang,Wei Zhang,Guangyao Zhou,Yin Zhuang,Tong Zhang,Hao Wang,He Chen,Jun Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enable natural-language understanding, earth observation imagery, RS-MLLMs enable natural-language, observation imagery, enable natural-language

Notes:

Abstract:RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.

34. 【2606.10818】IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

Link: https://arxiv.org/abs/2606.10818

Authors: Jiawei Gao,Chaoqi Liu,Peilin Wu,Haonan Chen,Yilun Du

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-world robotic manipulation, robotic manipulation tasks, performing contact-rich tasks, table wiping, involve forceful interactions

Notes: Project website: [this https URL](https://gao-jiawei.com/IMPACT/)

Abstract:Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.

35. 【2606.10811】Deep learning for echo sounder data

Link: https://arxiv.org/abs/2606.10811

Authors: Ketil Malde

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep learning methods, process and interpret, machine learning, deep learning, learning methods

Notes:

Abstract:There is no doubt that over the last decade, techniques from the field of machine learning have revolutionized how we process and interpret data, especially images and text. For underwater observations, acoustics is a primary source of information, and naturally, deep learning methods have been applied to echograms and other acoustic data, but so far with rather modest results. Here, we argue that due to intrinsic properties of acoustic data, substantial advances will likely require research into deep learning methods beyond mere recycling of models and techniques from image processing. Currently, the potential for breakthroughs in method development is hindered by the lack of standard data formats and organization, and even more by the lack of readily available, high-quality data sets with established performance goals. To advance the field, these shortcomings should be remedied.

36. 【2606.10804】SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Link: https://arxiv.org/abs/2606.10804

Authors: Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Controlled character animation, requires transferring motion, animation requires transferring, Controlled character, character animation requires

Comments:

Abstract: Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works rely heavily on intermediate representations, including pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: this https URL.

37. 【2606.10803】Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Link: https://arxiv.org/abs/2606.10803

Authors: Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, utilizing digital APIs

Comments:

Abstract:Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

38. 【2606.10790】A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

Link: https://arxiv.org/abs/2606.10790

Authors: Bharghav Kota (1), Yulia Sandamirskaya (1) ((1) Zurich University of Applied Sciences, Wädenswil, Switzerland)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frame rate, hand detection, detection, detection rate, existing RGB Egohands

Comments:

Abstract:Existing hand detection algorithms work on images and the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras cause motion blur, especially in darker lighting conditions. We can leverage the use of event-based cameras which possess a high dynamic range, high temporal resolution, and low power consumption. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras in object detection and recognition tasks is a relatively low amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view perspective. The data is synthesized from the existing RGB Egohands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model which is applied to the RGB images in the Egohands dataset and interpolated on the high-temporal resolution events. We use the multi-modal dataset to perform hand detection with existing object detection algorithms which use a multi-modal setup of event and RGB cameras and demonstrate performance comparable to the state-of-the-art.
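
One plausible reading of "interpolated on the high-temporal resolution events" is linear interpolation of box coordinates between two consecutive RGB frames. The following is a minimal sketch under that assumption. The function name, the `(x1, y1, x2, y2)` box format, and the timestamps are illustrative and not taken from the paper.

```python
def interpolate_box(t, t0, box0, t1, box1):
    """Linearly interpolate a bounding box between two frame timestamps.

    box0 and box1 are (x1, y1, x2, y2) tuples detected at times t0 and t1
    (assumed format); t is the timestamp of an event slice in between.
    """
    if t0 >= t1 or not (t0 <= t <= t1):
        raise ValueError("require t0 < t1 and t0 <= t <= t1")
    alpha = (t - t0) / (t1 - t0)
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(box0, box1))


# Box halfway between two frames taken one second apart.
mid_box = interpolate_box(0.5, 0.0, (10, 20, 50, 60), 1.0, (20, 30, 60, 70))
```

Linear interpolation is only reasonable when the hand moves little between frames; for fast motion a real pipeline would likely need a denser detector or a motion model.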

39. 【2606.10778】From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology

Link: https://arxiv.org/abs/2606.10778

Authors: Sofiène Boutaj, Leo Fillioux, Maria Vakalopoulou, Stergios Christodoulidis, Pierre Marza

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Instance Learning, providing robust representations, whole-slide image, Instance Learning, recently redefined

Comments: Accepted to MICCAI 2026

Abstract:Foundation Models (FMs) have recently redefined the state-of-the-art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal foundation model (FM) for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes using ABMIL and Mean Pooling aggregations. We observe a high correlation between tile and slide performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is more influenced by cohort sizes and numbers of tiles per slide than by average task difficulty. We also measure the agreement in best performing models between tile and slide-level tasks, showing tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
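
The tile-to-slide comparison can be illustrated with a rank correlation between per-model tile scores and slide scores. The abstract does not say which correlation measure is used, so Spearman's rho is an assumption here, and the scores below are made-up numbers for four hypothetical encoders.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for four hypothetical encoders.
tile_accuracy = [0.71, 0.80, 0.65, 0.90]
slide_auc = [0.75, 0.82, 0.70, 0.88]
rho = spearman(tile_accuracy, slide_auc)
```

A rho close to 1 means the tile benchmark orders the encoders the same way the slide-level pipeline does, which is the property needed for shortlisting candidates.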

40. 【2606.10775】Spatially Selective Self-Training for Unsupervised Building Change Detection

Link: https://arxiv.org/abs/2606.10775

Authors: Wafaa I. M. Hussin, Zhi Lu, Anas M. I. Mohammed, Xiang Zhou, Ratiba A. H. Abubaker, Zhenming Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing images, unlabeled bi-temporal remote, bi-temporal remote sensing, learn building-change masks, sensing images

Comments: Under Review

Abstract: Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines. Code will be made publicly available.

41. 【2606.10769】ZODS-RS -- Zero-training Oriented Detection Segmentation for Remote Sensing

Link: https://arxiv.org/abs/2606.10769

Authors: Zuan Gu, Tianhan Gao, Langxu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: task-specific training, applications need models, models that generalize, generalize across platforms, platforms and viewpoints

Comments:

Abstract: Remote-sensing and UAV applications need models that generalize across platforms and viewpoints without task-specific training. Yet training-free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS-RS, a training-free, closed-form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM-style proposals, ZODS-RS chains: PP (prototype purification via Tyler covariance), R-SEM (rotation-scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty-aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain $\mathrm{mAP}_{0.50:0.95}=13.06$ and $\mathrm{AP}_S=2.93$ (class-averaged over ship/airplane); on xView (HBB) we report $\mathrm{mAP}=16.69$. On our UAV dataset, ZODS-RS achieves mask $\mathrm{mIoU}=31.10$ and improves small-object AP by $+30.70$ over Grounded-SAM on a single 5090. This work offers a unified, no-training solution for horizontal-box detection plus instance segmentation in aerial imagery; provides explicit closed-form formulations for PP/R-SEM/UAM tightly coupled with DINOv3; and demonstrates consistent gains on small and crowded targets and under cross-domain shifts while keeping deployment simple.

42. 【2606.10756】DD-INR: Dynamics-Driven Implicit Neural Representation for Accelerated Whole-Brain Functional MRI Reconstruction

Link: https://arxiv.org/abs/2606.10756

Authors: Qiaoxin Li (MIND), Caini Pan (NEUROSPIN, MIND), Pierre-Antoine Comby (MIND, BAOBAB), Chaithya Giliyar (MIND), Philippe Ciuciu (MIND)

Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Task-evoked BOLD signals, traditional anatomical MRI, Task-evoked BOLD, high k-space undersampling, enables enhanced detection

Comments:

Abstract:Accelerated acquisition of fMRI enables enhanced detection of neurovascular (BOLD) activity in the brain, but image reconstruction becomes challenging with high k-space undersampling: Task-evoked BOLD signals are small in magnitude, which traditional anatomical MRI reconstruction methods fail to recover, as they favor spatial accuracy over temporal fidelity. We present DD-INR, a Dynamics-Driven Implicit Neural Representation framework tailored for accelerated fMRI that benefits from incoherent time-varying sampling and a tailored spatiotemporal prior, outperforming traditional methods, demonstrated in simulation and in-vivo acquisition, both in terms of image quality and retrieval of activation patterns. DD-INR achieves this by splitting the fMRI data into a static background and a temporally varying dynamic component, representing only the dynamics with a dedicated INR, thereby focusing the model's capacity on activation-relevant changes while remaining compact. In general, DD-INR provides a promising framework for accelerated fMRI reconstruction, with the potential to improve the sensitivity and robustness of fMRI studies within practical scan time limits. The source code is available at this https URL.

43. 【2606.10735】Patient-Level Diagnosis of Acute Myeloid Leukemia via Deep Learning Analysis of Bone Marrow Smear

Link: https://arxiv.org/abs/2606.10735

Authors: Yuqi Ma, Tianyi Wang, Weihua Meng, Hongru Chen, Fajin Tao, Qunxian Lu, Lin An, Xiaodong Mo, Gen Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Bone marrow smear, acute myeloid leukemia, review remains important, smear review remains, diagnosis requires aggregation

Comments: 4 figures

Abstract:Bone marrow smear review remains important for acute myeloid leukemia (AML) assessment, but manual single-cell interpretation is labor-intensive and patient-level diagnosis requires aggregation of many cellular observations. We present a cell-to-patient deep learning pipeline for AML-assisted diagnosis from bone marrow smear images. The study included 258 patients from six anonymized centers, including a main cohort of 169 patients from Centers 1-3 and an external validation cohort of 89 patients from Centers 4-6. A 16-category cell annotation vocabulary was used to describe the global cellular composition, including granulocytic, monocytic, erythroid, lymphoid, eosinophilic, and other cells. Rather than identifying strict AML blasts or leukemic blasts, the model targets an expert-defined composite category termed Composite Blast-like Cells (CBLC), comprising N, N1, M, M1, R, R1, J, and J1 according to the project-wide morphological standard. A fixed YOLO-based segmentation module detected cells, predicted contours were matched to expert polygon annotations by contour IoU, and standardized single-cell crops were generated. An EfficientNet-B0 classifier was trained through a two-stage GT-to-YOLO and YOLO-to-YOLO strategy with class-imbalance correction, center-border regularization, and morphology-assisted supervision. Cell-level predictions were aggregated into patient-level CBLC ratios for AML-oriented diagnostic support. The pipeline achieved stable internal validation and maintained external generalization, with ensemble weighted F1-scores of 0.9076, 0.8696, and 0.9124 on Centers 4, 5, and 6, respectively.
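
The aggregation step, turning per-cell predictions into a patient-level CBLC ratio, can be sketched as a simple fraction. The eight CBLC labels come from the abstract. The function name and the non-CBLC labels `"L"` and `"E"` in the example are hypothetical placeholders, and no diagnostic threshold is implied.

```python
# CBLC member categories as listed in the abstract.
CBLC_LABELS = {"N", "N1", "M", "M1", "R", "R1", "J", "J1"}


def cblc_ratio(cell_labels):
    """Fraction of a patient's classified cells that fall in the CBLC group."""
    if not cell_labels:
        raise ValueError("at least one classified cell is required")
    hits = sum(1 for label in cell_labels if label in CBLC_LABELS)
    return hits / len(cell_labels)


# Ten predicted cell labels for one hypothetical patient; "L" and "E" are
# placeholder codes for non-CBLC categories.
predicted = ["N", "M1", "L", "E", "N1", "L", "E", "L", "J", "L"]
ratio = cblc_ratio(predicted)
```

In practice the ratio would be computed over all detected cells of a patient's smear images, so classification errors at the cell level propagate directly into this number.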

44. 【2606.10701】Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

Link: https://arxiv.org/abs/2606.10701

Authors: Yinglong Yan, Yunkai Yang, Haoyi Wang, Wei Fu, Linshan Wu, Honghu Pan, Shaobo Xia, Shanghang Zhang, Hao Chen, Leyuan Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing imagery, Remote sensing vector, Remote sensing, sensing vector mapping, sensing imagery

Comments:

Abstract: Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topology. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at this https URL.

45. 【2606.10699】Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

Link: https://arxiv.org/abs/2606.10699

Authors: Amin Doroodchi, Danial Soleimany

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: impose significant costs, wire pairs inside, standard connector plays, correct color sequence, significant costs

Comments:

Abstract: In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.
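
Two steps in this abstract lend themselves to a short sketch: the 70/15/15 split of the 2,500 images, and the check that detected wires appear in the expected left-to-right color order. The abstract does not name the wiring standard, so the T568B order below is an assumption for illustration, as are the function names and the `(x_center, color)` detection format.

```python
# Assumed wiring standard for illustration only (T568B, pins left to right).
T568B = [
    "white-orange", "orange", "white-green", "blue",
    "white-blue", "green", "white-brown", "brown",
]


def split_counts(total, train_pct=70, val_pct=15):
    """Integer split sizes; the test set takes whatever remains."""
    train = total * train_pct // 100
    val = total * val_pct // 100
    return train, val, total - train - val


def sequence_is_correct(detections, expected=T568B):
    """Check wire order from (x_center, color) detections given in any order."""
    ordered = [color for _, color in sorted(detections, key=lambda d: d[0])]
    return ordered == expected


counts = split_counts(2500)  # (train, validation, test)

# A correctly wired connector and one with wires 2 and 3 swapped.
good = [(10 + 12 * i, color) for i, color in enumerate(T568B)]
swapped = good.copy()
swapped[1], swapped[2] = (good[1][0], good[2][1]), (good[2][0], good[1][1])
```

Sorting by the horizontal box center makes the check independent of the order in which the detector returns its boxes; a production system would also need to handle missing or duplicate detections.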

46. 【2606.10696】Don't waste SAM

Link: https://arxiv.org/abs/2606.10696

Authors: Nermeen Abou Baker, Uwe Handmann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrates exceptional zero-shot, exceptional zero-shot image, released the Segment, zero-shot image segmentation, remarkable accuracy

Comments: Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium)

Abstract: Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline, particularly for tasks that require extensive annotation by senior experts. This study aims to evaluate the generalization of SAM and fine-tuned SAM models using three waste segmentation datasets. Although they are captured from real scenes like those SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state of the art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels on TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

47. 【2606.10683】UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Link: https://arxiv.org/abs/2606.10683

Authors: Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: hardware designs vary, designs vary substantially, fine-grained manipulation, essential for fine-grained, hardware designs

Comments:

Abstract:Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
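
The two reported error reductions follow directly from the before and after values quoted in the abstract. A short check, with a hypothetical helper name:

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


# Values quoted in the abstract: MPJAE in degrees, MPJPE in millimeters.
mpjae_reduction = relative_reduction(15.63, 0.16)
mpjpe_reduction = relative_reduction(18.51, 0.18)

print(f"MPJAE reduction: {mpjae_reduction:.2f}%")
print(f"MPJPE reduction: {mpjpe_reduction:.2f}%")
```

Both values round to the figures stated in the abstract (98.98% and 99.03%), and 18.51 mm versus 0.18 mm matches the "centimeter-scale to sub-millimeter" description.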

48. 【2606.10671】FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Link: https://arxiv.org/abs/2606.10671

Authors: Yu Lu, Junjie Yang, Piotr Koniusz, YuXin Song, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autoregressive video generators, video generators synthesize, synthesize long videos, generators synthesize long, Autoregressive video

Comments: 11 pages, 4 figures

Abstract:Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
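
The dense-near, sparse-far idea can be illustrated with a toy cache that merges adjacent old entries whenever a fixed budget is exceeded. This is not the paper's algorithm: the merge cost, the `alpha` exponent, the scalar values standing in for KV blocks, and the function name are all assumptions chosen to make the behavior visible.

```python
def consolidate(cache, new_entry, budget, alpha=1.0):
    """Append new_entry, then merge adjacent entries until the budget holds.

    Each entry is (start, end, value): a half-open time span plus a scalar
    standing in for a KV block (a deliberate simplification).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    cache = cache + [new_entry]
    now = new_entry[1]

    def cost(i):
        # Merged span length, discounted by a power of the distance to "now":
        # distant pairs are cheaper to merge, so old history becomes coarser.
        span = cache[i + 1][1] - cache[i][0]
        distance = now - cache[i][0] + 1
        return span / distance ** alpha

    while len(cache) > budget:
        i = min(range(len(cache) - 1), key=cost)
        (s0, e0, v0), (s1, e1, v1) = cache[i], cache[i + 1]
        w0, w1 = e0 - s0, e1 - s1
        # Length-weighted mean keeps the merged value an average over time.
        cache[i:i + 2] = [(s0, e1, (w0 * v0 + w1 * v1) / (w0 + w1))]
    return cache


cache = []
for t in range(8):
    cache = consolidate(cache, (t, t + 1, float(t)), budget=4)
spans = [(start, end) for start, end, _ in cache]
```

With a budget of four entries and eight inserted steps, the toy cache ends with spans of length 4, 2, 1, and 1 from oldest to newest, so recent history stays fine-grained while distant history is summarized.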

49. 【2606.10666】Analyzing Training-Free Corruption Detection for Object Detection Datasets

Link: https://arxiv.org/abs/2606.10666

Authors: Christian Sieberichs, Simon Geerkens, Thomas Waschulzik, Viswanathan Ramesh, Alexander Braun

Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Keywords: computer vision, Annotation errors, significantly degrade, degrade the performance, performance of systems

Comments: Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract: Annotation errors are widespread in computer vision datasets and can significantly degrade the performance of systems trained on them, particularly in complex tasks such as object detection. Several approaches exist to identify annotation errors, including training-free feature-space methods which provide a fast and interpretable way to analyze annotations. However, their behavior on object detection annotations, which include semantic and spatial information, remains largely unexplored. In this work we analyze the applicability of feature-space-based approaches for detecting annotation errors in object detection datasets. By adapting an existing feature-space method, we show that such approaches reliably expose semantic mislabels, while positional errors remain difficult to detect. We evaluate this behavior across multiple pretrained embedding models, synthetic noise types (symmetric, asymmetric, and positional), and real-world annotation errors using VOC2012 and KITTI. All code and real-world corruptions are publicly available at the following repository: this https URL ChristianSieberichs/BoundingBox_corruption_detection

50. 【2606.10656】Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

Link: https://arxiv.org/abs/2606.10656

Authors: Qi Song, Yifei He, Chi Zhang, Zheng Fu, Xuhe Zhao, Mengmeng Yang, Kun Jiang, Rui Huang, Diange Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving, scenes is crucial, crucial in autonomous, Forecasting, future

Comments: Project Page: [this https URL](https://maggiesong7.github.io/research/Envision4D/)

Abstract:Forecasting the future evolution of dynamic scenes is crucial in autonomous driving. However, existing feed-forward paradigms are primarily designed for interpolation. When extended to future extrapolation, they suffer from ghosting artifacts under large displacements and are constrained by simplified motion assumptions or strict future priors. To overcome these challenges, we propose Envision4D, a fully self-supervised feed-forward framework for pose-free future extrapolation. Specifically, we introduce a Future Pose Prediction module that infers future camera parameters via an iterative denoising process. Furthermore, to capture non-linear dynamics, we propose In-layer Temporal Attention and employ Conditioned Motion Lifting, which transforms the highly uncertain extrapolation process into robust relational mappings. Finally, a Progressive Training Strategy is utilized to stabilize unsupervised motion learning against error accumulation. Extensive experiments demonstrate that Envision4D achieves state-of-the-art performance, significantly outperforming existing methods in future view synthesis.

51. 【2606.10653】STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model

Link: https://arxiv.org/abs/2606.10653

Authors: Hailan Zhang, Haipeng Liu, Bo Fu, Yang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherent model limitations, produce high-quality images, produce high-quality, fail to faithfully, faithfully reflect

Comments: 8 pages, 8 figures, to appear at IJCNN 2026

Abstract:Although pretrained text-to-image (T2I) generation models can produce high-quality images, they often fail to faithfully reflect the semantic intent of complex prompts due to stochastic noise and inherent model limitations. This issue frequently manifests as the model overlooking specific objects or failing to correctly bind attributes to their corresponding entities, a challenge referred to as semantic alignment. Unlike existing approaches that rely on computationally expensive fine-tuning or labor-intensive layout priors, we propose STEDiff, a training-free method designed to enhance semantic representations directly within the text-embedding space. Specifically, we introduce a method that primarily leverages the [EOT] token to strengthen the relevant semantics of sub-sentences and then replaces the corresponding tokens in the original prompt. Furthermore, a novel semantic enhancement loss is incorporated to enforce spatial constraints, ensuring that the semantics of each entity are precisely mapped to their respective image regions. Extensive quantitative and qualitative evaluations on the T2I-CompBench demonstrate that our method notably improves semantic consistency and generation integrity in complex scenarios.

52. 【2606.10651】Kwai Keye-VL-2.0 Technical Report

Link: https://arxiv.org/abs/2606.10651

Authors: Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: foundation model designed, DeepSeek Sparse Attention, introduce Kwai, designed to advance, multimodal foundation model

Comments: 31 pages, 11 figures

Abstract:We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

53. 【2606.10645】ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.10645

Authors: Wenhao Hu, Haonan Zhou, Liu Liu, Yun Du, Xinjie Wang, Ziang Li, Zhizhong Su, Gaoang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world observations remains, real-world observations, computer vision, Reconstructing dynamic, observations remains

Abstract: Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks, which alternate between Motion and Skill phases, to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.

54. 【2606.10640】ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Link: https://arxiv.org/abs/2606.10640

Authors: Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yupeng Hu, Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: DataMFM Challenge Track, DataMFM Challenge, Challenge Track, present our champion, champion solution

Abstract:In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: this https URL.

55. 【2606.10628】Leveraging Metric Depth for Relative Depth Prediction

Link: https://arxiv.org/abs/2606.10628

Authors: Xiaoyang Bi, Shuaikun Liu, Zhaohong Liu, Yuxin Yang, Zhe Zhao, Mengshi Qi, Liang Liu, Huadong Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimation Competition Challenge, Monocular Depth Estimation, Depth Estimation Competition, SoccerNet Monocular Depth, Estimation Competition

Abstract:We present our solution to the 2025 SoccerNet Monocular Depth Estimation Competition Challenge. Predicting the relative depth in football scenarios is challenging, especially with only thousands of training samples available. To address this issue, our method leverages the powerful zero-shot capabilities of models pretrained on large-scale datasets to learn metric depth for effective relative depth prediction, achieving a score of $2.68 \times 10^{-3}$ on the challenge set.

56. 【2606.10620】Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

Link: https://arxiv.org/abs/2606.10620

Authors: Xinrui Wu, Lichen Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains poorly understood, produce high-quality static, high-quality static images, time remains poorly, poorly understood

Abstract:Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

57. 【2606.10617】SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

Link: https://arxiv.org/abs/2606.10617

Authors: Zhengxuan Wei, Yi Dong, Zonghui Li, Xianhui Lin, Xing Liu, Hong Gu, Shaofeng Zhang, Wenbin Li, Qi Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-Rank Adaptation, efficiently combine diverse, combine diverse generative, diverse generative capabilities, multiple trained LoRAs

Comments: Accepted at ICML 2026

Abstract:Low-Rank Adaptation (LoRA) merging can efficiently combine diverse generative capabilities from multiple trained LoRAs for a diffusion model. However, existing LoRA merging techniques often suffer from severe parameter interference, causing destructive collisions in the shared parameter space. To address this, we propose Subspace Signal Routing (SSR), which resolves interference by routing internal signals instead of performing parameter-space merge. Specifically, SSR first constructs a unified subspace by concatenating candidate LoRAs along the rank dimension. Next, SSR employs an inverse correlation matrix to decorrelate mixed signals within this space. Finally, a directional guide matrix steers these purified signals into their respective task-specific subspaces. We provide a rigorous theoretical analysis proving that SSR aligns with the Ordinary Least Squares (OLS) solution, thereby ensuring mathematical optimality. We utilize the additivity of sufficient statistics to design a streaming algorithm. This enables on-the-fly updates that significantly reduce memory overhead and computation time. Extensive experiments validate that SSR significantly outperforms state-of-the-art methods while maintaining comparable efficiency. Code is available at this https URL.
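The first SSR step described above, concatenating LoRAs along the rank dimension, can be illustrated with a small pure-Python sketch. It only checks the linear-algebra identity that the concatenated factors reproduce the sum of the individual low-rank updates; the decorrelation and routing steps of SSR are not reproduced here. The helper names (`matmul`, `concat_along_rank`) and the toy matrices are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def concat_along_rank(b_list: List[Matrix],
                      a_list: List[Matrix]) -> Tuple[Matrix, Matrix]:
    """Stack the B factors column-wise and the A factors row-wise."""
    b_cat = []
    for i in range(len(b_list[0])):
        row: List[float] = []
        for b in b_list:
            row.extend(b[i])
        b_cat.append(row)
    a_cat = [row for a in a_list for row in a]
    return b_cat, a_cat


# Two toy rank-1 LoRAs for a 2 x 3 weight (hypothetical values).
B1, A1 = [[1.0], [2.0]], [[1.0, 0.0, 1.0]]
B2, A2 = [[0.0], [1.0]], [[2.0, 1.0, 0.0]]

b_cat, a_cat = concat_along_rank([B1, B2], [A1, A2])
merged = matmul(b_cat, a_cat)
summed = add(matmul(B1, A1), matmul(B2, A2))
print(merged == summed)  # True
```

Without anything inserted between the stacked factors, concatenation is equivalent to plain additive merging. According to the abstract, SSR differs by decorrelating and routing the signals inside this unified rank space.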

58. 【2606.10614】Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Link: https://arxiv.org/abs/2606.10614

Authors: Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Robotic foundation models, foundation models pre-trained, Robotic foundation, significant embodiment gap, embodiment gap remains

Abstract:Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

59. 【2606.10612】GaussTrace: Provenance Analysis of 3D Gaussian Splatting Models with Evidence-based LLM Reasoning

Link: https://arxiv.org/abs/2606.10612

Authors: Haoliang Han, Ziyuan Luo, Renjie Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, creating high-fidelity, powerful technique, technique for creating, Gaussian

Comments: Accepted by ICML 2026

Abstract:3D Gaussian Splatting (3DGS) is a powerful technique for creating high-fidelity 3D assets. However, the widespread sharing and iterative modification of 3DGS models across digital platforms create pressing challenges for intellectual property protection and forensic traceability. To address this, we propose GaussTrace, a novel framework for constructing directed provenance graphs for 3DGS models. GaussTrace formulates provenance analysis as an evidence-based reasoning problem. It builds upon attribute-wise statistical profiling of 3DGS parameters to capture intrinsic properties. Moreover, we introduce hypothesis-driven editing simulations of common operations to provide auxiliary evidence for plausible transformation pathways. These statistical and simulated cues jointly enable a Large Language Model (LLM) to perform structured Chain-of-Thought (CoT) reasoning, yielding directional provenance inferences and explainable edge reasons. Experimental results demonstrate that GaussTrace effectively constructs evolutionary relationships among diverse 3DGS models, delivering accurate, interpretable, and robust provenance graphs without requiring model training or access to editing histories. Project page: this https URL.

60. 【2606.10611】Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

Link: https://arxiv.org/abs/2606.10611

Authors: Auguste Lehuger, Guillaume Henon-Just

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: minimal geometrical guidance, irregular nesting problem, nesting problem share, continuous placement space, Traditional heuristic solvers

Comments: 15 pages, 4 figures, 5 tables. Under review at the European Workshop on Reinforcement Learning (EWRL)

Abstract:Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforcement Learning is uniquely positioned to overcome this bottleneck. By pairing an optimization policy with a geometry-aware neural encoder, an agent can automatically discover rich geometric priors directly from data, utilizing these learned intuitions to strategically guide exploration. To realize this, we introduce the Polygons Transformer (PoT), a novel architecture that encodes 2D continuous vector geometries while allowing cross-polygons attention. We couple this novel architecture with a Combinatorial Optimization Reinforcement Learning (CORL) training framework to find optimal solutions. To support this paradigm, we release an open-source training dataset derived from complex geographic contours alongside a dedicated evaluation benchmark. Our empirical validation demonstrates that our trained agent achieves area utilization performance highly competitive with Sparrow, the state-of-the-art heuristic solver, proving that reinforcement learning can successfully discover and exploit geometric awareness for precise spatial tasks.
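The abstract reports results as area utilization. As a rough illustration, the sketch below computes polygon areas with the shoelace formula and divides their sum by the container area. That definition of utilization is an assumption for illustration (the abstract does not spell out the metric), the function names are hypothetical, and the code does not check that the placed polygons are free of overlap.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def area_utilization(polygons: List[List[Point]],
                     container_area: float) -> float:
    """Assumed metric: total placed polygon area over container area."""
    return sum(polygon_area(p) for p in polygons) / container_area


# A right triangle and a unit square placed inside a 2 x 2 container.
triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
square = [(1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (1.0, 2.0)]
print(area_utilization([triangle, square], container_area=4.0))  # 0.75
```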

61. 【2606.10602】Globally Localizing Lunar Rover in Pixels via Graph Alignment

Link: https://arxiv.org/abs/2606.10602

Authors: Mao Chen, Xu Yang, Chuankai Liu, Xiangkai Zhang, Xiaoxue Wang, Zheng Bo, Zuoyu Zhang, Zhiyong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Navigation Satellite System, Global Navigation Satellite, constrain long-range missions, methods severely constrain, severely constrain long-range

Abstract:Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long-range missions. Cross-view localization provides a promising drift-free global solution by matching rover-view and satellite-view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross-view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero-shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real-world data from the YuTu-2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one-pixel precision in low-resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS-level update frequency. Finally, we observe that WARG naturally develops low-level spatial awareness, including semantic segmentation and structural reasoning, through cross-view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at this https URL.
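The "nearly one-pixel" claim and the parameter comparison can be checked with simple arithmetic on the figures quoted in the abstract. The short sketch below does this. The function names are hypothetical, and the implied size of the previous lightweight model is only an estimate derived from the stated 16.12% ratio, not a number given in the abstract.

```python
def error_in_pixels(error_m: float, resolution_m_per_px: float) -> float:
    """Convert a metric localization error into satellite-image pixels."""
    return error_m / resolution_m_per_px


def implied_baseline_params(params: float, fraction: float) -> float:
    """Parameter count of the reference model implied by a size ratio."""
    return params / fraction


# Figures quoted in the abstract.
px = error_in_pixels(1.68, 1.40)
baseline = implied_baseline_params(1.56e6, 0.1612)

print(round(px, 2))              # 1.2 pixels
print(round(baseline / 1e6, 2))  # about 9.68 million parameters
```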

62. 【2606.10594】Segment and Select: Vision-Language Segmentation in 3D Scenarios

Link: https://arxiv.org/abs/2606.10594

Authors: Yulin Chen, Zhihang Zhong, Yuenan Hou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: segment target objects, vision-language segmentation aims, aims to segment, segment target, target objects

Comments: The core idea is to reformulate 3D vision-language segmentation as the segment-and-select paradigm (free from the superpoint dependency)

Abstract: 3D vision-language segmentation aims to segment target objects in 3D scenarios according to linguistic instructions and visual observations. Prior work relies heavily on coarse superpoint representations to reduce computational complexity, which leads to poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation, which operates directly on fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over the superpoint counterparts. Then, a Large Language Model (LLM) is utilized to generate the semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Finally, the Loopback Verification Module (LVM) yields the segmentation mask from the selected candidate masks. SEGA3D attains competitive performance on the ScanRefer, ScanNet and Matterport3D benchmarks. Notably, SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Code will be available upon publication.

63. 【2606.10571】Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Link: https://arxiv.org/abs/2606.10571

Authors: Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: Vision-Language Pre-training, improving robustness, reveal vulnerabilities, vulnerabilities in Vision-Language, provide insights

Comments: 17 pages, 7 figures, 10 tables

Abstract:Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
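The gradient-correction step in the abstract (removing the aligned projection of the main gradient on the reference gradient) can be sketched as a vector projection. The code below is a minimal illustration, not the authors' implementation. It assumes that "aligned" means a positive dot product, so nothing is removed when the two gradients point in opposing directions, and the function name is hypothetical.

```python
from typing import List


def remove_aligned_projection(main_grad: List[float],
                              ref_grad: List[float],
                              eps: float = 1e-12) -> List[float]:
    """Subtract from main_grad its component along ref_grad.

    Assumption: the projection is removed only when the two gradients
    are positively aligned (dot product > 0).
    """
    dot = sum(m * r for m, r in zip(main_grad, ref_grad))
    ref_sq = sum(r * r for r in ref_grad)
    if dot <= 0.0 or ref_sq < eps:
        return list(main_grad)
    scale = dot / ref_sq
    return [m - scale * r for m, r in zip(main_grad, ref_grad)]


# The shared component along the reference direction is removed.
print(remove_aligned_projection([1.0, 1.0], [1.0, 0.0]))   # [0.0, 1.0]
# A gradient pointing away from the reference is left unchanged.
print(remove_aligned_projection([-1.0, 1.0], [1.0, 0.0]))  # [-1.0, 1.0]
```

After the correction, the returned vector is orthogonal to the reference gradient whenever a projection was removed.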

64. 【2606.10550】PrismAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

Link: https://arxiv.org/abs/2606.10550

Authors: Chufeng Fang, Dongdong Teng, Lilin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: reduce remote users, require specialized capture, specialized capture rigs, Real-time stereoscopic video, stereoscopic video communication

Comments: 10 pages, 5 figures, 3 tables

Abstract:Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present PrismAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, PrismAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, PrismAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.

65. 【2606.10541】GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds

Link: https://arxiv.org/abs/2606.10541

Authors: Wanpeng Shao, Zeyi Guo, Bo Zhang, Yifei Xue, Tie Ji, Yizhen Lao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Terrestrial Laser Scanning, point clouds captured, severely degrading downstream, degrading downstream applications, urban environments frequently

Abstract: Terrestrial Laser Scanning (TLS) point clouds captured in urban environments frequently suffer from glass-induced reflection artifacts, severely degrading downstream applications. Existing reflection artifact removal methods generally rely on ideal reflection symmetry assumptions, yet their performance is limited by inaccurate glass estimation and insufficient geometric representations. To address these issues, we propose a novel unified two-stage framework for robust reflection artifact removal. In the first stage, we leverage a multi-modal vision foundation model to produce initial glass masks, which are then refined using geometric cues to achieve high-precision glass regions, followed by glass completion to recover missing regions caused by no-return measurements on transparent surfaces. In the second stage, we propose a physics-driven descriptor, termed Reflection-aware Local-Global Geometric Similarity (RE-LGGS), which is grounded in actual laser reflection geometry and jointly encodes multi-scale geometric structures and orientation consistency using PCA-based local shape representations, thereby significantly improving robustness against imperfect observations. Extensive experiments on multiple public TLS datasets demonstrate that our framework consistently outperforms state-of-the-art methods in reflection artifact removal.
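The ideal reflection symmetry assumption that the abstract attributes to existing methods can be sketched as mirroring a point across a planar glass surface. The code below shows only that textbook geometry; it is not the RE-LGGS descriptor. The function name is hypothetical and the plane normal is assumed to be unit length.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def mirror_across_plane(point: Vec3, plane_point: Vec3,
                        unit_normal: Vec3) -> Vec3:
    """Mirror `point` across the plane through `plane_point`.

    Assumes `unit_normal` has length 1. Under ideal symmetry, a
    reflection artifact is the mirror image of a real point.
    """
    # Signed distance from the point to the plane.
    dist = sum((p - q) * n
               for p, q, n in zip(point, plane_point, unit_normal))
    x, y, z = (p - 2.0 * dist * n for p, n in zip(point, unit_normal))
    return (x, y, z)


# Glass pane lying in the z = 0 plane.
ghost = mirror_across_plane((1.0, 2.0, 3.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(ghost)  # (1.0, 2.0, -3.0)
```

Mirroring the result a second time returns the original point, which is a quick sanity check on the plane parameters.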

66. 【2606.10533】Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

Link: https://arxiv.org/abs/2606.10533

Authors: Zihan Meng, Dexiang Hong, Weidong Chen, Ziyu Zhou, Bo Hu, Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: captioning generates natural, generates natural language, natural language descriptions, Audio-visual captioning generates, audio content

Abstract: Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet hard threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly confusing tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. AVEX-Prune uses an audio-visual token exchange strategy to select truly valuable tokens: it replaces low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality and measures the resulting differences in caption generation. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).
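To make the efficiency argument concrete, the sketch below shows the plain hard-threshold top-k retention that the abstract describes as the usual baseline, together with the fraction of query-key pairs left under quadratic attention. It is not AVEX-Prune's exchange strategy. The function names and scores are hypothetical, and the cost estimate counts only the pruned audio-visual tokens, ignoring any text tokens in the prompt.

```python
from typing import List


def retain_top_k(scores: List[float], ratio: float) -> List[int]:
    """Keep indices of the highest-scoring tokens (hard selection)."""
    k = max(1, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore original token order


def attention_pair_fraction(ratio: float) -> float:
    """Share of query-key pairs remaining when cost scales as n squared."""
    return ratio ** 2


kept = retain_top_k([0.1, 0.9, 0.5, 0.3, 0.7], ratio=0.4)
print(kept)                                    # [1, 4]
print(round(attention_pair_fraction(0.4), 2))  # 0.16
```

Under these assumptions, keeping 40% of the tokens leaves roughly 16% of the pairwise attention work in the prefill stage.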

67. 【2606.10522】GUI-AC: Enhancing Continual Learning in GUI Agents

Link: https://arxiv.org/abs/2606.10522

Authors: Can Lin, Tao Feng, Hangjie Yuan, Dan Zhang, Yifan Zhu, Zhonghong Ou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interfaces, Graphical User, real-world interface environments, building GUI agents, User Interfaces

Abstract: Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage exploration. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at this https URL.
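The effect of a fixed clipping bound, and of relaxing it, can be seen in the standard PPO-style clipped surrogate for a single sample. The sketch below is only that generic objective with separate lower and upper bounds. How GUI-AC derives its bound from grounding certainty is not described in the abstract, so `eps_high` here is a hypothetical hand-set parameter.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """PPO-style clipped objective for one sample.

    ratio     : new-policy probability divided by old-policy probability
    advantage : estimated advantage of the sampled action
    eps_high  : upper clipping bound (hypothetical knob to relax)
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# With a fixed bound, raising the probability ratio past 1.2 earns nothing more.
print(round(clipped_surrogate(1.5, 1.0), 2))                # 1.2
# A relaxed upper bound lets the same update keep contributing.
print(round(clipped_surrogate(1.5, 1.0, eps_high=0.6), 2))  # 1.5
```

In the first call the objective is flat beyond the bound, so the gradient with respect to the ratio vanishes there. This is one way to read the abstract's remark that a fixed bound suppresses the probability increases needed to adapt to new distributions.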

68. 【2606.10517】LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

Link: https://arxiv.org/abs/2606.10517

Authors: Jiexi Lyu, Xizhou Bu, Qingqiu Huang, Chufeng Tang, Xiaoshuai Hao, Hongbo Wang, Wei Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale unlabeled videos, limited real-world interaction, real-world interaction data, scalable latent policy, latent policy learning

Abstract: Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavior cloning, which tends to collapse inherently multimodal action distributions into unimodal ones, thereby degrading the pretrained latent action structure. While flow matching provides a potential alternative, directly applying it leads to a misalignment between latent actions and physical actions during action decoder training, due to the stochastic nature of the learned policy. To address these issues, we propose Latent Action Flow Policy (LAFP), which leverages flow matching for latent policy learning and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. Experimental results demonstrate that LAFP consistently outperforms prior methods on downstream imitation learning tasks, achieving up to 10-15% improvement in success rate while incurring less than 1x additional inference overhead.
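For readers unfamiliar with flow matching, the sketch below shows the common linear-path formulation: a point on the straight line between a noise sample and a target action, the constant velocity a network is trained to predict along that line, and the squared-error loss. This is the generic textbook objective, assumed here for illustration. It is not LAFP's policy or its inference-time interpolation mechanism, and all names are hypothetical.

```python
from typing import List


def interpolate(x0: List[float], x1: List[float], t: float) -> List[float]:
    """Point at time t on the straight path from noise x0 to target x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: List[float], x1: List[float]) -> List[float]:
    """Velocity of the linear path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: List[float],
                       x0: List[float], x1: List[float]) -> float:
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


noise, action = [0.0, 0.0], [2.0, 4.0]
print(interpolate(noise, action, 0.5))                # [1.0, 2.0]
print(flow_matching_loss([2.0, 4.0], noise, action))  # 0.0
print(flow_matching_loss([0.0, 0.0], noise, action))  # 10.0
```

Because each training pair starts from a different noise sample, a model trained this way can represent several action modes for the same observation. This is the property the abstract contrasts with behavior cloning.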

69. 【2606.10492】PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

Link: https://arxiv.org/abs/2606.10492

Authors: Haodong Lei, Hongsong Wang, Bingxuan Dai, Pan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly increasing computational, increasing computational costs, speculative Jacobi decoding, significantly increasing, resulted in extended

Comments: 10 pages, 5 figures

Abstract: The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax) that exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves speedup ratios of $4.14\times$, $3.95\times$, and $4.18\times$, respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods such as GSD and LANTERN in speedup ratio. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at this https URL.

70. 【2606.10488】5% 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

Link: https://arxiv.org/abs/2606.10488

Authors: Yifan Zhu, Can Lin, Hangjie Yuan, Zixiang Zhao, Pengfei Zhang, Tao Feng, Zhonghong Ou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal downstream tasks, adapting large models, domain-specific multimodal downstream, Parameter-Efficient Fine-Tuning, downstream tasks

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods have proved effective in practice, their underlying principles remain under-explored. We therefore investigate the generalization mechanisms underlying various PEFT methods and how they can be further enhanced. In this paper, we reveal the flatness preference widely present in various PEFTs, where a small fraction of sharp dimensions dominates the generalization of PEFT. This finding suggests an appealing possibility: we may achieve better generalization by attending only to this small fraction of sharp dimensions instead of all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharp dimensions, leading various PEFTs toward better generalization. Extensive experiments demonstrate the effectiveness of our findings and the proposed method. Code is available at this https URL.

71. 【2606.10478】3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

Link: https://arxiv.org/abs/2606.10478

Authors: Yuhao Wang, Puyi Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yu Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: editing systems operate, point clouds, systems operate, operate on implicit, implicit and explicit

Comments: Preprint. 24 pages, 11 figures

Abstract:Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

72. 【2606.10468】Geometric Coastline Localization using Vision-Language Models

Link: https://arxiv.org/abs/2606.10468

Authors: Rafia Malik, Bernhard Pfahringer, Karin Bryan, Mark Dickson, Eibe Frank

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pixel-wise segmentation problem, remote sensing imagery, remote sensing, commonly formulated, Coastline

Abstract:Coastline detection in remote sensing imagery is commonly formulated as a pixel-wise segmentation problem, where the final coastline is extracted from a predicted mask through post-processing. This formulation relegates coastline geometry, the primary representation used in coastal change analysis, to a secondary artifact rather than the learning objective. In practice, coastlines are defined by geomorphic proxies such as vegetation lines, dune toes, or cliff edges, rather than an instantaneous land-water boundary often used in pixel-based segmentation approaches. In this work, we revisit coastline extraction from a representation perspective and formulate the task as geometric boundary localization. We use the New Zealand Coastal Change Dataset (NZCCD) and high-resolution aerial imagery from Land Information New Zealand (LINZ) to develop CoastlineVLM-7B, a vision-language model (VLM) built on the GeoChat-7B/LLaVA-1.5 architecture that jointly performs coastline presence detection, proxy-type classification, and coastline grounding. The model directly predicts a coastline as a polyline rather than a dense segmentation mask. We evaluate CoastlineVLM-7B against segmentation baselines under strict one-pixel boundary supervision. Results show that geometry-based metrics are more suitable for assessing coastline localization quality than pixel-overlap metrics such as Intersection over Union (IoU). CoastlineVLM-7B improves global geometric alignment with reference coastlines, reducing Hausdorff distance from 37.74 m to 31.84 m and Earth Mover's Distance from 21.12 m to 17.32 m. These results indicate that output representation is a critical design choice in coastline extraction, and that geometry-oriented learning, combined with the semantic reasoning capabilities of vision-language models, aligns well with how coastlines are defined and evaluated in operational coastal monitoring.
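As a rough illustration of the geometry-based metrics mentioned in this abstract, the sketch below computes a symmetric Hausdorff distance between two polylines treated as plain vertex sets. The function name and the sample coordinates are hypothetical, and this is not the paper's evaluation code, which may sample points densely along segments and work in metres.

```python
import math


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty lists of (x, y) points."""

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# Hypothetical predicted and reference coastline vertices (arbitrary units).
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
print(hausdorff_distance(predicted, reference))
```

Unlike IoU on a one-pixel-wide mask, this kind of metric degrades gradually as a predicted line drifts away from the reference, which is one plausible reading of why the abstract prefers geometry-based scores.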

73. 【2606.10450】Few-step Generative Models as Lossy Compression

Link: https://arxiv.org/abs/2606.10450

Authors: Fuma Kimishima, Jinjia Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: procedures remain slow, Consistency Trajectory Models, decoding procedures remain, reuse pre-trained diffusion, Rectified Flow

Abstract:DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative models -- Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow -- can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime.

74. 【2606.10431】Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

Link: https://arxiv.org/abs/2606.10431

Authors: Shuangchun Gui, Zhiguang Cao, Wen Song, Yew-Soon Ong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: vehicle routing problems, routing problems play, Multi-task vehicle routing, service sectors, play a critical

Comments: Accepted by TNNLS

Abstract:Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from the vision images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalanced issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.

75. 【2606.10407】Time-frequency localization of bird calls in dense soundscapes

Link: https://arxiv.org/abs/2606.10407

Authors: Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: limiting downstream analyses, Passive acoustic monitoring, monitoring enables large-scale, enables large-scale observation, predict species presence

Abstract:Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
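The abstract does not define IoMin, so the sketch below assumes the reading suggested by the name: intersection area divided by the area of the smaller box, compared with standard IoU. The box layout `(t_start, f_low, t_end, f_high)` and all function names are hypothetical choices for illustration.

```python
def box_area(box):
    """Area of a (t_start, f_low, t_end, f_high) box; zero if degenerate."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection_area(a, b):
    """Area of the overlap between two boxes; zero if they do not overlap."""
    t0, f0 = max(a[0], b[0]), max(a[1], b[1])
    t1, f1 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def iou(a, b):
    """Standard Intersection over Union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Assumed IoMin: intersection divided by the smaller box area."""
    smaller = min(box_area(a), box_area(b))
    return intersection_area(a, b) / smaller if smaller > 0 else 0.0


# A tight annotation fully contained in a looser prediction (seconds, Hz).
tight = (1.0, 2000.0, 2.0, 4000.0)
loose = (0.5, 1500.0, 2.5, 4500.0)
print(iou(tight, loose), iomin(tight, loose))
```

Under this assumed definition, a box that fully contains the other scores 1.0 regardless of how loosely it was drawn, which would make the metric tolerant of ambiguous call boundaries but also blind to oversized predictions.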

76. 【2606.10401】CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Link: https://arxiv.org/abs/2606.10401

Authors: Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, multimodal large language, language models, key frontier, frontier for multimodal

Abstract:Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.

77. 【2606.10400】Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Link: https://arxiv.org/abs/2606.10400

Authors: Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: memorized world knowledge, inflates benchmark scores, ungrounded answers, Vision-language models, world knowledge

Comments: 17 pages, 7 figures, Submitted to EMNLP 2026

Abstract:Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

78. 【2606.10395】Efficient RWKV-based Representation Learning for 3D Point Clouds

Link: https://arxiv.org/abs/2606.10395

Authors: Yun Liu, Xuefeng Yan, Liangliang Nan, Xianzhi Li, Peng Li, Zhe Zhu, Honghua Chen, Mingqiang Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformers' quadratic self-attention, combines RNN-style recurrence, recent receptance weighted, model combines RNN-style, alternative to Transformers'

Abstract:The recent receptance weighted key value (RWKV) model combines RNN-style recurrence, offering a linear-complexity alternative to Transformers' quadratic self-attention for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the P-RWKV block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance.

79. 【2606.10378】FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

Link: https://arxiv.org/abs/2606.10378

Authors: Jiawei Liu, Zhijiang Wan, Junhua Hu, Rongli Zhang, Zhongbiao Xu, Yankun Cao, Yuan Chen, Jin Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: stroke risk assessment, risk assessment, critical for stroke, stroke risk, Accurate segmentation

Abstract:Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low SNR conditions, outperforming several state-of-the-art methods. The method accurately segments the carotid artery in ultrasound images, effectively identifies carotid atherosclerotic plaque, and is further validated on another task (breast cancer segmentation), suggesting good clinical potential for identifying abnormal tissue masses in ultrasound images.

80. 【2606.10373】PF-Trans: Physics-Embedded Frequency-Aware Transformer for Spectral Reconstruction

Link: https://arxiv.org/abs/2606.10373

Authors: Yuzhe Gui, Tianzhu Liu, Yanfeng Gu, Xian Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Snapshot Broadband Filter, Broadband Filter Array, Snapshot Broadband, Filter Array, high light throughput

Abstract:Snapshot Broadband Filter Array (BFA) imaging provides high light throughput for spectral reconstruction but introduces severe spectral aliasing due to complex modulation. Current deep learning approaches, limited to spatial denoising, often fail to address the global frequency-specific degradations caused by the mask structure. To address this, we propose a Physics-embedded Frequency-aware Transformer (PF-Trans) for high-fidelity remote sensing spectral reconstruction. Our method explicitly integrates the physical sensing model through mask injection and a gray-scale consistency loss to ensure physical fidelity. Furthermore, we introduce a Dual-domain Block with a parallel Fast Fourier Transform (FFT) branch, enabling the network to perceive and suppress aliasing artifacts in the frequency domain. Extensive experiments on multiple datasets demonstrate that PF-Trans achieves state-of-the-art performance, achieving a Peak Signal-to-Noise Ratio (PSNR) of up to 48.50 dB on the GF-5 Shanghai dataset, significantly outperforming comparison methods.

81. 【2606.10372】ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

Link: https://arxiv.org/abs/2606.10372

Authors: Xianye Xiao, Yulong Zou, Yujie Luo, Taihui Yu, Cun-Jing Zheng, Yuan-ming Geng, Shuihua Wang, Yudong Zhang, Jin Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: no-reference image quality, image quality assessment, mimics doctors' reading, No-reference IQA, image quality

Abstract:In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors' reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of "considering both local details and overall context"; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists' image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.
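The three correlation coefficients and the aggregate Score reported in this abstract can be sketched in plain Python as follows. This is a minimal illustration that assumes no tied values and uses the tau-a form of Kendall's coefficient; the function names and sample ratings are hypothetical, and the benchmark's official scoring code may handle ties differently.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def spearman(x, y):
    """Spearman's rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall's rank-order correlation (KROCC), tau-a variant without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


def overall_score(plcc, srocc, krocc):
    """Sum of absolute values, as the abstract describes the Score."""
    return abs(plcc) + abs(srocc) + abs(krocc)


predicted = [0.5, 1.2, 2.1, 2.9, 4.2]  # hypothetical model outputs
labels = [1.0, 0.0, 2.0, 3.0, 4.0]     # hypothetical quality ratings
print(pearson(predicted, labels), spearman(predicted, labels), kendall(predicted, labels))
print(round(overall_score(0.9507, 0.9554, 0.8629), 4))
```

The last line reproduces the arithmetic behind the reported Score: 0.9507 + 0.9554 + 0.8629 = 2.7690.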

82. 【2606.10364】Benchmarking stereo reconstruction for 3D printable Martian terrain models

Link: https://arxiv.org/abs/2606.10364

Authors: Josephine Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mars rover imagery, models from Mars, Mars rover, NASA Curiosity images, partially observed

Comments: 9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026

Abstract:Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.
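The two stereo metrics quoted for Middlebury, disparity MAE and valid prediction coverage, can be sketched as below. The exact definitions used in the paper are not given in the abstract, so this assumes MAE is averaged over pixels where both a prediction and ground truth exist, and coverage is the fraction of ground-truth pixels that received a prediction. The function name and the sample values are hypothetical.

```python
def disparity_metrics(pred, truth):
    """Return (MAE in pixels, coverage) for two equal-length disparity lists.

    None marks a pixel with no prediction (in pred) or no ground truth (in truth).
    """
    # Only pixels with ground truth can be evaluated at all.
    evaluable = [(p, t) for p, t in zip(pred, truth) if t is not None]
    # Of those, keep the ones where the matcher produced a disparity.
    valid = [(p, t) for p, t in evaluable if p is not None]
    coverage = len(valid) / len(evaluable) if evaluable else 0.0
    mae = sum(abs(p - t) for p, t in valid) / len(valid) if valid else float("nan")
    return mae, coverage


# Hypothetical flattened disparity maps.
pred = [10.0, None, 12.5, 8.0]
truth = [11.0, 9.0, 12.0, None]
mae, coverage = disparity_metrics(pred, truth)
print(mae, coverage)
```

A sparse matcher can look accurate under this MAE simply by declining to predict hard pixels, which is why the abstract reports coverage alongside the error.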

83. 【2606.10350】Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

Link: https://arxiv.org/abs/2606.10350

Authors: Zhenqiang Qin, Chenguang Dai, Min Wang, Xian Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: UAV multispectral imagery, multispectral imagery naturally, low flight altitude, UAV multispectral, introduce geometry-driven radiometric

Abstract:UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SfM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and near-infrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.

84. 【2606.10329】Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

Link: https://arxiv.org/abs/2606.10329

Authors: Yunlong Liu, Zekai Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: destructive natural disasters, short imaging interval, natural disasters, recent years, causing serious economic

Abstract:As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

85. 【2606.10328】Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

Link: https://arxiv.org/abs/2606.10328

Authors: Yunlong Liu, Zekai Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: spectral differences, change detection performance, spectral difference information, spectral, spectral difference

Abstract:The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios.

86. 【2606.10309】Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection

Link: https://arxiv.org/abs/2606.10309

Authors: Dahye Kim, Jaehyun Choi, Hyun Seok Seong, Seongho Kim, Donghun Lee, Sungwon Yi, Jang-Ho Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: report high performance, detectors report high, severely limits sensitivity, existing AI-generated image, AI-generated image detectors

Comments: 25 pages, 9 figures, 9 tables, Accepted to ICML 2026; includes appendix

Abstract:While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at this https URL.

87. 【2606.10299】What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Link: https://arxiv.org/abs/2606.10299

Authors: Doeon Kwon, Junho Bang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: Language-agent, memory palace, geometry adds, recall, memory

Comments: 23 pages, 6 figures

Abstract:Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000-0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
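To make the recall-versus-visibility distinction concrete, here is a minimal 2D grid analogue of a ray-versus-voxel DDA visibility check. It is a sketch under stated assumptions, not the paper's implementation: the system described is 3D, and the grid layout, the function name `is_visible`, and the example world are all hypothetical.

```python
import math


def is_visible(grid, start, target):
    """True if no wall cell lies between start and target.

    grid[y][x] is truthy for a wall; start and target are (x, y) floats in
    cell units and are assumed to lie inside the grid.
    """
    x, y = math.floor(start[0]), math.floor(start[1])
    tx, ty = math.floor(target[0]), math.floor(target[1])
    dx, dy = target[0] - start[0], target[1] - start[1]
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Ray parameter t runs from 0 (start) to 1 (target). t_max_* is the t at
    # which the ray crosses the next cell boundary on that axis; t_delta_* is
    # the t needed to cross one whole cell.
    t_delta_x = abs(1.0 / dx) if dx else math.inf
    t_delta_y = abs(1.0 / dy) if dy else math.inf
    t_max_x = ((x + (dx > 0)) - start[0]) / dx if dx else math.inf
    t_max_y = ((y + (dy > 0)) - start[1]) / dy if dy else math.inf
    # Reaching the target cell takes exactly this many boundary crossings.
    for _ in range(abs(tx - x) + abs(ty - y)):
        if t_max_x < t_max_y:
            x += step_x
            t_max_x += t_delta_x
        else:
            y += step_y
            t_max_y += t_delta_y
        if (x, y) == (tx, ty):
            break
        if not (0 <= y < len(grid) and 0 <= x < len(grid[0])):
            return False  # numerical safety: the ray left the map
        if grid[y][x]:
            return False  # an occupied cell blocks the line of sight
    return True


# A wall column at x == 2 separates two rooms.
world = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
agent = (0.5, 1.5)
print(is_visible(world, agent, (4.5, 1.5)))  # target behind the wall
print(is_visible(world, agent, (1.5, 0.5)))  # target in the same room
```

A memory stored at `(4.5, 1.5)` can still be recalled by coordinate even though the predicate reports it as not visible, which mirrors the separation the abstract argues for.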

88. 【2606.10275】FoA-SR: Faithful or Aesthetic? Profile-Aware Preference Optimization for Real-World Image Super-Resolution

链接https://arxiv.org/abs/2606.10275

作者:Amjad Mahdi Alqarni,Peizhong Ju

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:produce multiple high-quality, multiple high-quality reconstructions, single restoration objective, current capacity, capacity of generative

备注: 17 pages, 6 figures, 9 tables. Preprint

Abstract:Real-world image super-resolution (SR) is often designed with a single restoration objective, despite the current capacity of generative models to produce multiple high-quality reconstructions for the same input. In this paper, we argue that the best restoration strategy is subject to the specific restoration profile: a Faithful restoration prioritizes reference consistency, structure preservation, and hallucination suppression, whereas an Aesthetic restoration prioritizes visually pleasing and natural-looking details. We propose FoA-SR, a novel preference optimization approach to real-world SR based on profiles. To achieve this goal, FoA-SR starts with our supervised FLUX.2-based SR adapter (Flux2SR) trained with LR latent conditioning, flow matching, and image-space reconstruction losses for paired LR-to-HR image super-resolution. Following the development of the shared supervised super-resolution adapter, FoA-SR generates a shared stochastic candidate pool for each input image and ranks the same candidates using profile-specific Faithful and Aesthetic rewards to mine winner-loser pairs. These pairs are used to fine-tune separate LoRA adapters while keeping the base model frozen. Experiments on RealSR and DIV2K show that FoA-SR can steer the same SR adapter towards distinct restoration objectives: a Faithful adapter improves reference-consistent metrics while an Aesthetic adapter boosts metrics that measure perceptual quality without reference. Our candidate-pool analysis shows that Faithful and Aesthetic rewards frequently select different winners, and a Hybrid-LoRA ablation shows that collapsing both profiles into one reward yields an implicit compromise rather than explicit profile control.
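The pair-mining step can be illustrated with a small sketch. It assumes each candidate already carries two scalar scores and that the winner and loser are simply the top- and bottom-ranked candidates under a given reward; the abstract does not specify the selection rule, and `mine_pair`, the score names, and the values below are hypothetical.

```python
def mine_pair(candidates, reward):
    """Rank one shared candidate pool by a reward and return (winner, loser)."""
    ranked = sorted(candidates, key=reward)
    return ranked[-1], ranked[0]


# Hypothetical shared pool for a single input image, scored once per profile.
pool = [
    {"id": "a", "fidelity": 0.9, "aesthetic": 0.4},
    {"id": "b", "fidelity": 0.6, "aesthetic": 0.8},
    {"id": "c", "fidelity": 0.3, "aesthetic": 0.5},
]

# The same pool is ranked twice, once per profile-specific reward.
faithful_pair = mine_pair(pool, reward=lambda c: c["fidelity"])
aesthetic_pair = mine_pair(pool, reward=lambda c: c["aesthetic"])
```

With these made-up scores the Faithful reward prefers candidate `a` while the Aesthetic reward prefers `b` and ranks `a` last, so the two profiles yield different preference pairs from one pool.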

89. 【2606.10223】Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

链接https://arxiv.org/abs/2606.10223

作者:Awais Khan,Kutub Uddin,Khalid Malik

类目:Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:produce overconfident predictions, reject unseen synthesizers, closed-set models fail, Linear Filter Bank, Attributing a synthetic

备注

Abstract:Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.

90. 【2606.10200】An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

链接https://arxiv.org/abs/2606.10200

作者:Ahmed Faizul Haque,S.M. Riaz Rahman Antu,Saif Ahmed,Asadullah Hil Galib,Souvik Pramanik,Mohammad Ashrafuzzaman Khan,Mohammad Abdul Qayum,Mohsin Sajjad

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:multi-scale feature extraction, attention residual block, improved GAN-based imaging, feature extraction module, spatial attention residual

备注: 7 pages, 9 figures

Abstract:This paper presents an improved GAN-based method for restoring micro-resistivity imaging logging images with partially missing regions. The method uses an FCN as the generator backbone and adds a depthwise separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module to enlarge the network's multi-scale receptive field and reduce the number of parameters; and a multi-scale feature extraction module with a spatial attention residual block, which combines the channel attention mechanism with the residual block to achieve multi-scale feature extraction. A global discriminative network and a local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image through adversarial play with each other and the generative network. In the experiments, the average structural similarity across the five sets of imaging logging images with different sizes of missing regions in the test set is 0.903, an improvement of about 0.3 over other similar methods. The results show that the method can restore micro-resistivity imaging log images with improved semantic structural coherence and texture detail, providing a new deep learning method to support the subsequent interpretation of micro-resistivity imaging log images.

91. 【2606.10198】Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

链接https://arxiv.org/abs/2606.10198

作者:Nina I. Shamsi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Hallucination detection, selective prediction, Semantic Entropy, detection in large, large language

备注

Abstract:Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy, EigenScore) avoid labels but plateau in quality, while supervised probes (SAPLMA) attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, SAR, EigenScore, SAPLMA, and log-probability on seven QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using nine text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while degrading more gracefully under calibration-label scarcity.

92. 【2606.10196】Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

链接https://arxiv.org/abs/2606.10196

作者:Ghodsiyeh Rostami,Po-Han Chen,Mahdi S. Hosseini

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:adapt pretrained models, existing methods choose, fixed architectural heuristics, trainable parameter subset, small trainable parameter

备注

Abstract:Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset; however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce FisherAdapTune, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show FisherAdapTune improves the in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our code publicly to enable further application of our proposed approach.
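As a rough illustration of the drift criterion, the sketch below computes a Jensen-Shannon distance between two consecutive Fisher snapshots of one parameter group. It assumes each snapshot is a vector of non-negative diagonal Fisher values (for example, averaged squared gradients) that is normalized to sum to one, which is what makes the measure insensitive to overall scale. The freezing threshold `tol` and the function names are hypothetical, not taken from the paper.

```python
import math


def normalize(values):
    """Turn non-negative Fisher values into a probability distribution."""
    total = sum(values)
    return [v / total for v in values]


def js_distance(fisher_prev, fisher_curr):
    """Jensen-Shannon distance (base 2, so the range is [0, 1])."""
    p, q = normalize(fisher_prev), normalize(fisher_curr)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def has_stabilized(fisher_prev, fisher_curr, tol=0.05):
    """Hypothetical rule: freeze a group once its Fisher drift falls below tol."""
    return js_distance(fisher_prev, fisher_curr) < tol


# Rescaling every Fisher value by 10 leaves the normalized shape unchanged.
rescaled_drift = js_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# Mass moving entirely to another coordinate gives the maximum distance.
disjoint_drift = js_distance([1.0, 0.0], [0.0, 1.0])
```

If SciPy is available, `scipy.spatial.distance.jensenshannon` computes the same quantity and also normalizes its inputs.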

93. 【2606.10183】Making Time Editable in Video Diffusion Transformers

链接https://arxiv.org/abs/2606.10183

作者:Konstantin Kuklev,Viacheslav Vasilev,Alexander Kunitsyn,Andrei Ivaniuta,Denis Dimitrov

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:Modern Diffusion Transformers, Modern Diffusion, video generation provide, generation provide limited, Diffusion Transformers

备注

Abstract:Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

94. 【2606.10174】A Large Scale Open-Source Image and Video Dataset for Robust Wildfire Detection and Classification

链接https://arxiv.org/abs/2606.10174

作者:Emadeldeen Hamdan,Yingyi Luo,B. Ugur Toreyin,Erdem Koyuncu,Adam J. Watts,Ugur Gudukbay,Ahmet Enis Cetin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:mitigating fire spread, Global Wildfire Prevention, Wildfire Prevention Dataset, infrastructural damage, critical for mitigating

备注

Abstract:Wildfire detection and monitoring are critical for mitigating fire spread and reducing environmental and infrastructural damage. In this work, we introduce GWFP (Global Wildfire Prevention Dataset), a large-scale, open-source dataset of wildfire images and videos designed to support early fire and smoke detection research. GWFP contains geographically diverse wildfire scenes, including flames, smoke, Waterdog/Fog environmental conditions, Near Infrared (NIR) imagery, Ember, and challenging negative samples collected from real-world scenarios worldwide. To evaluate dataset robustness and cross-domain generalization, we benchmark multiple convolutional and transformer-based architectures across both in-domain and cross-dataset settings. Additionally, we explore lightweight frequency-spatial feature interaction using Hadamard-enhanced residual connections (HTE-ResNet) to analyze representation robustness under domain-shift conditions. Experimental results demonstrate strong cross-dataset generalization and practical utility for real-world wildfire monitoring applications. The dataset and source code will be publicly released upon acceptance.

95. 【2606.10167】FlexPath: Learned Semantic Path Priors for Image-Based Planning

链接https://arxiv.org/abs/2606.10167

作者:Taehyoung Kim,Tim Schoenbrod,David Eckel,Henri Meeß

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent learning-based path, Recent learning-based, yielding near-optimal paths, yielding near-optimal, neural networks

备注

Abstract:Recent learning-based path planners use neural networks to process visual map representations and approximate heuristics for classical search algorithms, yielding near-optimal paths with reduced search effort. However, these methods are tied to the shortest-path objective implicit in their supervision, which limits their flexibility to accommodate alternative criteria. We introduce FlexPath, a two-stage framework that decouples feasibility from preference. In Stage 1, we use imitation learning to acquire a task-independent spatial prior over feasible paths from visual map inputs. In Stage 2, differentiable Path Shape Objectives (PSOs) adapt this prior toward task-specific criteria without relearning path structure, requiring only efficient objective-level adaptation. A single pretrained model can be adapted to multiple objectives. For shortest-path planning, FlexPath reduces search effort on TMP by 14.3% compared to the state-of-the-art TransPath, while also finding lower-cost paths on average and demonstrating strong zero-shot generalization across three unseen domains. For obstacle clearance with minimum clearance distance 2, it achieves 96.8% full obstacle avoidance while maintaining low search cost. The framework further extends to semantic-aware avoidance and waypoint guidance via objective-level adaptation, and remains compatible with classical planners at inference time. Data and code are available at this https URL.

96. 【2606.10166】Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

链接https://arxiv.org/abs/2606.10166

作者:Quang Long Ho Ngo,Zimin Xia,Alexandre Alahi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current cross-view localization, Current cross-view, methods predominantly rely, predominantly rely, planimetric maps

备注

Abstract:Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders and demonstrate that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality's encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.

97. 【2606.10147】From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

链接https://arxiv.org/abs/2606.10147

作者:Wish Suharitdamrong,Muhammad Awais,Xiatian Zhu,Sara Atito

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

关键词:Multimodal Large Language, Large Language Models, Audio-Visual Large Language, Multimodal Large, Large Language

备注: 40 pages, 29 figures

Abstract:Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets and enabling more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales), leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

98. 【2606.10142】DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation

链接https://arxiv.org/abs/2606.10142

作者:Nanshan Jia,Zhenyu Zhao,Sui Huang,Jingshen Wang,Zeyu Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:assets remains underexplored, Recent advances, mesh evaluation, generation have led, improvements in realism

备注: CVPR 2026 workshop paper. 10 pages, 3 figures, 6 tables. Dataset available at GitHub and Hugging Face

Abstract:Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.

99. 【2606.10136】SAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

链接https://arxiv.org/abs/2606.10136

作者:Osmar Luiz Ferreira de Carvalho,Osmar Abilio de Carvalho Junior,Anesmar Olino de Albuquerque,Daniel Guerreiro e Silva

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Semantic segmentation, models rarely transfer, costly pixel-level annotations, remote sensing requires, sensing requires costly

备注: 47 pages, 8 tables, 6 figures

Abstract:Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1-100x, pseudo-labels across thresholds 0.90-0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

100. 【2606.10135】BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

链接https://arxiv.org/abs/2606.10135

作者:Shaohao Rui,Xiaofeng Mao,Zhanyu Zhang,Peijia Lin,Yansong Zhu,Yibo Zhang,Haibin Wan,Weijie Ma

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Transitioning bidirectional video, Transitioning bidirectional, autoregressive paradigm improves, video world models, Distribution Matching Distillation

备注

Abstract:Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

101. 【2606.10115】Improving PET/CT-Based Whole-Body Lesion Segmentation Using Prediction Uncertainty-Augmented Models

链接https://arxiv.org/abs/2606.10115

作者:Bashirul Azam Biswas,Biratal Raj Wagle,Zhihan Yang,Marc A. Seltzer,Matthew E. Maeder,James B. Yu,Indrani Bhattacharya

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Positron Emission Tomography, whole-body Positron Emission, Computed Tomography, Emission Tomography, Positron Emission

备注: 32 pages, 10 figures, 5 tables

Abstract:Accurate lesion segmentation from whole-body Positron Emission Tomography (PET)/Computed Tomography (CT) scans is essential for cancer staging and treatment planning. PET provides functional metabolic information with different radiotracers, while CT offers anatomical localization. Lesion delineation from PET/CT imaging is clinically challenging due to subtle imaging features, confounders, and inter-reader variability. Existing deep learning approaches suffer from training-related stochasticity, inconsistent predictions, missed lesions in high tumor-burden cases, and lack uncertainty quantification, limiting their clinical reliability. Using nnU-Net as a baseline, we propose an uncertainty-aware framework for whole-body PET/CT lesion segmentation that integrates (1) Bayesian ensembling to reduce training stochasticity, (2) voxel-wise uncertainty quantification with epistemic and aleatoric decomposition, and (3) epistemic uncertainty-augmented training to improve lesion detection. Two public datasets, AutoPET-III (1,611 scans) and Deep-PSMA (200 scans), comprising FDG and PSMA studies across multiple cancer types, are used for training and evaluation. Bayesian ensembling improves robustness and performance over deterministic nnU-Net models on the unseen AutoPET-III test set. Uncertainty maps highlight regions of model disagreement and correlate with misclassifications, particularly false positives. Uncertainty-augmented training improves lesion recovery at the cost of increased FPVol, reflecting a precision-recall trade-off. A case-adaptive routing strategy further improves Dice by selecting between the base and augmented models. To our knowledge, this is the first study to systematically investigate uncertainty quantification in multi-tracer, pan-cancer PET/CT segmentation and to combine Bayesian ensembling with uncertainty-aware modeling for this task.

102. 【2606.10107】Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

链接https://arxiv.org/abs/2606.10107

作者:Kaden Stillwagon,Alexandra D. VandeLoo,Craig R. Forest

类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

关键词:Reliable evaluation, reflect segmentation quality, consistently reflect segmentation, models requires metrics, accurately and consistently

备注

Abstract:Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
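The abstract does not give the MMA formula, so the sketch below only illustrates the two ingredients it names: a globally optimal one-to-one matching and per-pixel normalization. It assumes label maps stored as nested lists with 0 as background, and it assumes the score is the matched overlap in pixels divided by the number of pixels that are foreground in either map. The matching is found by brute force, which is only practical for a handful of objects. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def overlap_counts(gt, pred):
    """Count shared pixels for every (ground-truth id, predicted id) pair."""
    counts = Counter()
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g and p:
                counts[(g, p)] += 1
    return counts


def max_matched_overlap(gt_ids, pred_ids, counts):
    """Best total overlap over all one-to-one matchings (brute force)."""
    if len(gt_ids) <= len(pred_ids):
        options = (zip(gt_ids, perm) for perm in permutations(pred_ids, len(gt_ids)))
    else:
        options = (zip(perm, pred_ids) for perm in permutations(gt_ids, len(pred_ids)))
    return max(sum(counts[pair] for pair in option) for option in options)


def matching_accuracy(gt, pred):
    """Matched overlap divided by the pixels that are foreground in either map."""
    gt_ids = sorted({v for row in gt for v in row if v})
    pred_ids = sorted({v for row in pred for v in row if v})
    foreground = sum(
        1 for gt_row, pred_row in zip(gt, pred)
        for g, p in zip(gt_row, pred_row) if g or p
    )
    if foreground == 0:
        return 1.0  # Both maps are empty, so there is nothing to get wrong.
    return max_matched_overlap(gt_ids, pred_ids, counts=overlap_counts(gt, pred)) / foreground


# One ground-truth cell that the prediction splits into two objects.
split_score = matching_accuracy([[1, 1, 1, 1]], [[1, 1, 2, 2]])
# A perfect prediction whose label ids happen to be swapped.
swapped_score = matching_accuracy([[1, 1, 2, 2]], [[2, 2, 1, 1]])
```

Because the matching is global rather than greedy, the score does not depend on the order in which objects are visited. For realistic object counts, `scipy.optimize.linear_sum_assignment` with `maximize=True` solves the same assignment problem in polynomial time.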

103. 【2606.10088】Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinson's Disease Video Classification

链接https://arxiv.org/abs/2606.10088

作者:Riyadh Almushrafy (Majmaah University, Saudi Arabia)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reduced facial expressivity, common motor manifestation, Parkinson disease, manifestation of Parkinson, Reduced facial

备注: 22 pages, 6 figures. Submitted to Biomedical Signal Processing and Control

Abstract:Reduced facial expressivity is a common motor manifestation of Parkinson's disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 +/- 0.018 and AUROC of 0.855 +/- 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.

104. 【2606.10066】A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

链接https://arxiv.org/abs/2606.10066

作者:Bruce Changlong Xu,Lan Wu,Alexander Ryu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:reported accuracy assumes, Medical vision-language models, vision-language models, downloadable for years, absent from pretraining

备注: 30 pages, 7 figures, 9 tables. Preprint

Abstract:Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.

105. 【2606.10050】Continuous Neural Reparameterization as a Deep Geometric Prior for Robust Fixed-Chart UV Repair

链接https://arxiv.org/abs/2606.10050

作者:Mohammad Sadegh Salehi

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:geometric distortion energies, Traditional UV unwrapping, local minima, topological foldovers, relies on direct

备注

Abstract:Traditional UV unwrapping relies on direct optimization of geometric distortion energies and can fail through invalid initialization, local minima, or topological foldovers. We recast fixed-chart UV unwrapping as continuous neural reparameterization: an untrained SIREN maps per-vertex mesh features to UV coordinates, and its weights are optimized for a geometric objective. The practical contribution is a robust chart-solver recipe, combining Laplace–Beltrami spectral inputs, Tutte residual warm-up, a $C^2$ determinant extension, an injectivity barrier, and validity-checked retry/fallback routing, rather than a claim that any single component guarantees validity or that recutting methods should be replaced. NTK–LBO diagnostics show that spectral conditioning changes update geometry, especially at initialization and mid-rank subspaces, but does not by itself predict chart success. On compact pre-cut charts and a 47-chart stratified Thingi10K/xatlas-cut benchmark, the neural solver produces zero flips on all compact charts and 42/47 valid zero-flip stratified solves. BFF and OptCuts comparisons sharpen the scope: recutting can be faster and lower-distortion when allowed, while the neural solver targets supplied-chart validity and validation-first atlas construction. On Amara Spatial generated meshes, the full atlas construction path gives packed-atlas coverage on a 25-asset set and 1000/1000 strict locally valid atlases with zero UV flips in a large-scale Rust atlas run after fallback routing.

106. 【2606.10025】GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

Link: https://arxiv.org/abs/2606.10025

Authors: Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: present GHOST, visuomotor manipulation policies, GHOST factorizes control, learning visuomotor manipulation, multi-view RGB-D observations

Comments: Accepted at RSS 2026

Abstract:We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
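
The spatial interface described in the abstract (project a predicted 3D goal into the image plane and represent it as an end-effector heatmap) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole camera model, the Gaussian heatmap shape, the `sigma` value, and the function names `project_point` and `goal_heatmap` are all assumptions made here for illustration, and the goal is assumed to be a single 3D point already expressed in the camera frame.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point (camera frame, metres) to pixel coordinates
    with a standard pinhole model. Intrinsics are assumed known."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def goal_heatmap(point_cam, intrinsics, width, height, sigma=2.0):
    """Render an unnormalized Gaussian heatmap centred on the projected goal.
    The Gaussian form and sigma are illustrative choices, not from the paper."""
    u, v = project_point(point_cam, *intrinsics)
    return [
        [
            math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2.0 * sigma ** 2))
            for col in range(width)
        ]
        for row in range(height)
    ]


if __name__ == "__main__":
    # Hypothetical intrinsics (fx, fy, cx, cy) for a tiny 32x24 image.
    intrinsics = (100.0, 100.0, 16.0, 12.0)
    heatmap = goal_heatmap((0.1, -0.05, 1.0), intrinsics, width=32, height=24)
    value, row, col = max(
        (val, r, c) for r, line in enumerate(heatmap) for c, val in enumerate(line)
    )
    print("peak pixel (row, col):", (row, col), "value:", round(value, 3))
```

A heatmap of this kind can be stacked with the RGB channels as an extra input to an image-based policy, which is one plausible way to condition on a 3D goal without changing the policy's input modality.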

107. 【2606.10021】SpineReport: Automated 3D Quantification and Reporting of Lumbar Spine Degeneration on MRI

Link: https://arxiv.org/abs/2606.10021

Authors: Nathan Molinier, Adrian A. Marth, Reto Sutter, Christoph Germann, Jacob A. Connolly, Mathieu Guay-Paquet, Nathan D. Schilaty, Kenneth A. Weber II, Julien Cohen-Adad

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: MRI remains challenging, disability worldwide, remains challenging, Lumbar spine conditions, reliable quantification

Comments: Submitted to Medical Image Analysis

Abstract:Lumbar spine conditions are a leading cause of disability worldwide, yet reliable quantification of degeneration from MRI remains challenging. In clinical practice, analysis is predominantly performed in two dimensions (2D), as manual three-dimensional (3D) assessment is time-consuming. However, 2D measurements suffer from limited reproducibility, particularly when anatomical structures are not aligned with the imaging plane. Existing automated approaches are often restricted to 2D, rely on discrete grading, or lack robustness and interpretability. We introduce SpineReport, an open-source, fully automated framework for comprehensive 3D morphometric analysis of lumbar spine MRI. Leveraging robust anatomical segmentations, the method extracts quantitative metrics from key structures, including the spinal canal, spinal cord, vertebrae, intervertebral discs, and foramina. These include both morphological and signal-based features, enabling cross-subject and longitudinal assessment. SpineReport further generates subject-specific reports that allow comparison with cohort distributions, improving interpretability and objective characterization of spinal morphology. Clinical relevance was evaluated against radiologist-reported severity grades for central canal, lateral recess, and foraminal stenosis. Metrics showed strong associations with central canal stenosis severity, with T2-weighted CSF signal providing the highest performance (AUC = 0.95). Canal AP diameter and area ratios also demonstrated strong correlations and high discriminative ability (AUC 0.80). For lateral recess stenosis, associations were moderate, with lateral CSF signal being the most informative (AUC = 0.73). No significant associations were observed for foraminal stenosis despite robust region-of-interest extraction. SpineReport is released as an open-access tool: this https URL

108. 【2606.10019】Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

Link: https://arxiv.org/abs/2606.10019

Authors: Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: kernel Hilbert space, reproducing kernel Hilbert, Hilbert space, leverages geometric surface, geometric surface structure

Comments: 16 pages, 12 figures

Abstract:We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of 55% in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

109. 【2606.09967】ABot-Earth 0.5: Generative 3D Earth Model

Link: https://arxiv.org/abs/2606.09967

Authors: Ming Qian, Tianjian Ouyang, Mingchao Sun, Zijian Wang, Jincheng Xiong, Jiarong Han, Yongchang Zhang, Jiawei Zhang, Xu Wang, Yu Liu, Luyang Tang, Fei Yu, Zengye Ge, Mengmeng Du, Yuan Liu, Nianfei Fan, Song Wang, Yingliang Peng, Chunxue Jia, Yang Liu, Shiying Zeng, Haozhe Shi, Junnan Lai, Hongyu Pan, Zheng Wu, Ning Guo, Mu Xu, Hang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: geospatially referenced satellite, environments from ubiquitous, geospatially referenced, referenced satellite imagery, Gaussian Splatting

Comments: From Amap-cvlab, Alibaba. Official page: [this https URL](https://abot-earth.amap.com/)

Abstract:We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery at a scalable rate of under 10 minutes per square kilometer, while demonstrating exceptional realism. The framework is designed for accessibility, with integrated hierarchical level-of-detail (LOD) structures that permit real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap, enabling critical downstream Embodied AI applications like closed-loop UAV navigation. By providing an ultra-low-cost and high-efficiency solution, ABot-Earth 0.5 significantly lowers the technical and financial barriers to large-scale 3D reconstruction and empowers the future of global digital earth visualization.

110. 【2606.09946】SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

Link: https://arxiv.org/abs/2606.09946

Authors: Sonu Kumar, Akash Sankhe, Mukul Lokhande, Santosh Kumar Vishvakarma

Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Edge-AI systems increasingly, systems increasingly require, increasingly require real-time, require real-time CNN, real-time CNN inference

Comments: Under review in 12th International Symposium on Smart Electronic Systems (iSES) 2026

Abstract:Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error resilience of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a Secure and Privacy-Aware Approximate CNN Acceleration framework integrated within a heterogeneous RV32IMC RISC-V System-on-Chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, an approximation-aware decision framework is introduced that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). Evaluation across 11 state-of-the-art approximate MAC architectures identifies the Iterative Logarithmic Multiplier (ILM) as the most suitable design, achieving 51.7% area reduction, 81.5% power reduction, and 2.13x throughput improvement compared with an accurate radix-4 Booth MAC, while only reducing ResNet-20/CIFAR-10 accuracy by 2.82 percentage points. FPGA implementation on a Xilinx VC707 platform achieves 58.4 GOPS/W energy efficiency at 250 MHz, while 28-nm CMOS physical implementation validates ASIC feasibility.

111. 【2606.09909】Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization

Link: https://arxiv.org/abs/2606.09909

Authors: Ziang Xu, Wenbo Yu, Hongyao Yu, Hao Fang, Jiawei Kong, Bin Chen, Hao Wu, Shu-Tao Xia, Zhiyong Wu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: prevent malicious content, malicious content forgery, prominent defense strategy, latent, Latent Diffusion Models

Comments: accepted by KDD 2026

Abstract:With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model's ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.

112. 【2606.09901】On the Controllability-Fidelity Frontier in Diffusion Editing

Link: https://arxiv.org/abs/2606.09901

Authors: Yi Hu, Leying Yi, Emily Davis, Finn Carter

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: safety remains challenging, generative models enable, models enable powerful, achieving precise control, Diffusion-based generative models

Comments: Preprint

Abstract:Diffusion-based generative models enable powerful image editing capabilities, but achieving precise control while maintaining fidelity and safety remains challenging. We present a comprehensive theoretical and empirical study of controllable diffusion-based image editing, analyzing the trade-offs between adherence to user intent, preservation of non-target content, and output quality. Our work spans text- and mask-guided edits, point/drag manipulation, and inversion-based pipelines. We derive mathematical formulations of editing objectives and analyze dynamics of noise injection, score guidance, and inversion error. We provide theoretical bounds on reconstruction error, stability under repeated edits, and locality of changes. We propose algorithmic frameworks (with pseudocode) for mask-localized and instruction-guided editing, and present extensive experiments comparing state-of-the-art methods (e.g., TF-ICON, DragFlow, InstructPix2Pix, UltraEdit) on multiple tasks and metrics (FID, identity similarity, CLIP alignment, artifact scores, etc.). Our results reveal key failure modes, such as identity drift, prompt sensitivity, and compositional errors. We also discuss ethical considerations in image editing, including misuse risks, bias, consent, and concept erasure techniques (e.g., MACE, ANT, EraseAnything) as safeguards. We conclude with best practices and future directions for responsible, high-fidelity diffusion-based editing.

113. 【2606.09882】WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

Link: https://arxiv.org/abs/2606.09882

Authors: Chong Liu, Luxuan Fu, Xuyu Feng, Zhen Dong, Bisheng Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: coarse visual mapping, digital twin cities, coarse visual perception, coarse visual, paradigm of digital

Abstract:The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at this https URL.

114. 【2606.09881】Toward Calibrated, Fair, and accurate Deepfake Detection

Link: https://arxiv.org/abs/2606.09881

Authors: Ryan Brown, Chris Russell

Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: detectors show large, show large performance, Deepfake detectors show, large performance gaps, detectors show

Abstract:Deepfake detectors show large performance gaps across demographic groups. Existing fairness approaches require demographic labels, retraining, or sacrifice accuracy. We introduce Face-Fairness (FF), a plug-and-play framework for bias mitigation. Our primary contribution, Face-Feature Tuning (FFT), is the first demographic label-free fairness method demonstrated for deepfake detection: a lightweight calibrator that performs a logit remapping conditioned on frozen face embeddings. We complement FFT with two variants: FF-Max, which maximizes worst-group accuracy when demographics are available, and FF-Discover, which does the same with embedding-discovered groups. Across in-domain and cross-dataset test settings, FF consistently reduces FPR/TPR gaps and improves minimum group accuracy while maintaining (often improving) overall accuracy. The approach is detector-agnostic, adds negligible runtime overhead, and requires no access to identity attributes.

115. 【2606.09871】SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Link: https://arxiv.org/abs/2606.09871

Authors: Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Relative Policy Optimization, Large Language Models, Group Relative Policy, Policy Optimization, Language Models

Abstract:Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
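
The core step named in the abstract, z-normalizing per-segment rewards across the rollout group to get a vector of advantages instead of one scalar, can be sketched in a few lines. This is a minimal illustration rather than the paper's code: it assumes every rollout has the same number of aligned segments, uses the population standard deviation with a small `eps` for stability, and the function names are hypothetical. A scalar baseline that normalizes the summed reward is included for contrast.

```python
from statistics import mean, pstdev


def segment_advantages(rewards, eps=1e-6):
    """rewards[i][k] is the verifiable reward of segment k in rollout i.
    Returns advantages[i][k], z-normalized per segment across the group."""
    n_segments = len(rewards[0])
    if any(len(row) != n_segments for row in rewards):
        raise ValueError("all rollouts must have the same number of segments")
    # One (mean, std) pair per segment, computed over the rollout group.
    stats = [(mean(col), pstdev(col)) for col in zip(*rewards)]
    return [
        [(r - mu) / (sd + eps) for r, (mu, sd) in zip(row, stats)]
        for row in rewards
    ]


def scalar_advantages(rewards, eps=1e-6):
    """Baseline for contrast: one advantage per rollout from the summed reward."""
    totals = [sum(row) for row in rewards]
    mu, sd = mean(totals), pstdev(totals)
    return [(t - mu) / (sd + eps) for t in totals]


if __name__ == "__main__":
    # Four rollouts, two segments each; 1.0 means the segment was verified correct.
    group = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print("per-segment:", segment_advantages(group)[0])
    print("scalar:     ", scalar_advantages(group)[0])
```

In this toy group the first rollout has one correct and one incorrect segment. Its summed reward equals the group mean, so the scalar advantage is zero and neither segment receives a learning signal, whereas the per-segment version assigns a positive advantage to the correct segment and a negative one to the incorrect segment.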

116. 【2606.09855】MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

Link: https://arxiv.org/abs/2606.09855

Authors: Joonhyung Bae

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Korean folk painting, Korean folk, tiger for protection, marital harmony, peony for wealth

Abstract:Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols, a tiger for protection, a pair of birds for marital harmony, a peony for wealth, that recur across many of its painted genres. This suggests an obvious computational approach, identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions, genre, from a style label that does not, era, a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

117. 【2606.09849】Sketch-to-Layout: A Human-Centric Computational Agent for Constraint-Aware Synthesis of Modular Photobioreactors

Link: https://arxiv.org/abs/2606.09849

Authors: Xiujin Liu, Shuqi Li, Yuxin Lin

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Building-integrated photobioreactors, offer a pathway, carbon-neutral architecture, pathway for carbon-neutral, deployment is hindered

Comments: 13 pages, 6 figures

Abstract:Building-integrated photobioreactors (PBRs) offer a pathway for carbon-neutral architecture, yet deployment is hindered by configuration complexity and biological maintenance. This paper presents a modular PBR facade system powered by a computational framework reconciling design intent with physical validity. We introduce 'carbon-neutralization bricks' featuring integrated vessel-and-conduit geometry; monolithic fluid channels enable 'plug-and-play' assembly. To navigate the combinatorial complexity of 14 modular geometries, we develop a Computational Sketch-to-Layout Agent that formulates layout synthesis as a Constraint Satisfaction Problem (CSP). Using the CP-SAT engine, the agent treats sparse user sketches as soft priors while enforcing hard constraints like port alignment and global connectivity. This allows non-experts to synthesize fabrication-ready configurations in near real-time. Furthermore, to facilitate autonomous maintenance, we propose a weakly supervised algae health monitoring pipeline. By employing a hybrid CNN-attention backbone and a temporal ranking loss, the system quantifies biological vitality from photographs without absolute ground-truth labels. Experiments demonstrate the CSP solver achieves a 95.5% success rate on grid scales up to 15 x 15. Qualitative evaluations confirm the framework preserves design semantics while ensuring operational integrity. Long-term tests show the vision module produces health trajectories aligned with 14-day biological cycles, suggesting that integrating interactive synthesis with low-cost computer vision can democratize scalable carbon capture systems.

118. 【2606.09842】Integrated Real-Time Motion Tracking and AI Analysis for Athletic Performance Optimization

Link: https://arxiv.org/abs/2606.09842

Authors: Parth Agrawal, Ronit, Sagar Kumar, Aashish Bhambri

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Applying Human Pose, Human Pose Estimation, real world environments, real world testing, world environments remains

Comments: 6 pages, 10 figures, 2 tables, IC2E3-2026 conference

Abstract:Applying Human Pose Estimation (HPE) in real-world environments remains challenging. This paper surveys real-time HPE approaches and their limitations in sports analysis for individuals, and develops a practical lightweight prototype for real-world testing and use. Tracing the evolution from older marker-based motion capture systems to modern, accessible, and adaptable markerless deep learning approaches, the survey covers the foundational architectures that balance precision and efficiency. We also compare algorithmic frameworks (top-down, bottom-up, one-stage approaches, etc.) on practical deployment metrics such as inference latency, frame rate, mean per-joint position error, and temporal jitter to guide model selection for sports applications. As our main contribution, we propose a modular, lightweight software prototype that uses the MediaPipe HPE framework with exercise-specific logic to deliver real-time insights and AI-based feedback for non-expert users. The prototype derives sports insights and provides feedback with minimal computational resources, and we report its performance and reliability metrics. Finally, we suggest future research directions such as combining sensors and AR/VR. This work serves researchers, engineers, sport scientists, etc., as both a technical resource and a blueprint for implementing a similar or improved real-time HPE analysis system for athletic performance enhancement or other purposes.

119. 【2605.29662】SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

Link: https://arxiv.org/abs/2605.29662

Authors: Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Real-time inference, robotic control, essential for robotic, Real-time, pruning

Abstract:Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.

120. 【2606.11107】Multimodal Brain Tumour Classification Using Feature Fusion

Link: https://arxiv.org/abs/2606.11107

Authors: Wajih ul Islam, Muhammad Yaqoob, Javed Ali Khan, Volker Steuber

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: synthesizing patient symptoms, unified clinical judgement, quantitative imaging data, Clinicians diagnose brain, medical history

Abstract:Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate the clinicians' multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. Both streams are fused via concatenation, gated, or bidirectional cross-modal attention strategies. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.
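
The abstract does not specify how its gated fusion is parameterized, so the following is only a generic sketch of the idea: a learned gate in (0, 1) decides, per feature dimension, how much of the image embedding versus the radiomic embedding passes through. The element-wise (diagonal) gate, the parameter names `w_img`, `w_rad`, `bias`, and the assumption that both streams are already projected to the same dimension are illustrative choices, not details from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(img_feat, rad_feat, w_img, w_rad, bias):
    """Fuse two same-length feature vectors with a per-dimension gate:
        gate  = sigmoid(w_img * img + w_rad * rad + bias)
        fused = gate * img + (1 - gate) * rad
    In a trained network w_img, w_rad, and bias would be learned parameters."""
    lengths = {len(img_feat), len(rad_feat), len(w_img), len(w_rad), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")
    fused = []
    for xi, xr, wi, wr, b in zip(img_feat, rad_feat, w_img, w_rad, bias):
        gate = sigmoid(wi * xi + wr * xr + b)
        fused.append(gate * xi + (1.0 - gate) * xr)
    return fused


if __name__ == "__main__":
    img = [1.0, 2.0]   # stand-in for a CNN image embedding
    rad = [3.0, 4.0]   # stand-in for an MLP radiomic embedding
    zeros = [0.0, 0.0]
    # With zero weights and bias the gate is 0.5, so fusion is a plain average.
    print(gated_fusion(img, rad, zeros, zeros, zeros))
```

Compared with plain concatenation, a gate of this form lets the network down-weight one modality on a per-sample basis, which is a common motivation for gated fusion in multimodal classifiers.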

121. 【2606.10713】++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

Link: https://arxiv.org/abs/2606.10713

Authors: Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: demonstrated continuous success, annotated biomedical data, medical segmentation tasks, demonstrated continuous, continuous success

Comments: 7 pages, 1 figure, 2 tables

Abstract:The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: this https URL

122. 【2606.10280】Overlapped Wavelet Diffusion for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2606.10280

Authors: Fen Peng, Taizo Suzuki, Seisuke Kyochi

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-Light Image Enhancement, Image Enhancement, overlapped wavelet diffusion, achieve blocking artifact-free, Haar Wavelet Transform

Comments: Advance published in IEICE Transactions on Information and Systems. DOI: [https://doi.org/10.1587/transinf.2026PCP0006](https://doi.org/10.1587/transinf.2026PCP0006). Code: [this https URL](https://github.com/FinnPeg/Overlapped-Wavelet-Diffusion)

Abstract:In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.
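As background for the blocking issue this abstract attributes to the Haar WT, here is a minimal sketch of a textbook single-level 2D Haar transform in NumPy. It is not the DiffLL or OWDiff implementation and does not implement the proposed OWT. The function names `haar2d` and `ihaar2d` and the generic subband names are assumptions for illustration. The point it shows is that every coefficient depends on exactly one non-overlapping 2x2 block of pixels.

```python
import numpy as np


def haar2d(x):
    """Single-level orthonormal 2D Haar transform (textbook form).

    Assumes both image dimensions are even. Each output coefficient is
    computed from one non-overlapping 2x2 block of the input.
    """
    x = np.asarray(x, dtype=float)
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # block average (low-frequency band)
    col_diff = (a - b + c - d) / 2.0  # left column minus right column
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return low, col_diff, row_diff, diag


def ihaar2d(low, col_diff, row_diff, diag):
    """Inverse of haar2d; reconstructs the original image exactly."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (low + col_diff + row_diff + diag) / 2.0
    x[0::2, 1::2] = (low - col_diff + row_diff - diag) / 2.0
    x[1::2, 0::2] = (low + col_diff - row_diff - diag) / 2.0
    x[1::2, 1::2] = (low - col_diff - row_diff + diag) / 2.0
    return x


if __name__ == "__main__":
    image = np.random.default_rng(0).random((4, 6))
    bands = haar2d(image)
    print(np.allclose(ihaar2d(*bands), image))
```

Because the blocks do not overlap, an error introduced in one coefficient during restoration stays confined to its own 2x2 block after the inverse transform, which is one way block boundaries can become visible. The abstract's OWT is described as incorporating correlations across neighboring regions to avoid this.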

123. 【2606.10255】POPSICLE: Benchmark Datasets for Segmentation and Localization in CryoET

Link: https://arxiv.org/abs/2606.10255

Authors: Jonathan Schwartz, Utz Heinrich Ermel, C. Braxton Owens, Zhuowen Zhao, Ariana Peck, Gus L.W. Hart, Grant J. Jensen, Bridget Carragher, Dari Kimanius

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)

Keywords: enabling direct visualization, linking molecular architecture, Cryo-electron tomography, cellular biology, cellular organization

Comments:

Abstract:Cryo-electron tomography (cryoET) has emerged as a powerful tool in structural and cellular biology by enabling direct visualization of macromolecular structures within intact cells, thereby linking molecular architecture to cellular organization in a native context. Realizing the full potential of cryoET, however, increasingly depends on advances in computational analysis, particularly machine learning (ML), to interpret its complex and information-rich data. Despite rapid progress, ML development for cryoET remains bottlenecked by the lack of standardized, well-annotated benchmarks. Existing evaluations are typically small, task-specific, and assembled in isolation, limiting robust comparisons across methods. Here, we present POPSICLE, a benchmark suite for cryoET segmentation and macromolecular localization built from the CryoET Data Portal, an open, ML-ready repository of tomographic data, metadata, and annotations. POPSICLE spans eukaryotic and prokaryotic systems, both purified and fully in situ samples, and dense voxel-wise segmentation as well as sparse localization tasks. Built on a living data resource, it can expand as new datasets and annotations become available. Baseline experiments reveal substantial variation in model rankings across tasks, underscoring the need for benchmarks tailored to the unique characteristics of cryoET rather than evaluation practices adapted from adjacent biomedical imaging domains. POPSICLE thus provides an open and extensible foundation for reproducible ML evaluation in cryoET.