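This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.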

Statistics

819 papers were updated today, including:

  • Natural Language Processing: 106
  • Information Retrieval: 19
  • Computer Vision: 160

Natural Language Processing

1. 【2605.13846】WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Link: https://arxiv.org/abs/2605.13846

Authors: Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: endangered Australian indigenous, Australian indigenous language, paper introduces WARDEN, endangered Australian, Australian indigenous

Comments: https://github.com/Ziheng-Zhang-AUS/WARDEN

Abstract: This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language, into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into phonemic transcription, and then the transcription into English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.

2. 【2605.13841】EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Link: https://arxiv.org/abs/2605.13841

Authors: Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: artificial intelligence systems, Voice agents, artificial intelligence, increasingly deployed, conduct spoken conversations

Comments: Work in progress

Abstract:Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
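The distinction between pass@k (peak capability) and pass^k (reliable capability) can be made concrete with a small sketch. The abstract does not say how EVA-Bench estimates these quantities, so the combinatorial estimators below are an assumption borrowed from common practice: given n recorded trials of a scenario with c successes, pass@k is the chance that at least one of k trials drawn without replacement succeeds, and pass^k is the chance that all k succeed. The function names are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts shared by both estimators."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials (drawn from n, c successful) succeeds."""
    _check(n, c, k)
    # comb(n - c, k) counts the draws made up entirely of failures;
    # it is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials (drawn from n, c successful) succeed."""
    _check(n, c, k)
    # comb(c, k) counts the draws made up entirely of successes.
    return comb(c, k) / comb(n, k)


def reliability_gap(n: int, c: int, k: int) -> float:
    """Difference between peak (pass@k) and reliable (pass^k) capability."""
    return pass_at_k(n, c, k) - pass_hat_k(n, c, k)
```

Under these assumptions, a scenario solved in 3 of 5 trials gives pass@2 = 0.9 but pass^2 = 0.3, while at k = 1 both reduce to the plain success rate of 0.6. This is the kind of divergence the abstract's pass@k - pass^k gap summarizes.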

3. 【2605.13839】Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Link: https://arxiv.org/abs/2605.13839

Authors: Wenrui Bao, Huan Wang, Jian Wang, Zhangyang Wang, Kai Wang, Yuzhang Shang

Categories: Computation and Language (cs.CL)

Keywords: exchanging natural-language messages, Multi-agent LLM systems, systems usually collaborate, collaborate by exchanging, exchanging natural-language

Abstract: Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6x, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.

4. 【2605.13829】Negation Neglect: When models fail to learn negations in training

Link: https://arxiv.org/abs/2605.13829

Authors: Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: introduce Negation Neglect, Negation Neglect, claim, documents, Sheeran

Abstract:We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.

5. 【2605.13793】An LLM-Based System for Argument Reconstruction

Link: https://arxiv.org/abs/2605.13793

Authors: Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred

Categories: Computation and Language (cs.CL)

Keywords: human reasoning, claims are supported, fundamental aspect, aspect of human, system

Abstract:Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument reconstruction.

6. 【2605.13772】Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Link: https://arxiv.org/abs/2605.13772

Authors: Tyler Alvarez, Ali Baheri

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: multiple sampled completions, require multiple sampled, Large language models, existing detectors operate, trace level

Abstract:Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.

7. 【2605.13769】Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

Link: https://arxiv.org/abs/2605.13769

Authors: Abdalrahman Wael

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: shared LLaMA-style decoder, tiny-scale pretraining regime, LLaMA-style decoder training, tiny-scale pretraining, shared LLaMA-style

Comments: 10 pages, 6 figures, 8 tables

Abstract:We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.

8. 【2605.13737】Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Link: https://arxiv.org/abs/2605.13737

Authors: Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: omnimodal large language, large language model, language model accepts, textual premise contradicts, large language

Abstract:When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

9. 【2605.13709】Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Link: https://arxiv.org/abs/2605.13709

Authors: Qian Shen (1), Fanghua Cao (1), Min Yao (1), Shlok Gilda (1), Bonnie J. Dorr (1), Walter L. Leite (1) ((1) University of Florida, Gainesville, USA)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, English reading stories, English reading

Comments: 15 pages, 4 figures. Author Two and Author Three contributed equally. Accepted by the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), ACL 2026

Abstract:Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories with children's interests, controllable difficulty and safety.

10. 【2605.13695】RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Link: https://arxiv.org/abs/2605.13695

Authors: Andrea Morandi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: default measurement instrument, barely scrape past, scrape past random, strong instruction-tuned judges, instruction-tuned judges barely

Abstract: LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study → teach → find gaps → simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
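The ablation figures in this abstract can be checked with simple arithmetic. The sketch below uses only numbers quoted above and adds the three reported gains to the single-shot baseline in order; the variable and function names are hypothetical. Reading the intermediate sums as the 74.0% and 77.7% figures is an inference from the numbers coinciding, not something the abstract states explicitly.

```python
# Figures quoted in the abstract (pairwise accuracy on JudgeBench-GPT, in %).
BASELINE = 64.6  # single-shot vanilla prompt
ABLATION_GAINS = [
    ("Teach-to-Learn scaffold", 9.4),
    ("N=10 marginalisation", 3.7),
    ("explicit critique", 0.9),
]


def cumulative_accuracy(baseline, gains):
    """Return (label, running accuracy) after adding each gain in order."""
    running = baseline
    steps = []
    for label, gain in gains:
        # Round to one decimal place, matching the precision in the abstract.
        running = round(running + gain, 1)
        steps.append((label, running))
    return steps


STEPS = cumulative_accuracy(BASELINE, ABLATION_GAINS)
TOTAL_GAIN = round(STEPS[-1][1] - BASELINE, 1)

for label, accuracy in STEPS:
    print(f"after {label}: {accuracy}%")
print(f"total gain: {TOTAL_GAIN} pp")
```

The running totals come out to 74.0, 77.7, and 78.6, so the three ablation steps account for the full 14.0-point gain, and the first two totals match the accuracies quoted for the first candidate and for self-consistency voting.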

11. 【2605.13663】Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas

Link: https://arxiv.org/abs/2605.13663

Authors: Lukas Stähelin, Veronika Solopova, Max Upravitelev, David Kaplan, Ariana Sahitaj, Premtim Sahitaj, Charlott Jakob, Sebastian Möller, Vera Schmitt

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: low annotation agreements, due to noisy, short texts, annotation agreements, social media

Abstract:Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreements. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema. Along three dimensions (model portfolio, schema effects, and prompting strategy) we evaluate the taxonomies as a classification task with the help of four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B). Our results show that fine-tuning is essential, since it transforms weak zero-shot baselines into competitive systems and reveals methodological differences that are hidden using base models. Across schemas, the Qwen models achieve the strongest overall performance, and Phi-4 14B consistently outperforms GPT-4.1-nano. Our hierarchical prompting method (HiPP), which predicts fine-grained techniques before aggregating them, is especially beneficial after fine-tuning and on the more ambiguous, low-agreement taxonomy, while remaining competitive on the simpler schema. The HQP dataset, annotated with the new intent-based labels, provides a richer lens on propaganda's strategic goals and a challenging benchmark for future work on robust, real-world detection.

12. 【2605.13652】Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Link: https://arxiv.org/abs/2605.13652

Authors: Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Pre-training large language, large language models, large language, memory cost, cost of storing

Comments: 9 pages, 5 figures, 2 tables

Abstract:Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.

13. 【2605.13647】FlowCompile: An Optimizing Compiler for Structured LLM Workflows

Link: https://arxiv.org/abs/2605.13647

Authors: Junyan Li, Zhang-Wei Hong, Maohao Shen, Yang Zhang, Chuang Gan

Categories: Computation and Language (cs.CL)

Keywords: solving complex tasks, LLM sub-agents execute, specialized LLM sub-agents, Structured LLM workflows, Structured LLM

Abstract:Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing cost-aware methods largely treat workflow optimization as a routing problem, selecting a configuration at inference time for each query according to the accuracy-latency objective used during training. We argue that structured LLM workflows can also be optimized from a compilation perspective: before deployment, the system can globally explore the workflow design space and construct a reusable set of workflow-level configurations spanning diverse accuracy-latency trade-offs. Drawing inspiration from machine learning compilers, we introduce FlowCompile, a structured LLM workflow compiler that performs compile-time design space exploration to identify a high-quality, reusable trade-off set. FlowCompile decomposes a workflow into sub-agents, profiles each sub-agent under diverse configurations, and composes these measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. It then identifies diverse high-quality configurations in a single compile-time pass, without retraining or online adaptation. Experiments across diverse workflows and challenging benchmarks show that FlowCompile consistently outperforms heuristically optimized workflow configurations and routing-based baselines, delivering up to 6.4x speedup. The compiled configuration set further serves as a reusable optimization artifact, enabling flexible deployment under varying runtime preferences and supporting downstream selection or routing.

14. 【2605.13643】Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Link: https://arxiv.org/abs/2605.13643

Authors: Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye

Categories: Computation and Language (cs.CL)

Keywords: On-policy distillation, On-policy, OPD, feedback, teacher

Abstract: On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-K candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

15. 【2605.13641】Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

Link: https://arxiv.org/abs/2605.13641

Authors: Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen, Jingang Wang, Xunliang Cai

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Complex reinforcement learning, reinforcement learning environments, learning environments frequently, environments frequently employ, frequently employ multi-task

Abstract:Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.

16. 【2605.13624】Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction

Link: https://arxiv.org/abs/2605.13624

Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe

Categories: Computation and Language (cs.CL)

Keywords: Grammatical error correction, Grammatical error, large language models, over-correction issue, large language

Comments: BEA Workshop 2026

Abstract: Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC dataset loading and LLM inference.
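The idea of edit-level majority voting can be illustrated with a minimal sketch. The abstract does not describe how edits are extracted or what vote threshold is used, so everything here is an assumption: whitespace tokenization, token-level edits taken from Python's `difflib.SequenceMatcher`, and a strict majority (more than half of the candidates) to keep an edit. The function names are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_edits(source, candidate):
    """Return token-level edits (start, end, replacement) mapping source to candidate."""
    matcher = SequenceMatcher(a=source, b=candidate, autojunk=False)
    return {
        (i1, i2, tuple(candidate[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def majority_vote_correction(source_text, candidate_texts):
    """Apply only the edits proposed by a strict majority of candidates."""
    source = source_text.split()
    votes = Counter()
    for text in candidate_texts:
        # Each candidate votes at most once for a given edit.
        votes.update(extract_edits(source, text.split()))

    threshold = len(candidate_texts) / 2
    kept = [edit for edit, count in votes.items() if count > threshold]

    corrected = list(source)
    # Apply from right to left so earlier source indices stay valid.
    for start, end, replacement in sorted(kept, reverse=True):
        corrected[start:end] = replacement
    return " ".join(corrected)


if __name__ == "__main__":
    src = "She go to school yesterday ."
    cands = [
        "She went to school yesterday .",
        "She went to the school yesterday .",  # extra, unnecessary insertion
        "She went to school yesterday .",
    ]
    print(majority_vote_correction(src, cands))
```

In this toy input, all three candidates replace "go" with "went", so that edit is kept, while the insertion of "the" appears in only one candidate and is dropped. Discarding edits that most samples do not agree on is how voting at the edit level can curb over-correction.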

17. 【2605.13596】Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

Link: https://arxiv.org/abs/2605.13596

Authors: Kyo Gerrits, Rik van Noord, Ana Guerberof Arenas

Categories: Computation and Language (cs.CL)

Keywords: article investigates, automatic evaluation metrics, translation, evaluation metrics, multiple languages

Comments: This paper has been accepted to the EAMT Conference 2026 in Tilburg on June 15-18 2026

Abstract: This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts errors), and to see if they can substitute for laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out-of-routine translations as errors.

18. 【2605.13595】Inducing Artificial Uncertainty in Language Models

Link: https://arxiv.org/abs/2605.13595

Authors: Sophia Hager, Simon Zeng, Nicholas Andrews

Categories: Computation and Language (cs.CL)

Keywords: uncertainty, artificial uncertainty, safety-critical applications, meaningful probabilities, language models

Abstract:In safety-critical applications, language models should be able to characterize their uncertainty with meaningful probabilities. Many uncertainty quantification approaches require supervised data; however, finding suitable unseen challenging data is increasingly difficult for large language models trained on vast amounts of scraped data. If the model is consistently (and correctly) confident in its predictions, the uncertainty quantification method may consistently overestimate confidence on new and unfamiliar data. Finding data which exhibits enough uncertainty to train supervised uncertainty quantification methods for high-performance models may therefore be challenging, and will increase in difficulty as LLMs saturate datasets. To address this issue, we first introduce the problem of inducing artificial uncertainty in language models, then investigate methods of inducing artificial uncertainty on trivially easy data in the absence of challenging data at training time. We use probes trained to recognize artificial uncertainty on the original model, and find that these probes trained on artificial uncertainty outperform probes trained without artificial uncertainty in recognizing real uncertainty, achieving notably higher calibration on hard data with minimal loss of performance on easy data.

19. 【2605.13542】RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Link: https://arxiv.org/abs/2605.13542

Authors: Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen (Cherise) Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Intensive care units, repeatedly reassess patient, generate long, dense and evolving, time pressure

Abstract: Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory into 30-min windows and release two datasets: RealICU-Gold with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, perform poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias toward early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision support in high-stakes care. Project page: this https URL

20. 【2605.13538】Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

Link: https://arxiv.org/abs/2605.13538

Authors: Anuj Sadani, Deepak Kumar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Personally Identifiable Information, Named Entity Recognition, Personally Identifiable, Identifiable Information, Entity Recognition

Comments: 15 pages

Abstract: Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool, a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
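
The demonstration-selection step described above (a character-range heuristic followed by a per-input MD5 hash) might look roughly like the sketch below. Everything named here is hypothetical: `DEMO_POOLS`, `detect_locale`, `pick_demos`, the three locale buckets, the Unicode ranges, and the demonstration strings are invented for illustration and are not taken from the paper.

```python
import hashlib
import random

# Hypothetical locale-pure demonstration pools ("original -> surrogate").
DEMO_POOLS = {
    "latin": [
        "John Smith -> Mark Ellis",
        "Anna Schmidt -> Lena Fischer",
        "12 Mill Road -> 48 Oak Lane",
        "Marie Dubois -> Claire Martin",
    ],
    "cyrillic": [
        "Иван Петров -> Олег Смирнов",
        "Анна Иванова -> Мария Кузнецова",
        "улица Ленина 5 -> улица Мира 12",
        "Сергей Попов -> Павел Орлов",
    ],
    "kana": [
        "たなか -> すずき",
        "さとう -> やまだ",
        "タナカ -> スズキ",
        "わたなべ -> いとう",
    ],
}


def detect_locale(text):
    """Assumed character-range heuristic: the first matching script wins."""
    for ch in text:
        cp = ord(ch)
        if 0x0400 <= cp <= 0x04FF:  # Cyrillic block
            return "cyrillic"
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
            return "kana"
    return "latin"


def pick_demos(text, k=3):
    """Sample k demonstrations from the locale pool, seeded by the input's MD5."""
    pool = DEMO_POOLS[detect_locale(text)]
    seed = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)
    return random.Random(seed).sample(pool, k)


demos = pick_demos("Иван Петров живёт в Москве")
print(demos)
```

Seeding from the input hash keeps the selection deterministic per input while still rotating demonstrations across inputs, so no single fixed triple is available to echo.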

21. 【2605.13537】Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Link: https://arxiv.org/abs/2605.13537

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: costly reinforcement learning, enabling continual adaptation, reward targets evolve, reinforcement learning, targets evolve

Abstract:Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.

22. 【2605.13532】AI-Generated Slides: Are They Good? Can Students Tell?

Link: https://arxiv.org/abs/2605.13532

Authors: Juho Leinonen, Lisa Zhang, Arto Hellas

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: easily accessible, slides, Claude Code, support instructors, GenAI

Comments: 7 pages, 2 tables. Accepted to Western Canada Conference on Computing Education (WCCCE) 2026

Abstract: As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI-generated slides are "good" via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI-generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.

23. 【2605.13511】Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Link: https://arxiv.org/abs/2605.13511

Authors: Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: adapts large language, large language models, adapts large, parameter updates, large language

Comments: Accepted by ICML 2026

Abstract: In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by these principles, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on geometry with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

24. 【2605.13486】R^2-Mem: Reflective Experience for Memory Search

Link: https://arxiv.org/abs/2605.13486

Authors: Xinyuan Wang, Wenyu Mao, Junkang Wu, Xiang Wang, Xiangnan He

Categories: Computation and Language (cs.CL)

Keywords: heavy memory pre-managed, retrieve fine-grained historical, fine-grained historical information, recently emerged, promising paradigm

Abstract: Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-management. However, existing deep search agents for memory systems repeat past errors because they fail to learn from prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During online inference, the retrieved experience guides future search actions to avoid repeated mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides an RL-free and low-cost solution for self-improving LLM agents.

25. 【2605.13485】Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

Link: https://arxiv.org/abs/2605.13485

Authors: Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)

Keywords: Transformers predict, representation, source, window, finite-context

Comments: 30 pages, 9 figures. Preprint

Abstract:Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.

26. 【2605.13481】PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

Link: https://arxiv.org/abs/2605.13481

Authors: Mikhail Menschikov, Matvey Iskornev, Alexander Kharitonov, Alina Bogdanova, Mikhail Belkin, Ekaterina Lisitsyna, Artyom Sosedka, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Evgeny Burnaev

Categories: Computation and Language (cs.CL)

Keywords: enhance large language, large language model, introduce PersonalAI, designed to enhance, based systems

Abstract: We introduce PersonalAI 2.0 (PAI-2), a novel framework designed to enhance large language model (LLM) based systems through the integration of external knowledge graphs (KGs). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. Central to the PAI-2 design is its ability to perform adaptive, iterative information search guided by extracted entities, matched graph vertices, and generated clue-queries. An evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and DiaASQ) demonstrates improved factual correctness of generated answers compared to analogous methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves a 4% average gain by LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that graph traversal algorithms (e.g., BeamSearch, WaterCircles) yield superior results compared to a standard flattened retriever, by 6% on average, while enabling the search plan enhancement mechanism yields an 18% boost over disabling it, as measured by LLM-as-a-Judge across six datasets. In addition, an ablation study reveals that PAI-2 achieves the SOTA result on the MINE-1 benchmark, with an 89% information-retention score, using LLMs from the 7-14B tier. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications that require scalable, context-aware knowledge representation and reasoning capabilities.

27. 【2605.13473】OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Link: https://arxiv.org/abs/2605.13473

Authors: Chenyu Zhou, Hongpei Li, Yuerou Liu, Jianghao Lin, Dongdong Ge, Yinyu Ye

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: state-space models offer, models offer constant-memory, offer constant-memory alternatives, Linear attention, Delta Rule mitigates

Abstract:Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
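
The equivalence mentioned in this abstract can be written out in a few lines. The notation below (state matrix S, key k, value v, scalar gate β, diagonal preconditioner D = diag(d)) is assumed for illustration and follows the commonly used form of the delta rule; the paper's exact formulation may differ.

```latex
% Delta-rule write: one online gradient step on the inner regression loss
\mathcal{L}_t(S) = \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^2,
\qquad
S_t = S_{t-1} - \beta_t\,(S_{t-1} k_t - v_t)\,k_t^{\top}

% Right-preconditioning by a diagonal D_t = \operatorname{diag}(d_t)
% is the same update with a per-feature rescaled write-side key
S_t = S_{t-1} - \beta_t\,(S_{t-1} k_t - v_t)\,k_t^{\top} D_t
    = S_{t-1} - \beta_t\,(S_{t-1} k_t - v_t)\,(d_t \odot k_t)^{\top}
```

The second equality holds because a diagonal matrix is symmetric, so the row vector k^T D equals (D k)^T, which is the elementwise product of d and k.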

28. 【2605.13467】PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

Link: https://arxiv.org/abs/2605.13467

Authors: Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement Learning, Learning with Verifiable, Verifiable Rewards, traditionally relies, RLVR

Comments: CVPR 2026

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
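
The difference between global and intra-cluster normalization can be illustrated with a small sketch. It assumes that "normalizing" means a per-cluster z-score and that the cluster labels are already given; the function name `decomposed_advantage`, the `eps` constant, and the sample numbers are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def decomposed_advantage(gains, clusters, eps=1e-8):
    """Z-score each step's confidence gain within its own skill cluster."""
    groups = {}
    for gain, cluster in zip(gains, clusters):
        groups.setdefault(cluster, []).append(gain)

    # Per-cluster mean and population standard deviation.
    stats = {c: (mean(v), pstdev(v)) for c, v in groups.items()}

    return [
        (gain - stats[cluster][0]) / (stats[cluster][1] + eps)
        for gain, cluster in zip(gains, clusters)
    ]


# Two sparse perception steps with large gains, four reasoning steps with small gains.
gains = [0.9, 0.7, 0.02, 0.04, 0.03, 0.01]
clusters = ["perception", "perception", "reasoning", "reasoning", "reasoning", "reasoning"]
advantages = decomposed_advantage(gains, clusters)
print(advantages)
```

With one global mean and standard deviation, both perception steps would receive positive advantages simply because their gains are on a larger scale; normalizing within each cluster compares each step only against steps of the same kind.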

29. 【2605.13451】LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

Link: https://arxiv.org/abs/2605.13451

Authors: Adam Remaki, Xavier Tannier, Christel Gérardin

Categories: Computation and Language (cs.CL)

Keywords: UMLS or SNOMED, entity linking maps, linking maps textual, structured knowledge bases, Biomedical entity linking

Comments: 9 pages, 2 figures

Abstract:Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.

30. 【2605.13450】Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Link: https://arxiv.org/abs/2605.13450

Authors: Samuel Schapiro, Alexi Gladstone, Jonah Black, Heng Ji

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: large language models, Divergent Association Task, scientific ideation ability, divergent, Divergent Remote Association

Comments: 36 pages. Extended version of work under review

Abstract:Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.

31. 【2605.13438】Cognifold: Always-On Proactive Memory via Cognitive Folding

Link: https://arxiv.org/abs/2605.13438

Authors: Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Xinliang Zhou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: remains predominantly reactive, autonomously organize experience, Existing agent memory, Existing agent, memory remains predominantly

Abstract:Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce Cognifold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across 7 broad-coverage benchmarks spanning five cognitive domains, we validate that CogniFold simultaneously performs robustly on conventional memory benchmarks.

32. 【2605.13436】Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

Link: https://arxiv.org/abs/2605.13436

Authors: Ruan Visser, Trienko Grobler, Marcel Dunaiski

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Subword regularization methods, BPE dropout, Subword regularization, applying BPE dropout, BPE

Comments: 12 pages, 8 figures, 5 tables

Abstract:Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining.

33. 【2605.13429】TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

Link: https://arxiv.org/abs/2605.13429

Authors: Chong Li, Yingzhuo Deng, Wen Yang, Jiajun Zhang, Chengqing Zong

Categories: Computation and Language (cs.CL)

Keywords: process of Large, Large Language Models, Large Language, LLMs, Large

Comments: Paper under review

Abstract: Tokenization is a foundational step in text processing for Large Language Models (LLMs). Texts must first be tokenized into token IDs, which are then input to LLMs. Inefficient tokenization results in long token-ID sequences and slows down the training and inference of LLMs. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce a method named TokAlign++ to improve vocabulary adaptation performance by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for the new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It takes as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.

34. 【2605.13424】LIFT: Last-Mile Fine-Tuning for Table Explicitation

Link: https://arxiv.org/abs/2605.13424

Authors: Divij Khaitan, Ashish Tiwari

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language model, language model extracts, small language model, unstructured clipboard text, pre-trained large language

Comments: 9 pages, 1 figure, 3 tables

Abstract: We propose last-mile fine-tuning, or Lift, a pipeline in which a pre-trained large language model extracts an initial table from unstructured clipboard text, and a fine-tuned small language model (SLM, 1B-24B parameters) repairs errors in the extracted table. On a benchmark of 2,596 tables from three datasets, Lift matches or exceeds end-to-end SLM fine-tuning on the tree-edit-distance-based similarity (TEDS) metric while requiring as few as 1,000 training examples, where it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. We term this approach last-mile fine-tuning and show that it is also more robust to input format variability. Comparisons with self-debug and end-to-end fine-tuning approaches show that last-mile fine-tuning provides an attractive option when training data is limited or when robustness to input variation is sought without compromising on accuracy.

35. 【2605.13415】Continual Learning with Multilingual Foundation Model

Link: https://arxiv.org/abs/2605.13415

Authors: Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: social media discourse, detecting reclaimed slurs, multilingual social media, media discourse, paper presents

Notes: Final Workshop of the 9th evaluation campaign EVALITA 2026

Abstract:This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling, RUN 2 incorporates masked language modeling pre-training, and RUN 3 and RUN 4 refine the previous predictions with language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at this https URL.
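
To make the idea of language-specific decision thresholds concrete, here is a minimal sketch in Python. It assumes Youden's J statistic (TPR minus FPR) as the ROC-based selection criterion, which is one common choice and not necessarily the one used in the paper. The function name `youden_threshold` and the toy scores are hypothetical, and the sketch assumes each development set contains both classes.

```python
def youden_threshold(scores, labels):
    """Return the score threshold that maximizes TPR - FPR (Youden's J).

    Assumption: labels contain at least one positive and one negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(scores)):
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t

# Toy, made-up confidence scores per language (not data from the paper).
dev_sets = {
    "en": ([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 0, 0]),
    "it": ([0.6, 0.35, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0]),
}
thresholds = {lang: youden_threshold(s, y) for lang, (s, y) in dev_sets.items()}
print(thresholds)
```

Because the thresholds are chosen after training, this kind of refinement needs only held-out scores and labels, which is consistent with the abstract's point that no retraining is required.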

36. 【2605.13412】LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Link: https://arxiv.org/abs/2605.13412

Authors: Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: effectiveness remains underexplored, class definition requires, definition requires subtle, large language models, subtle expert understanding

Notes: Accepted at the 20th Linguistic Annotation Workshop (LAW XX), co-located with ACL 2026 ([this https URL](https://sigann.github.io/LAW-XX-2026/))

Abstract:Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at this https URL

37. 【2605.13411】Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Link: https://arxiv.org/abs/2605.13411

Authors: Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan, Bingyu Yan, Qiwei Ye, Haoliang Li

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: Large language models, Large language, language models remain, models remain vulnerable, elicit harmful outputs

Notes: 48 pages, 7 figures

Abstract:Large language models remain vulnerable to adversarial prompts that elicit harmful outputs. Existing safety paradigms typically couple red-teaming and post-training in a closed, policy-centric loop, causing attack discovery to suffer from rapid saturation and limiting the exposure of novel failure modes, while leaving defenses inefficient, rigid, and difficult to transfer across victim models. To this end, we propose EvoSafety, an LLM safety framework built around persistent, inspectable, and reusable external structures. For red teaming, EvoSafety equips the attack policy with an adversarial skill library, enabling continued vulnerability probing through simple library expansion after saturation, while supporting the evolution of adversarial vectors. For defense learning, EvoSafety replaces model-specific safety fine-tuning with a lightweight auxiliary defense model augmented with memory retrieval. This enables efficient, transferable, and model-agnostic safety improvements, while allowing robustness to be enhanced solely through memory updates. With a single training procedure, the defense policy can operate in both Steer and Guard modes: the former activates the victim model's intrinsic defense mechanisms, while the latter directly filters harmful inputs. Extensive experiments demonstrate the superiority of EvoSafety: in Guard mode, it achieves a 99.61% defense success rate, outperforming Qwen3Guard-8B by 14.13% with only 37.5% of its parameters, while preserving reasoning performance on benign queries. Warning: This paper contains potentially harmful text.

38. 【2605.13408】From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

Link: https://arxiv.org/abs/2605.13408

Authors: Neh Majmudar, Anne Huang, Jinfan Frank Hu, Elena Filatova

Categories: Computation and Language (cs.CL)

Keywords: Rosetta Stone, existing Rosetta Stone, Rosetta Stone puzzles, school linguistics competitions, high school linguistics

Notes: Proceedings of the Fifteenth Language Resources and Evaluation Conference

Abstract:In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.

39. 【2605.13373】Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing

Link: https://arxiv.org/abs/2605.13373

Authors: Daniel Fernández-González, Cristina Outeiriño Cid

Categories: Computation and Language (cs.CL)

Keywords: achieve deep natural, artificial intelligence systems, natural language understanding, deep natural language, syntactic constituent parsing

Notes: Preliminary version

Abstract:To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models are typically initialized with pre-trained encoder-only language models like BERT or RoBERTa. However, the use of pre-trained encoder-decoder language models for constituency parsing has not been thoroughly explored. To bridge this gap, we extend the sequence-to-sequence framework by investigating parsers built on pre-trained encoder-decoder architectures, including BART, mBART, and T5. We fine-tune them to generate linearized parse trees and extensively evaluate them on different linearization strategies across both continuous treebanks and more complex discontinuous benchmarks. Our results demonstrate that our approach outperforms all prior sequence-to-sequence models and performs competitively with leading task-specific constituent parsers on continuous constituent parsing.

40. 【2605.13370】Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

Link: https://arxiv.org/abs/2605.13370

Authors: Sungwoo Goo, Hwi-yeol Yun, Sangkeun Jung

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Neural Turing Machine, Neural Turing, Turing Machine, Unitary Phasor Dynamics, Hierarchical Learnable Anchors

Abstract:For over a decade, explicit memory architectures like the Neural Turing Machine have remained theoretically appealing yet practically intractable for language modeling due to catastrophic gradient instability during Backpropagation Through Time. In this work, we break this stalemate with Phasor Memory Network (PMNet), a novel architecture that structurally resolves memory volatility through Unitary Phasor Dynamics and Hierarchical Learnable Anchors. Rather than relying on brute-force scaling, we present a mechanistic proof-of-concept in a controlled byte-level setting. By constraining recurrent state updates to phase rotations on a complex unit circle, PMNet preserves gradient norms and inherently prevents divergence without the need for specialized initialization. We empirically demonstrate the active actuation of the memory module through a synthetic Copy-Paste task, where PMNet utilizes an expansive 85-slot hierarchical memory tree ($=\sum_{h=1}^{4} 4^{h-1}$) to achieve near 100% exact retrieval across temporal distances that completely exceed the local sliding window attention's receptive field. Furthermore, despite being a compact 119M parameter model trained on 18.8B tokens, PMNet matches the zero-shot long-context robustness of a Mamba model that is three times larger. Our ablation studies and gradient analyses confirm that the historical failure of explicit memory was a structural alignment problem, which PMNet effectively overcomes, providing a theoretically grounded foundation for scalable sequence modeling.
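
The slot count and the norm-preservation claim in this abstract can be checked with a few lines of Python. This is only an illustrative sketch: the branching factor of 4 and the depth of 4 come from the formula in the abstract, while `rotate`, the rotation angles, and the variable names are hypothetical and not taken from the paper's code.

```python
import cmath
import math

# Slots in a 4-level tree with branching factor 4: sum_{h=1}^{4} 4^(h-1)
BRANCHING = 4
DEPTH = 4
slots = sum(BRANCHING ** (h - 1) for h in range(1, DEPTH + 1))  # 1 + 4 + 16 + 64

def rotate(state: complex, theta: float) -> complex:
    """Hypothetical update: multiply by a unit-modulus phasor e^{i*theta}."""
    return state * cmath.exp(1j * theta)

# Repeated phase rotations leave the magnitude (up to rounding) unchanged.
state = cmath.exp(1j * 0.3)  # a point on the complex unit circle
for step in range(1000):
    state = rotate(state, 0.01 * step)

print(slots)
print(math.isclose(abs(state), 1.0, rel_tol=1e-9))
```

Since every update multiplies the state by a number of modulus one, the state can neither blow up nor decay, which is the intuition behind the stability claim; the actual PMNet update rule may differ in detail.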

41. 【2605.13369】Query-Conditioned Test-Time Self-Training for Large Language Models

Link: https://arxiv.org/abs/2605.13369

Authors: Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, Large language, typically deployed, deployed with fixed, improved by allocating

Notes: 17 pages, 4 figures

Abstract:Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem-solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs.

42. 【2605.13368】What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

Link: https://arxiv.org/abs/2605.13368

Authors: Shaomu Tan, Dawei Zhu, Ke Tran, Michael Denkowski, Sony Trenous, Bill Byrne, Leonardo Ribeiro, Felix Hieber

Categories: Computation and Language (cs.CL)

Keywords: multiple inference-time passes, Iterative self-refinement, simple inference-time strategy, inference-time passes, inference-time strategy

Abstract:Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple inference-time passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, a simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal that refinement projects outputs toward the refiner's distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.

43. 【2605.13339】Probing Persona-Dependent Preferences in Language Models

Link: https://arxiv.org/abs/2605.13339

Authors: Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, reliably pick, shaped by post-training, post-training and system

Notes: 41 pages, 45 figures. Code: [this https URL](https://github.com/oscar-gilg/Preferences). Earlier write-up on LessWrong: [this https URL](https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1)

Abstract:Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.

44. 【2605.13334】LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Link: https://arxiv.org/abs/2605.13334

Authors: Rodrigo Nogueira, Thales Sales Almeida, Giovana Kerche Bonás, Andrea Roque, Ramon Pires, Hugo Abonizio, Thiago Laitz, Celio Larcher, Roseval Malaquias Junior, Marcos Piau

Categories: Computation and Language (cs.CL)

Keywords: Frontier assistant LLMs, denying vaccine safety, assistant LLMs ship, Frontier assistant, defending flat-earth cosmology

Abstract:Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.

45. 【2605.13330】FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

Link: https://arxiv.org/abs/2605.13330

Authors: Sarmistha Das, Vaibhav Vishal, Syed Ibrahim Ahmad, Manish Gupta, Sriparna Saha

Categories: Computation and Language (cs.CL)

Keywords: settings demands accurate, existing benchmarks largely, benchmarks largely overlook, demands accurate numerical, multilingual settings demands

Abstract:Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali, Marathi, Gujarati, and Tamil, and comprises 18,900 samples across 14 financial domains. The dataset captures diverse reasoning paradigms under realistic constraints, and is structured across three difficulty levels (easy, moderate, hard) and four question formats: multiple choice, fill-in-the-blank, table matching, and true/false. To address these challenges, we propose FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to promote faithful numerical reasoning, robust multimodal grounding, and structured decision-making. Together, FinVQA and FIND establish a rigorous evaluation and modeling paradigm for high-stakes multilingual multimodal financial reasoning.

46. 【2605.13329】Tracing Persona Vectors Through LLM Pretraining

Link: https://arxiv.org/abs/2605.13329

Authors: Viktor Moskvoretskii, Dominik Glandorf, Jorge Medina Moreira, Tanja Käser, Robert West

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: internally represent high-level, represent high-level behaviors, large language models, language models internally, models internally represent

Notes: Preprint

Abstract:How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.

47. 【2605.13328】What Limits Vision-and-Language Navigation?

Link: https://arxiv.org/abs/2605.13328

Authors: Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied intelligence, cornerstone of embodied, VLN, real-world navigation consistency, enhance real-world navigation

Abstract:Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: this https URL.

48. 【2605.13307】PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

Link: https://arxiv.org/abs/2605.13307

Authors: Hannah Rose Kirk, Liu Leqi, Fanzhi Zeng, Henry Davidson, Bertie Vidgen, Christopher Summerfield, Scott A. Hale

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: standard feature, feature of conversational, conversational AI systems, evaluated in academic, academic research

Abstract:Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.

49. 【2605.13301】Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Link: https://arxiv.org/abs/2605.13301

Authors: Yafu Li, Runzhe Zhan, Haoran Zhang, Shunkai Zhang, Yizhuo Li, Zhilin Wang, Jiacheng Chen, Futing Wang, Xuyang Hu, Yuchen Fan, Bangjie Xu, Yucheng Su, Xinmiao Han, Chenxi Li, Haodi Lei, Yufeng Zhao, Zejin Lin, Qianjia Cheng, Tong Zhu, Xiaoye Qu, Ganqu Cui, Peng Ye, Yun Luo, Zhouchen Lin, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: substantially advanced long-horizon, International Mathematical Olympiad, International Physics Olympiad, Recent progress, advanced long-horizon mathematical

Notes: Technical Report. 77 pages

Abstract:Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.

50. 【2605.13295】CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Link: https://arxiv.org/abs/2605.13295

Authors: Tom Zehle

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: complex real-world tasks, demonstrated strong performance, LLM-based multi-agent systems, predictive modeling, LLM-based multi-agent

Abstract:LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.

51. 【2605.13292】IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Link: https://arxiv.org/abs/2605.13292

Authors: Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: limiting conversational realism, dialogue systems operate, existing medical dialogue, medical dialogue systems, single-turn question

Notes: Accepted in BioNLP @ ACL 2026 Conference

Abstract:Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

52. 【2605.13277】Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.13277

Authors: Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Visual evidence selection, multimodal retrieval-augmented generation, Visual evidence, existing methods typically, methods typically rely

Notes: Accepted to ACL 2026

Abstract:Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
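
As a rough illustration of ranking evidence by information gain on an output distribution, the sketch below scores each candidate by how much it reduces the entropy of a toy answer distribution. Treating information gain as entropy reduction is an assumption made here for simplicity; the paper's exact definition, its latent helpfulness variable, and its surrogate models are not reproduced. All names and probabilities are made up.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def information_gain(prior, posterior):
    """Entropy reduction when moving from the prior to the posterior."""
    return entropy(prior) - entropy(posterior)

# Hypothetical answer distribution without any retrieved evidence.
prior = [0.25, 0.25, 0.25, 0.25]

# Hypothetical answer distributions after conditioning on each image.
candidates = {
    "image_a": [0.7, 0.1, 0.1, 0.1],
    "image_b": [0.25, 0.25, 0.25, 0.25],
    "image_c": [0.97, 0.01, 0.01, 0.01],
}

# Rank evidence from most to least informative.
ranked = sorted(
    candidates,
    key=lambda name: information_gain(prior, candidates[name]),
    reverse=True,
)
print(ranked)
```

Note that an image can be semantically similar to the query yet leave the distribution unchanged, as `image_b` does here, which is the kind of mismatch between relevance and utility the abstract describes.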

53. 【2605.13236】A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

Link: https://arxiv.org/abs/2605.13236

Authors: Rabindra Lamsal, Sisi Zlatanova, Haowen Xu, Yafei Sun, Johnson Xuesong Shen

Categories: Computation and Language (cs.CL)

Keywords: Building Information Modeling, Industry Foundation Classes, Building Information, Information Modeling, Foundation Classes

Comments:

Click to view abstract

Abstract:Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a relational representation for structured element properties and geometry, and a graph representation for topological relationships. These representations are integrated through iterative retry-and-refine LLM reasoning. We implement the framework using an open-weight LLM (GPT OSS 120B), supporting reproducible and deployment-oriented workflows. Evaluation on three IFC models with queries derived from 30 scenarios shows first-attempt accuracy of 93.3%-100%, with all failures recovered using a fallback LLM. The results show that combining complementary representations with iterative reasoning enables more accessible natural language querying of IFC data while supporting routine BIM analysis tasks.

54. 【2605.13217】GAGPO: Generalized Advantage Grouped Policy Optimization

Link: https://arxiv.org/abs/2605.13217

Authors: Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: post-training large language, large language model, language model agents, powerful paradigm, paradigm for post-training

Comments:

Click to view abstract

Abstract:Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
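
To make the "grouped value proxy plus GAE-style advantages" idea concrete, here is a minimal sketch. It uses the standard GAE recursion, but the value proxy is a deliberate simplification: it assumes all rollouts in a group have equal length and takes the group-mean return-to-go at each step. The paper's actual grouping rule is not given in the abstract, so `grouped_value_proxy` and the toy rewards are hypothetical.

```python
def grouped_value_proxy(group_rewards, gamma=1.0):
    """Mean discounted return-to-go per step over equal-length rollouts (assumed proxy)."""
    horizon = len(group_rewards[0])
    returns = []
    for rewards in group_rewards:
        running, rtg = 0.0, [0.0] * horizon
        for t in reversed(range(horizon)):
            running = rewards[t] + gamma * running
            rtg[t] = running
        returns.append(rtg)
    return [sum(r[t] for r in returns) / len(returns) for t in range(horizon)]

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard GAE: A_t = delta_t + gamma * lam * A_{t+1}, with V = 0 after the last step."""
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Two rollouts of the same task: one succeeds at the final step, one fails.
group = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
values = grouped_value_proxy(group)
adv_success = gae(group[0], values)
adv_failure = gae(group[1], values)
```

With only a sparse terminal reward, the recursion still assigns a non-zero, decaying advantage to earlier steps, which is the backward propagation of outcome supervision the abstract refers to. No learned critic is involved.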

55. 【2605.13167】GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

Link: https://arxiv.org/abs/2605.13167

Authors: Jinwoong Kim, Rui Yang, Huishuai Zhang

Categories: Computation and Language (cs.CL)

Keywords: ground informal natural-language, informal natural-language plane, natural-language plane geometry, large language models, plane geometry problems

Comments:

Click to view abstract

Abstract:We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagramming as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.

56. 【2605.13165】STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Link: https://arxiv.org/abs/2605.13165

Authors: Chenjun Xu, Zhennan Zhou, Zhan Su, Bill Howe, Lucy Lu Wang, Bingbing Wen

Categories: Computation and Language (cs.CL)

Keywords: increases inference cost, Long CoT, generate low-yield reasoning, Long, multi-step problems

Comments: 20 pages, 6 figures, 6 tables. Code available at: [this https URL](https://github.com/chenjux/ECN-STOP)

Click to view abstract

Abstract:Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
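
The ECN rule described above (keep the shortest prefix ending at the earliest node that is both an answering conclusion and correct) can be sketched in a few lines. This assumes the trace has already been segmented and annotated; the node fields `is_conclusion` and `answer`, the fallback of keeping the full trace when no node qualifies, and the example trace are hypothetical choices, not details taken from the paper or its repository.

```python
def earliest_correct_prefix(nodes, gold_answer):
    """Return the shortest prefix ending at the first conclusion node with the correct answer.

    nodes: list of dicts with keys 'text', 'is_conclusion', 'answer' (assumed schema).
    If no node qualifies, the trace is returned unchanged (an assumption of this sketch).
    """
    for index, node in enumerate(nodes):
        if node["is_conclusion"] and node["answer"] == gold_answer:
            return nodes[: index + 1]
    return nodes

trace = [
    {"text": "Set up the equation.", "is_conclusion": False, "answer": None},
    {"text": "So the answer is 12.", "is_conclusion": True, "answer": "12"},
    {"text": "Wait, I dropped a term; recompute.", "is_conclusion": False, "answer": None},
    {"text": "The answer is 14.", "is_conclusion": True, "answer": "14"},
    {"text": "Double-checking once more: still 14.", "is_conclusion": True, "answer": "14"},
]
pruned = earliest_correct_prefix(trace, gold_answer="14")
```

Here the wrong early conclusion and the correction are kept, while the redundant re-verification after the first correct conclusion is dropped.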

57. 【2605.13149】AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

Link: https://arxiv.org/abs/2605.13149

Authors: Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: developing capable, remains a critical, critical bottleneck, bottleneck in developing, Data quality remains

Comments:

Click to view abstract

Abstract:Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top-quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.

58. 【2605.13136】GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

Link: https://arxiv.org/abs/2605.13136

Authors: Kasidit Sermsri, Teerapong Panboonyuen

Categories: Computation and Language (cs.CL)

Keywords: Distilling multi-step reasoning, Distilling multi-step, static teacher-student interactions, multi-step reasoning abilities, remains challenging due

Comments: 16 pages

Click to view abstract

Abstract:Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.

59. 【2605.13087】Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Link: https://arxiv.org/abs/2605.13087

Authors: Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon, Arghya Bhattacharya, Kumarmanas Nethil

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Fine-tuning multilingual ASR, multilingual ASR models, spontaneous audio performance, degrades spontaneous audio, multilingual ASR

Comments: Submitted to Interspeech 2026

Click to view abstract

Abstract:Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.

60. 【2605.13084】Does language matter for spoken word classification? A multilingual generative meta-learning approach

Link: https://arxiv.org/abs/2605.13084

Authors: Batsirayi Mupamhi Ziki, Louise Beyers, Ruan van der Merwe

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: spoken word classification, word classification, spoken word, few-shot monolingual spoken, multilingual spoken word

Comments:

Click to view abstract

Abstract:Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in application, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences in performance between models are unexpectedly small. We also find that the number of hours of unique data seen during training seems to be a stronger performance indicator than the number of languages included in the training data.

61. 【2605.13076】TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

Link: https://arxiv.org/abs/2605.13076

Authors: Yoshio Kato, Shuhei Tarashima

Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Software Engineering (cs.SE)

Keywords: attracted significant attention, attracted significant, significant attention, attention for integration, integration with external

Comments: Main paper (8 pages). Accepted at the International Joint Conference on Neural Networks (IJCNN 2026)

Click to view abstract

Abstract:The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with advanced decoding strategies, resulting in outputs that are not only grammatically valid but also semantically accurate.

62. 【2605.13075】Scaling few-shot spoken word classification with generative meta-continual learning

Link: https://arxiv.org/abs/2605.13075

Authors: Louise Beyers, Batsirayi Mupamhi Ziki, Ruan van der Merwe

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Few-shot spoken word, spoken word classification, larger-scale few-shot spoken, classification remains untapped, word classification remains

Comments:

Click to view abstract

Abstract:Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained on less than half of the data for two orders of magnitude less time.

63. 【2605.13055】The Cost of Perfect English: Pragmatic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI

Link: https://arxiv.org/abs/2605.13055

Authors: Ao Liu, Shanhua Zhu

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: language learning offers, integration of Generative, writers powerful tools, learning offers, powerful tools

Comments: 16 pages, 2 figures

Click to view abstract

Abstract:The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating "pragmatic flattening" - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of argumentative essays by Chinese B2-level university students from the ICNALE corpus. The original texts were polished via the APIs of four leading Large Language Models at a zero-temperature setting for reproducibility. Findings reveal a nuanced "dimensional divergence" within the Semantic Preservation Paradox. While models corrected lexicogrammatical errors and retained propositional meaning, sociopragmatic interventions were bifurcated. In the interactive dimension, all models showed a drastic collapse of dialogic engagement markers, turning negotiated discourse into monologic assertions. Conversely, in the epistemic stance dimension, models showed architecture-based variability: some aggressively scrubbed epistemic markers, while others reinforced tentative hedging as decontextualized algorithmic caution. This confirms that while GenAI enhances accuracy, it systematically overwrites L2 writers' unique rhetorical identities into a homogenized Anglo-American paradigm. We argue that future instruction must move beyond error correction, advocating for Critical AI Literacy to empower multilingual writers to use GenAI for linguistic enhancement while safeguarding sociopragmatic diversity and rhetorical agency.

64. 【2605.13052】RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search

Link: https://arxiv.org/abs/2605.13052

Authors: Tingyu Chen, Wenkai Zhang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: remains challenging due, highly varied lifespans, intent remains challenging, commercial web search, Large Language Models

Comments: Accepted at SIGIR 2026. Final version: [this https URL](https://doi.org/10.1145/3805712.3808457)

Click to view abstract

Abstract:In commercial web search, aligning content freshness with user intent remains challenging due to the highly varied lifespans of information. Traditional industrial approaches rely on static time-window filtering, resulting in "one-size-fits-all" rankings where content may be chronologically recent but semantically expired. To address the limitation, we present a novel Large Language Models (LLMs)-based Query-Aware Dynamic Content Expiration Prediction Framework deployed in Baidu search, reformulating timeliness as a dynamic validity inference task. Our framework extracts fine-grained temporal contexts from documents and leverages LLMs to deduce a query-specific "validity horizon"-a semantic boundary defining when information becomes obsolete based on user intent. Integrated with robust hallucination mitigation strategies to ensure reliability, our approach has been evaluated through offline and online A/B testing on live production traffic. Results demonstrate significant improvements in search freshness and user experience metrics, validating the effectiveness of LLM-driven reasoning for solving semantic expiration at an industrial scale.

65. 【2605.13050】Context Training with Active Information Seeking

Link: https://arxiv.org/abs/2605.13050

Authors: Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc'Aurelio Ranzato

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: requires newly produced, existing large language, task requires newly, newly produced information, large language models

Comments: Preprint

Click to view abstract

Abstract:Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.

66. 【2605.13045】Large Language Models Lack Temporal Awareness of Medical Knowledge

Link: https://arxiv.org/abs/2605.13045

Authors: Zihan Guan, Qiao Jin, Guangzhi Xiong, Fangyuan Chen, Mengxuan Hu, Qingyu Chen, Yifan Peng, Zhiyong Lu, Anil Vullikanti

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, medical knowledge, knowledge, Language Models

Comments: 35 pages, 18 figures

Click to view abstract

Abstract:The existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, while in reality, medical knowledge is inherently dynamic and continuously evolves as new evidence emerges and treatments are approved. Consequently, evaluating medical knowledge without a temporal context may provide an incomplete assessment of whether LLMs can accurately reason about time-specific medical knowledge. Moreover, most medical data are historical, requiring the models not only to recall the correct knowledge, but also to know when that knowledge is correct. To bridge the gap, we built TempoMed-Bench, the first-of-its-kind benchmark for evaluating the temporal awareness of the LLMs in the medical domain through evolving guideline knowledge. Based on the TempoMed-Bench, our evaluation analysis first reveals that LLMs lack temporal awareness in medical knowledge through the key findings: (1) model performance on up-to-date medical knowledge exhibits a gradual linear decline over time rather than a sharp knowledge-cutoff behavior, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more with recalling outdated historical medical knowledge than with up-to-date recommendations: accuracy of historical knowledge is only 25.37%-53.89% of up-to-date knowledge, indicating potential knowledge forgetting effects during training; and (3) LLMs often exhibit temporally inconsistent behaviors, where predictions fluctuate irregularly across neighboring years. We also show that the temporal awareness problem is a challenge that cannot be easily solved when integrated with agentic search tools (-3.15%-14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that can better encode time-specific medical knowledge.

67. 【2605.13043】Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

Link: https://arxiv.org/abs/2605.13043

Authors: Yejin Lee, Yo-Sub Han

Categories: Computation and Language (cs.CL)

Keywords: autoregressive language models, provide a promising, promising alternative, alternative to autoregressive, generating text

Comments: 17 pages, 3 figures

Click to view abstract

Abstract:Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there are a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on the step-wise intervention during the denoising process, which then improves the safety without compromising the output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug-and-play module, our method circumvents the need for additional fine-tuning and can be directly incorporated into off-the-shelf diffusion models. The experimental results show that our approaches reduce jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step-wise intervention for safe diffusion language model generation. Our code is available at this https URL.

68. 【2605.13026】Understanding and Accelerating the Training of Masked Diffusion Language Models

Link: https://arxiv.org/abs/2605.13026

Authors: Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Masked diffusion models, Masked diffusion, promising alternative, alternative to autoregressive, diffusion models

Comments: Preprint

Click to view abstract

Abstract:Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
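
The abstract does not say which bell-shaped distribution is used for time sampling, so the following is only a sketch of the general idea: draw the masking time from a distribution concentrated around the middle of the interval instead of uniformly. The symmetric Beta(alpha, alpha) choice, the value alpha = 3, and the helper names are assumptions for illustration, not the paper's recipe.

```python
import random

def sample_mask_time(rng, alpha=3.0):
    """Draw t in [0, 1] from a symmetric Beta(alpha, alpha); alpha > 1 gives a bell shape."""
    return rng.betavariate(alpha, alpha)

def mask_tokens(tokens, t, rng, mask_token="[MASK]"):
    """Independently replace each token with mask_token with probability t."""
    return [mask_token if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
times = [sample_mask_time(rng) for _ in range(10_000)]
# Fraction of draws in the central half of the interval; uniform sampling would give about 0.5.
central_fraction = sum(0.25 < t < 0.75 for t in times) / len(times)

example = mask_tokens("the cat sat on the mat".split(), t=times[0], rng=rng)
```

Compared with uniform sampling, such a sampler spends fewer training steps at nearly-all-masked or nearly-unmasked inputs and more at intermediate masking ratios.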

69. 【2605.12987】Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

Link: https://arxiv.org/abs/2605.12987

Authors: Guangzeng Han, James G. Murphy, Benjamin O. Ladd, Xiaolei Huang, Brian Borsari

Categories: Computation and Language (cs.CL)

Keywords: Coding Motivational Interviewing, Motivational Interviewing, requires substantial time, Coding Motivational, understanding client behaviors

Comments: DOI: [https://doi.org/10.1093/milmed/usag224](https://doi.org/10.1093/milmed/usag224)

Click to view abstract

Abstract:BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
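
The aggregation step in the METHODS above (four prompts, three samples each, majority vote over the 12 trajectories) can be sketched as follows. The label names, the prompt keys, and the example predictions are made up for illustration, and the tie-breaking behaviour (first label encountered wins) is a property of this sketch rather than something stated in the abstract.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label and the full vote counts.

    Ties are broken by first occurrence, since Counter.most_common keeps
    insertion order among equal counts.
    """
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label, counts

# Hypothetical outputs for one utterance: 4 prompt types x 3 stochastic samples.
trajectories = {
    "analytic": ["change_talk", "change_talk", "neutral"],
    "prosody_aware": ["change_talk", "sustain_talk", "change_talk"],
    "evidence_scoring": ["neutral", "change_talk", "change_talk"],
    "comparative": ["sustain_talk", "change_talk", "neutral"],
}
all_predictions = [label for samples in trajectories.values() for label in samples]
final_label, vote_counts = majority_vote(all_predictions)
```

Pooling all 12 votes, rather than voting within each prompt first, lets a label that is consistently second-best under one prompt still be outvoted by agreement across the others.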

70. 【2605.12970】Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving

Link: https://arxiv.org/abs/2605.12970

Authors: Linas Nasvytis, Judith E. Fan

Categories: Computation and Language (cs.CL)

Keywords: require a flash, problems, Abstract, solve, flash of insight

Comments:

Click to view abstract

Abstract:Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to talk aloud as they attempted to solve a sequence of five "matchstick-arithmetic" problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). We found that Same participants improved more rapidly than Different participants, and as they improved, they talked more and talked about different things when solving later problems. Specifically, they were more likely to spontaneously categorize the problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.

71. 【2605.12968】Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

Link: https://arxiv.org/abs/2605.12968

Authors: Hisashi Miyashita, Mgnite Inc

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: internally encode ontological, encode ontological relations, Liskov Substitution Principle, models internally encode, Algebraic Ontology Projection

Comments:

Click to view abstract

Abstract:Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families -- with no model tuning, but through prompt alone. This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse -- a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.

72. 【2605.12960】DiM3: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Link: https://arxiv.org/abs/2605.12960

Authors: Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze

Categories: Computation and Language (cs.CL)

Keywords: typically requires expensive, languages typically requires, multimodal data construction, requires expensive multilingual, existing multimodal model

Comments:

Click to view abstract

Abstract:Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on this https URL.

73. 【2605.12944】From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

Link: https://arxiv.org/abs/2605.12944

Authors: Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Supervised fine-tuning, commonly formulated, formulated as instance, effective SFT training, SFT training subsets

Comments:

Click to view abstract

Abstract:Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-$k$, and single-operator selectors. Additional out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at this https URL.

74. 【2605.12933】ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

Link: https://arxiv.org/abs/2605.12933

Authors: Shohei Higashiyama, Hiroki Ouchi, Atsushi Fujita, Masao Utiyama

Categories: Computation and Language (cs.CL)

Keywords: textual data rich, tourism management, textual data, data rich, valuable source

Comments:

Click to view abstract

Abstract:Geographic text, or textual data rich in geographic (geo-) information, is a valuable source for various geographic applications, e.g., tourism management. Making such information accessible to speakers of other languages further enhances its utility; thus, accurate machine translation (MT) is essential for equity in multilingual geo-information access. To facilitate in-depth analysis of geographic text, we introduce ATD-Trans, a geographically grounded Japanese--English travelogue translation dataset, which enables evaluation of MT quality at both the overall and geo-entity levels across domestic (within Japan) and overseas regions. Our experiments on existing language models examine two factors: model language focus and geographic regions. The results highlight advantages of Japanese-enhanced models and greater difficulty in translating domestic-region geo-entities mentioned in travel blogs.

75. 【2605.12922】When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Link: https://arxiv.org/abs/2605.12922

Authors: Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, long multi-turn interactions, Large language, follow complex instructions, follow complex

Comments:

Click to view abstract

Abstract:Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
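
One way to picture the Goal Accessibility Ratio is as the share of attention mass that generated tokens place on goal-defining positions. The sketch below rests on that reading; the abstract does not give the exact normalization, so the function name, the row-wise averaging, and the toy attention weights are assumptions rather than the paper's definition.

```python
def goal_accessibility_ratio(attention_rows, goal_positions):
    """Average share of attention that generated tokens place on goal tokens.

    attention_rows: one list of non-negative attention weights per generated
                    token, spanning the context positions.
    goal_positions: indices of the goal-defining tokens in the context.
    """
    goal = set(goal_positions)
    shares = []
    for row in attention_rows:
        total = sum(row)
        if total == 0:
            continue  # skip rows that carry no attention mass
        on_goal = sum(w for i, w in enumerate(row) if i in goal)
        shares.append(on_goal / total)
    return sum(shares) / len(shares) if shares else 0.0


# Toy context of 3 positions; positions 0 and 1 hold the instructions.
early_turn = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
late_turn = [[0.05, 0.05, 0.9], [0.1, 0.0, 0.9]]
gar_early = goal_accessibility_ratio(early_turn, [0, 1])
gar_late = goal_accessibility_ratio(late_turn, [0, 1])
print(gar_early, gar_late)
```

Under this reading, a falling value across turns corresponds to the attention channel to the instructions closing, which is the situation the residual-stream probes are then used to examine.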

76. 【2605.12920】Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

Link: https://arxiv.org/abs/2605.12920

Authors: Vardhan Dongre, Dilek Hakkani-Tür

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Effective collaboration, demands communication grounded, agent evolving understanding, shared environment, evolving understanding

Comments:

Click to view abstract

Abstract:Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts by 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.

77. 【2605.12918】CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Link: https://arxiv.org/abs/2605.12918

Authors: Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, entity-based commonsense reasoning, require entity-based commonsense

Comments:

Click to view abstract

Abstract:To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit assessment of the model's ability in abductive reasoning about causes and effects and generating explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.

78. 【2605.12898】When Do LLMs Generate Realistic Social Networks? A Multi-Dimensional Study of Culture, Language, Scale, and Method

Link: https://arxiv.org/abs/2605.12898

Authors: Sai Hemanth Kilaru, Sriram Theerdh Manikyala, Raghav Upadhyay, Sri Sai Kumar Ramavath, Srivika Nunavathu, Dalal Alharthi

Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large language models, including synthetic social, Large language, behavioral simulations, including synthetic

Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as substitutes for human subjects in behavioral simulations, including synthetic social network generation. Yet it remains unclear how their relational outputs depend on prompt design, cultural framing, prompt language, and model scale. Building on homophily theory and structural balance theory, we formalize four LLM-based tie-formation mechanisms: sequential, global, local, and iterative, and treat them as distinct conditional distributions over edge sets. Using a fixed roster of 50 demographically grounded personas, we generate 192 verified directed networks across four cultural contexts, four prompt languages, three GPT-4.1 variants, and four prompting architectures, with two seeds per condition. We find that cultural framing shifts inbreeding homophily and largest-component connectivity. Political affiliation dominates tie formation under three methods, while the global method substitutes age, showing that prompt architecture functions as a substantive sociological variable. Model scale produces a stable divergence ranking, with the smallest variant behaving qualitatively differently rather than merely noisily. Prompt language alone sharply shifts religion homophily, especially under Hindi prompting, while leaving political homophily nearly invariant. LLM-generated networks match real social graphs on clustering and modularity better than standard graph baselines, yet encode demographic biases above empirical levels. These results show that prompt choices often treated as implementation details encode substantive sociological assumptions.

79. 【2605.12894】Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Link: https://arxiv.org/abs/2605.12894

Authors: Harshita Chopra, Kshitish Ghate, Aylin Caliskan, Tadayoshi Kohno, Chirag Shah, Natasha Jaques

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Language Model, variety of people, share information

Comments: Preprint under review

Click to view abstract

Abstract:Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across $\tau^2$-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.

80. 【2605.12882】CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Link: https://arxiv.org/abs/2605.12882

Authors: Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Multimodal Large, Large Language Models, supporting evidence unchecked, advanced document understanding

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at this https URL.
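
The joint scoring rule behind Strict Attributed Accuracy can be illustrated with a short sketch. The abstract only states that a prediction is credited when both the answer and the cited region are correct; the exact-match answer check, the single box per item, the IoU threshold of 0.5, and all names below are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def strict_attributed_accuracy(predictions, references, iou_threshold=0.5):
    """Percentage of items whose answer AND cited box are both correct."""
    if not references:
        raise ValueError("need at least one reference item")
    hits = 0
    for pred, ref in zip(predictions, references):
        answer_ok = pred["answer"].strip().lower() == ref["answer"].strip().lower()
        citation_ok = iou(pred["box"], ref["box"]) >= iou_threshold
        hits += int(answer_ok and citation_ok)
    return 100.0 * hits / len(references)


# Hypothetical items: the second has the right answer but cites the wrong region.
references = [
    {"answer": "2019", "box": (10, 10, 50, 50)},
    {"answer": "4.2%", "box": (10, 10, 50, 50)},
    {"answer": "Berlin", "box": (0, 0, 20, 20)},
]
predictions = [
    {"answer": "2019", "box": (10, 10, 50, 50)},
    {"answer": "4.2%", "box": (200, 200, 240, 240)},
    {"answer": "Paris", "box": (0, 0, 20, 20)},
]
saa = strict_attributed_accuracy(predictions, references)
print(round(saa, 2))
```

In this toy set, answer-only scoring would credit two of three items, while the strict rule credits only one, which is the gap the benchmark is designed to expose.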

81. 【2605.12850】Persona-Model Collapse in Emergent Misalignment

Link: https://arxiv.org/abs/2605.12850

Authors: Davi Bastos Costa, Renato Vicente

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: broadly misaligned behavior, Fine-tuning large language, large language models, harmful content produces, content produces broadly

Comments: 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission

Click to view abstract

Abstract:Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average $55\%$ increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average $65\%$ decrease in R, equivalent to a $304\%$ increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.

82. 【2605.12824】Mechanism Plausibility in Generative Agent-Based Modeling

Link: https://arxiv.org/abs/2605.12824

Authors: Patrick Zhao, David Huu Pham, Nicholas Vincent

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: explicitly programmed rules, generate high-level diverse, Large language models, high-level diverse phenomena, Large language

Comments: Accepted at ACM FAccT 2026

Click to view abstract

Abstract:Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recently, research has aimed to test whether they are capable of generating different phenomena of interest, for example, human behavior on social media platforms or performance in game-theoretic scenarios. However, capability, prediction, and explanation are different -- drawing from the philosophy of science and mechanisms literature, *explanation* requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment, or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

Related DOI: https://doi.org/10.1145/3805689.3812388

83. 【2605.12817】Training Large Language Models to Predict Clinical Events

Link: https://arxiv.org/abs/2605.12817

Authors: Benjamin Turtel, Paul Wilczewski, Kris Skotheim

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: prediction remains challenging, evolve over time, remains challenging, extend Foresight Learning, clinical prediction remains

Comments:

Click to view abstract

Abstract:Longitudinal clinical notes contain rich evidence of how patients evolve over time, but converting this signal into training supervision for clinical prediction remains challenging. We extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This process yields 6,900 prediction examples from 702 admissions across medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples improves over the prompted base model, reducing expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145, while slightly outperforming GPT-5 point estimates on held-out questions. The approach enables reusable clinical prediction supervision from longitudinal notes without hand-engineered structured features or endpoint-specific classifiers.
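
For readers unfamiliar with the two calibration metrics quoted above, the sketch below computes a Brier score and an expected calibration error (ECE) for binary forecasts. The formulas are the standard ones, but the equal-width 10-bin scheme and the toy forecasts are assumptions; the abstract does not say which binning the authors used.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted gap between mean forecast and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(mean_p - freq)
    return ece


# Hypothetical forecasts for four yes/no questions about future events.
probs = [0.9, 0.9, 0.1, 0.1]
outcomes = [1, 1, 0, 0]
brier = brier_score(probs, outcomes)
ece = expected_calibration_error(probs, outcomes)
print(round(brier, 4), round(ece, 4))
```

Lower is better for both metrics: the Brier score rewards forecasts that are both sharp and correct, while ECE only measures whether stated probabilities match observed frequencies.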

84. 【2605.12814】Linking Extreme Discourse to Structural Polarization in Signed Interaction Networks

Link: https://arxiv.org/abs/2605.12814

Authors: Zhijin Guo, Li Zhang, Tyler Bonnet, Janet B. Pierrehumbert, Xiaowen Dong

Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)

Keywords: unified measurement pipeline, online communities, views are rarely, unified measurement, measurement pipeline

Comments:

Click to view abstract

Abstract:Polarization in online communities is often studied through either language or interaction structure, but the two views are rarely connected in a unified measurement pipeline. Prior work links them by building interaction graphs from human judgments of agreement and disagreement, leaving a gap between language as observed text and structure as an engineered representation of that text. We address this gap with a language-grounded signed-network pipeline that derives continuous signed edge weights from LLM stance scores and quantifies structural polarization using two complementary measures: a spectral Eigen-Sign score and a partition-based frustration score. After normalization, the two measures show substantial agreement while retaining important differences in their sensitivity to edge magnitude. Applying the framework to Reddit Brexit discussions, we analyze how window-level discourse signals, including toxicity, extreme scalar claims, and perplexity, relate to temporal variation in structural polarization. Edge-level and ablation analyses show that continuous, confidence-weighted signed edges reveal intensity-sensitive patterns that are muted under sign-only representations. We further report an exploratory one-step-ahead forecasting analysis suggesting that lagged language signals may contain information about future polarization beyond structural persistence. Together, the results demonstrate how discourse and signed-network structure can be connected in a single framework for measuring and interpreting polarization dynamics over time.
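
A partition-based frustration score can be sketched for a tiny signed graph as follows. This is an illustrative reading, not the paper's pipeline: the normalization by total absolute weight, the restriction to two camps, the brute-force search, and the example edges are all assumptions.

```python
from itertools import product


def frustration(edges, camp):
    """Share of absolute edge weight that contradicts a two-camp split.

    An edge is frustrated if it is positive across camps or negative
    within a camp.
    """
    total = sum(abs(w) for _, _, w in edges)
    if total == 0:
        return 0.0
    frustrated = 0.0
    for u, v, w in edges:
        same_camp = camp[u] == camp[v]
        if (w > 0 and not same_camp) or (w < 0 and same_camp):
            frustrated += abs(w)
    return frustrated / total


def min_frustration(nodes, edges):
    """Brute-force the best two-camp split (only feasible for tiny graphs)."""
    best = None
    for bits in product([0, 1], repeat=len(nodes)):
        camp = dict(zip(nodes, bits))
        score = frustration(edges, camp)
        if best is None or score < best[0]:
            best = (score, camp)
    return best


# Hypothetical stance-derived edges: (user, user, signed weight).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("c", "d", 0.8), ("a", "c", -0.7),
         ("b", "d", -0.6), ("a", "d", 0.2)]
score, camp = min_frustration(nodes, edges)

# Sign-only variant of the same graph, scored on the same split.
sign_only = [(u, v, 1.0 if w > 0 else -1.0) for u, v, w in edges]
sign_score = frustration(sign_only, camp)
print(round(score, 4), round(sign_score, 4))
```

The weak positive edge between `a` and `d` is the only frustrated one. It contributes little under continuous weights but counts as one edge in five under sign-only weights, a small instance of the intensity sensitivity the abstract mentions.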

85. 【2605.12813】REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Link: https://arxiv.org/abs/2605.12813

Authors: Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: elicit such failures, adversarial prompts, realistic adversarial prompts, coherent adversarial prompts, Large language models

Comments: Accepted at ICML 2026. Code is available at [this https URL](https://github.com/Buyun-Liang/REALISTA)

Click to view abstract

Abstract:Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at this https URL.

86. 【2605.12798】Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

Link: https://arxiv.org/abs/2605.12798

Authors: Baris Askin, Muhammed Ustaomeroglu, Anupam Nayak, Gauri Joshi, Guannan Qu, Carlee Joe-Wong

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: models exhibit misaligned, exhibit misaligned behavior, induce Emergent Misalignment, narrow harmful datasets, Emergent Misalignment

Comments:

Click to view abstract

Abstract:Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model. The training pipeline itself also matters: pretraining composition shapes later misalignment. We further study Subliminal Learning (SL), where misalignment is transmitted by fine-tuning on seemingly benign data generated by a harmful teacher. Moving beyond the standard SFT setting, we for the first time compare this transfer under off-policy and on-policy distillation as well, allowing us to separate the roles of the teacher guidance and the training data distribution in transmitting misalignment. Together, these results argue for a data-centric view: Emergent/subliminal misalignment should not be treated as a simple consequence of isolated harmful fine-tuning examples, but as the result of interactions between fine-tuning data structure, pretraining distributions, and training channels.

87. 【2605.12770】WriteSAE: Sparse Autoencoders for Recurrent State

Link: https://arxiv.org/abs/2605.12770

Authors: Jack Young

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: recurrent language models, hybrid recurrent language, matrix cache write, language models, sparse autoencoder

Comments: 26 pages, 14 figures, 21 tables; code at [this https URL](https://github.com/JackYoung27/writesae)

Abstract:We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.

88. 【2605.12748】Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

Link: https://arxiv.org/abs/2605.12748

Authors: Heejin Do, Shashank Sonkar, Mrinmaya Sachan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: generate student-like responses, Large language models, fluently generate student-like, Large language, student-like responses

Abstract:Large language models (LLMs) can fluently generate student-like responses, making them attractive as simulated students for training and evaluating AI tutors and human educators. Yet such simulators are typically evaluated by output similarity to real students, not by whether they behave like students with coherent misconceptions during interaction. We introduce a controlled framework for evaluating misconception faithfulness: whether a simulator maintains a misconception-driven belief state and updates selectively when feedback addresses the underlying misconception. Central to our framework is a misconception-contrastive feedback protocol that compares targeted feedback against two controls: misaligned feedback (targeting a different but plausible misconception) and generic feedback (only indicating that the answer is wrong). We propose Selective Flip Score (SFS), which quantifies how much more often a simulator flips its answer under targeted feedback than under contrastive controls. Across seven LLMs (4B-120B), multiple datasets, and prompting strategies, simulators exhibit near-zero SFS, correcting their answers at similarly high rates regardless of feedback relevance. Further analyses reveal a sycophantic failure mode: models behave less like students with misconceptions and more like problem-solvers who treat any corrective signal as a cue to abandon the simulated belief and re-solve from internal knowledge. To address this, we develop a post-training pipeline spanning supervised fine-tuning (SFT), preference optimization, and reinforcement learning (RL) with an SFS-aligned reward; SFT yields notable gains up to +0.56, and SFS-aligned RL provides more consistent improvements than preference optimization. Our results establish misconception faithfulness as a challenging yet trainable property, motivating a shift from static output matching toward interactive, belief-aware student modeling.
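
The abstract above describes SFS only in words, so the following is a minimal sketch of one plausible reading, not the paper's definition. It assumes SFS is the flip rate under targeted feedback minus the mean flip rate under the two controls; the function names and the toy trial data are hypothetical.

```python
from statistics import mean


def flip_rate(flips: list[bool]) -> float:
    """Fraction of trials in which the simulator changed its answer."""
    if not flips:
        raise ValueError("need at least one trial")
    return sum(flips) / len(flips)


def selective_flip_score(
    targeted: list[bool], misaligned: list[bool], generic: list[bool]
) -> float:
    # ASSUMPTION: the score is a simple difference between the targeted
    # flip rate and the average of the two control flip rates. The paper
    # may normalise or aggregate differently.
    control_rate = mean([flip_rate(misaligned), flip_rate(generic)])
    return flip_rate(targeted) - control_rate


# Hypothetical simulator that flips mostly when feedback is on target.
selective = selective_flip_score(
    targeted=[True, True, True, False],
    misaligned=[True, False, False, False],
    generic=[True, False, False, False],
)

# Hypothetical sycophantic simulator that flips after any feedback.
sycophantic = selective_flip_score([True] * 4, [True] * 4, [True] * 4)

print(selective, sycophantic)
```

Under this reading, a simulator that corrects itself regardless of feedback relevance scores zero, which matches the near-zero SFS the abstract reports for the sycophantic failure mode.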

89. 【2605.12715】Scaling Laws for Mixture Pretraining Under Data Constraints

Link: https://arxiv.org/abs/2605.12715

Authors: Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: language models scale, low-resource languages, require grows, target data, target data sources

Abstract:As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
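
The trade-off in the abstract above comes down to simple bookkeeping: for a fixed token budget, the mixture weight on the scarce corpus determines how many times that corpus is repeated. The sketch below shows only this arithmetic, with hypothetical function names and budget figures; it does not reproduce the paper's repetition-aware scaling law.

```python
def target_repetitions(
    total_tokens: float, target_fraction: float, target_corpus_tokens: float
) -> float:
    """Passes over the scarce target corpus implied by a mixture weight."""
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    if target_corpus_tokens <= 0:
        raise ValueError("target corpus must be non-empty")
    return total_tokens * target_fraction / target_corpus_tokens


def fraction_for_repetitions(
    total_tokens: float, repetitions: float, target_corpus_tokens: float
) -> float:
    """Mixture weight needed to reach a desired number of repetitions."""
    return repetitions * target_corpus_tokens / total_tokens


# Hypothetical budget: 8B training tokens, 100M-token target corpus.
reps = target_repetitions(8e9, 0.25, 1e8)
weight = fraction_for_repetitions(8e9, 20, 1e8)
print(reps, weight)
```

With these made-up numbers, a 25% mixture weight corresponds to 20 passes over the target corpus, the upper end of the 15-20 repetitions the abstract reports as tolerable in mixture training.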

90. 【2605.12714】Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

Link: https://arxiv.org/abs/2605.12714

Authors: Jingzhou Jiang, Yi Yang, Kar Yan Tam

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Hidden states change, Filtration Mutual Information, layer-wise analyses focus, Graph Filtration Mutual, Hidden states

Abstract:Hidden states change substantially across the layers of modern language models, but most layer-wise analyses focus on one aspect of that change. We propose Layer-wise Representation Dynamics (LRD), a framework with three layer-wise measurement families: Frenet (Grassmann speed and curvature) for global subspace motion, Neighborhood Retention Score (NRS) for local nearest-neighbor retention, and Graph Filtration Mutual Information (GFMI) for alignment with the final layer. Applying LRD to 31 models (encoder-based and decoder-based embedders, plus base LLMs) on 30 MTEB tasks reveals architectural and task-level differences that are not apparent from final-layer representations alone. We then use LRD for two applications: label-free model selection and inference-time layer pruning. For selection, all three model-level scores correlate positively with downstream MTEB performance, with end-to-end subspace displacement ($d_{0,L}$) the strongest, and the same direction holds on a smaller base-LLM MMLU panel. For pruning, GFMI is the only measurement-guided rule that beats Random at the 15% and 20% budgets and has the best median change at every budget. Frenet is effective only at the lightest budget, while NRS does not transfer from model selection to pruning. These results show that layer-wise structure provides signal for both interpretation and deployment decisions.

91. 【2605.12671】All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

Link: https://arxiv.org/abs/2605.12671

Authors: Xi Chen, Mingyu Jin, Jingcheng Niu, Yutong Yin, Jinman Zhao, Bangwei Guo, Dimitris N. Metaxas, Zhaoran Wang, Yutao Yue, Gerald Penn

Categories: Computation and Language (cs.CL)

Keywords: Functional Anisotropy Hypothesis, Functional Anisotropy, large language models, near-unique internal mechanism, term the Functional

Comments: ICML 2026

Abstract:In this paper, we present empirical and theoretical evidence against a central but largely implicit assumption in circuit and sheaf discovery (CSD), which we term the Functional Anisotropy Hypothesis: the idea that functions in large language models (LLMs) are localised to a unique or near-unique internal mechanism. We show that a single LLM task can instead be supported by multiple, structurally distinct circuits or sheaves that are simultaneously faithful, sparse, and complete. To systematically uncover such competing mechanisms, we introduce Overlap-Aware Sheaf Repulsion, a method that augments the CSD objective with an explicit penalty on structural overlap across multiple discovery runs, enabling the discovery of circuits or sheaves with strong task performance but minimal shared structure across a plethora of common CSD benchmarks. We find that this phenomenon becomes increasingly pronounced as the number of discovered sheaves grows and persists robustly across major CSD methods. We further identify an ultra-sparse three-edge sheaf and show that none of its edges is individually indispensable, undermining even weakened notions of canonical or essential components. To explain these findings, we propose a Distributive Dense Circuit Hypothesis and provide a theoretical analysis demonstrating that non-unique, low-overlap circuit explanations arise naturally from high-dimensional superposition under mild assumptions. Together, our results suggest that mechanistic explanations in LLMs are inherently non-canonical and call for a rethinking of how CSD results should be interpreted and evaluated.

92. 【2605.12645】Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

Link: https://arxiv.org/abs/2605.12645

Authors: Maryam Amirizaniani, Benjamin Charles Germain Lee, Jevin West, Nicholas Weber

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Effective personalized question, language models requires, models requires grounding, requires grounding responses, implicit user intent

Abstract:Effective personalized question answering (PQA) in language models requires grounding responses in the user's underlying intent, where intent refers to the implicit "why" behind a query beyond its explicit wording. However, existing approaches to intent-aware personalization rely on multi-turn conversational context or rich user profiles, and do not explicitly model user intent during the reasoning process. This limits their effectiveness in single-turn settings, where the user's latent goal must be inferred from minimal input and integrated into the thinking and reasoning process. To bridge this gap, we propose IAP (Intent-Aware Personalization), a reinforcement learning framework that trains models to infer implicit user intent directly from a single-turn question and incorporate it into thinking steps through a tag-based schema for generating personalized, intent-grounded answers. By optimizing intent-aware answer trajectories under a personalized reward function, IAP reinforces generation paths that make implicit user intent explicit and produce responses that better align with the user's underlying goal. Through experiments on the LaMP-QA benchmark across six models, IAP consistently outperforms all baselines, achieving an average macro-score gain of around 7.5% over the strongest competitor, demonstrating that modeling implicit user intent within the training objective is a promising direction for PQA.

93. 【2605.12623】DocAtlas: Multilingual Document Understanding Across 80+ Languages

Link: https://arxiv.org/abs/2605.12623

Authors: Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: perpetuate existing biases, understanding remains limited, scarce training data, document understanding remains, low-resource languages due

Comments: Under submission

Abstract:Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.

94. 【2605.12530】In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Link: https://arxiv.org/abs/2605.12530

Authors: Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: LLM fairness, conversational behavior shifts, in-situ conversational behavior, conversational behavior, LLM

Abstract:LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test QA benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.

95. 【2605.12525】PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media

Link: https://arxiv.org/abs/2605.12525

Authors: Jian Liao, Yujin Zheng, Suge Wang, Jianxing Zheng, Deyu Li

Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Current emotion analysis, predominantly author-centric, failing to capture, Current emotion, capture the subjective

Abstract:Current emotion analysis in social media is predominantly author-centric, failing to capture the subjective nature of emotional responses across diverse readers. This paradigm overlooks the crucial link between individual perception, communication behavior, and the underlying social network. To bridge this gap, we introduce PERCEIVE, a novel bilingual (English and Chinese) large-scale benchmark that, to the best of our knowledge, is the first to integrate five critical dimensions for social perception: author-created content, genuine readers' emotional feedback (derived from their comments), communication behavior, user attributes, and the social graph. This benchmark enables a paradigm shift towards truly personalized, reader-centric analysis, where different readers' emotional responses to the same content are naturally captured through their real-world interactions. By annotating emotions from reader comments and synchronously capturing communication intent, PERCEIVE provides a unique resource to model the intrinsic coupling between emotion and behavior, grounded in social context. We establish a comprehensive evaluation protocol, testing state-of-the-art methods, including large language models (LLMs) with advanced reasoning enhancement. Our findings reveal significant shortcomings in existing approaches when handling this multifaceted, user-aware task. PERCEIVE offers a foundational resource and clear direction for future research in socially-intelligent NLP, pushing models towards a more unified understanding of emotion on social media.

96. 【2605.12523】Exploring how EFL students talk to and through AI to develop texts

Link: https://arxiv.org/abs/2605.12523

Authors: David James Woo, Yangyang Yu, Yilin Huang, Deliang Wang, Kai Guo, Chi Ho Yeung

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Generative Artificial Intelligence, Generative Artificial, Artificial Intelligence, considerations for English, Curricular Writing Task

Comments: 37 pages, 5 figures

Abstract:Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students' writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students' writing performance: content, language, and organization. Students' prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.

97. 【2605.12522】Differences in Text Generated by Diffusion and Autoregressive Language Models

Link: https://arxiv.org/abs/2605.12522

Authors: Zeyang Zhang, Chengwei Liang, Xingyan Chen, Meiqi Gu, Minrui Luo, Jingzhao Zhang, Tianxing He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion language models, autoregressive language models, language models, Diffusion language, text remain underexplored

Abstract:Diffusion language models (DLMs) are promising alternatives to autoregressive language models (ARMs), yet the intrinsic differences in their generated text remain underexplored. We first find empirically that off-the-shelf DLMs exhibit lower $n$-gram entropy, higher semantic coherence, and higher semantic diversity. To understand the cause, we conduct controlled experiments that decouple the effects of training objectives and decoding algorithms. Results suggest that the DLM training objective contributes to the increases in semantic coherence and semantic diversity, but has a minor influence on entropy. These differences are primarily driven by the bidirectional context; other components in the training objective, such as input masking, label masking, and the weighting function, have a much weaker influence. Further, our experiments demonstrate that the reduction in entropy stems from DLMs' decoding algorithms, particularly confidence-based remasking strategies. We provide a theoretical understanding for this entropy reduction phenomenon. Together, our work uncovers key mechanisms underlying the differences between DLMs and ARMs in text generation, and informs future design of training objectives and decoding algorithms in DLMs.
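
The abstract above names $n$-gram entropy without giving an estimator. The sketch below uses the plain Shannon entropy of the empirical $n$-gram distribution, which is one common choice and may differ from what the authors compute; the function name and toy token lists are illustrative.

```python
import math
from collections import Counter


def ngram_entropy(tokens: list[str], n: int) -> float:
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        raise ValueError("sequence is shorter than n")
    counts = Counter(ngrams)
    total = len(ngrams)
    # Sum -p * log2(p) over the observed n-grams.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy sequences: one repetitive, one with all tokens distinct.
repetitive = ngram_entropy("a b a b a b a b".split(), n=1)
varied = ngram_entropy("a b c d".split(), n=1)
print(repetitive, varied)
```

Lower values indicate that the text reuses the same $n$-grams more often, which is the direction the abstract reports for off-the-shelf DLMs.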

98. 【2605.12521】ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

Link: https://arxiv.org/abs/2605.12521

Authors: Dinesh Khandelwal, Gnana Prakash Punnavajhala, GPS Bhargav, Gaurav Pandey, Sachin Joshi, Hima Karanam, Dinesh Raghu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: training data required, autonomous agents, fundamental challenge, calling is essential, function as autonomous

Abstract:Multi-turn tool calling is essential for LLMs to function as autonomous agents, yet synthesizing the training data required for these capabilities remains a fundamental challenge. Existing synthetic data generation pipelines often produce unrealistic dialogues for two reasons: they chain tools that are only superficially compatible rather than aligned with meaningful user tasks, and they generate dialogues in one shot, which often introduces arguments that were neither provided by the user nor produced by prior tool calls. These issues also lead to a severe underrepresentation of multi-step tool interactions. We introduce ToolWeave, a structured framework for synthesizing realistic multi-turn tool-calling dialogues. ToolWeave supports realistic multi-step workflows (or tool sequences) by constructing tools with built-in dependencies and filtering the workflows based on alignment with user goals. It reduces parameter hallucination by using a fine-grained planning stage that explicitly tracks parameter provenance. As a result, ToolWeave-generated synthetic dialogues contain more multi-step tool interactions (45%) and fewer hallucinations in parameters and tool names. Consequently, LLMs fine-tuned on ToolWeave consistently outperform those fine-tuned on prior datasets across three public benchmarks. Notably, Llama-3.1-70B fine-tuned on ToolWeave achieves 39.75% on BFCL-V3 multi-turn, compared to 23.50% when fine-tuned on SOTA ToolFlow data.

99. 【2605.12520】BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

Link: https://arxiv.org/abs/2605.12520

Authors: Yancheng Ling, Zhenlin Qin, Leizhen Wang, Zhenliang Ma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: interpretable semantic hierarchies, zero-shot taxonomy induction, semantic hierarchies, crucial for organizing, organizing concepts

Comments: 13 pages, 7 figures

Abstract:Taxonomy induction is crucial for organizing concepts into explicit and interpretable semantic hierarchies. While existing methods have achieved promising results, their generalization, structural reliability, and efficiency remain limited, hindering their performance in zero-shot and large-scale scenarios. To overcome these limitations, we introduce BoostTaxo, a boosting-style LLM framework for zero-shot taxonomy induction. It takes a set of domain terms as inputs and performs parent identification in a coarse-to-fine manner, employing retrieval-augmented definition refinement, hybrid parent candidate selection, candidate rating, and structure-aware score calibration to improve taxonomy construction. Specifically, a lightweight LLM is used to efficiently filter candidate parents, while a large-scale LLM is employed to rank and score candidate parents for fine-grained parent selection. Structural features are further incorporated to calibrate candidate edge weights and enhance the reliability of the induced taxonomy. The unified BoostTaxo is evaluated on three public benchmark datasets, namely WordNet, DBLP, and SemEval-Sci, and achieves superior or comparable performance to state-of-the-art methods in zero-shot taxonomy induction. The ablation study validates the contribution of the hybrid parent candidate selection and the structure-aware score calibration to the overall performance. Further analysis investigates the impact of candidate selection size on taxonomy quality and presents representative case and failure studies, providing deeper insights into the effectiveness and limitations of the proposed framework.

100. 【2605.12519】Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Link: https://arxiv.org/abs/2605.12519

Authors: Kyuyoung Kim, Kevin Wang, Yunfei Xie, Peiyang Xu, Peiyao Sheng, Chen Wei, Zhangyang Wang, Jinwoo Shin, Pramod Viswanath, Sewoong Oh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Training language models, Training language, open challenge, produce both correct, correct answers

Comments: Preprint

Abstract:Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process-supervised models. A reasoning-space analysis further shows that, without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.

101. 【2605.12518】TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

Link: https://arxiv.org/abs/2605.12518

Authors: Liancheng Zhang, Xiaoxi Li, Zhicheng Dou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, extracting structured timelines, Large Reasoning Models, unstructured content, proliferation of online

Abstract:The proliferation of online news poses a challenge to extracting structured timelines from unstructured content. While recent studies have shown that Large Language Models (LLMs) can assist Timeline Summarization (TLS), these approaches primarily treat models as passive generators. The emergence of Large Reasoning Models (LRMs) presents an opportunity to reason over events actively, enabling iterative evidence acquisition, the detection of missing events, and the validation of temporal consistency. To systematically leverage the reasoning capabilities of LRMs, we propose TimelineReasoner, a novel framework that shifts TLS from static generation to an active, reasoning-driven process. Unlike prior work, TimelineReasoner adopts a two-stage framework: Global Cognition, which tracks events at a macroscopic level and continuously updates a global event memory, and Detail Exploration, which identifies informational gaps and refines the timeline via targeted document retrieval. To support this, TimelineReasoner incorporates several specialized mechanisms, including an Event Scraper for retrieving temporal event descriptions, a Timeline Updater for refining the timeline, and a Supervisor for detecting gaps in the timeline and guiding retrieval. Experimental results on open-domain TLS datasets demonstrate that TimelineReasoner significantly outperforms existing LLM-based TLS methods in terms of timeline accuracy, coverage, and coherence. On closed-domain TLS datasets, our method performs on par with or exceeds state-of-the-art approaches. This work not only pushes the boundaries of TLS but also highlights the broader potential of LRM-based reasoning frameworks for timeline summarization.

102. 【2605.12517】Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Link: https://arxiv.org/abs/2605.12517

Authors: Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee (Kim Jaechul Graduate School of AI, KAIST)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language models, Vision-language, Abstract, text-only, Latent Imagination Module

Comments: 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

Abstract:Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality.
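
The abstract above reports reduced calibration error without naming the metric. As a point of reference, the sketch below computes expected calibration error (ECE) with equal-width confidence bins, a widely used measure; that the paper uses ECE, and the bin count of 10, are assumptions here.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Bin-weighted average gap between accuracy and mean confidence."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equally sized, non-empty inputs")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy predictions: confidence matches accuracy in the first case only.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
print(calibrated, overconfident)
```

A model that is right three times out of four at 75% confidence scores zero, while one that claims full confidence and is right half the time scores 0.5.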

103. 【2605.12516】Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

Link: https://arxiv.org/abs/2605.12516

Authors: Saiful Islam Sagor, Tania Haghighi, Minhaj Nur Alam, Erina Baynojir Joyee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: General-purpose large language, General-purpose large, limited domain grounding, generate reliable responses, large language models

Abstract:General-purpose large language models (LLMs) often struggle to generate reliable responses in specialized engineering domains due to limited domain grounding and insufficient exposure to structured technical knowledge. This study investigates practical strategies for adapting a foundation LLM to the additive manufacturing (AM) domain in order to improve answer accuracy, relevance, and usability for expert-level question answering. AM knowledge is distributed across heterogeneous sources such as academic literature, manufacturer documentation, technical standards, and procedural guides. Although general LLMs demonstrate strong linguistic capabilities, they frequently fail to retrieve and contextualize such domain-specific information. Two common approaches to address this limitation are domain-specific fine-tuning and retrieval-augmented generation (RAG). We construct a curated AM corpus and evaluate three configurations based on LLaMA-3-8B: (1) the pretrained baseline model, (2) a RAG system that retrieves relevant document chunks from a vector database, and (3) a model fine-tuned on raw domain text. Performance is evaluated using 200 expert-designed AM questions assessed by mechanical engineering experts for accuracy, relevance, and overall preference. Results show that the RAG model consistently outperforms the baseline. Among the 200 questions, 75.5% of RAG responses are judged more accurate, 85.2% are preferred overall, and 90.8% are rated more relevant than baseline responses. In contrast, fine-tuning on raw AM text reduces performance, producing more accurate answers in only 5.6% of cases and more relevant answers in 32.5% of cases. These results indicate that retrieval-augmented approaches provide a more effective pathway for adapting LLMs to specialized engineering domains than naive fine-tuning on unstructured technical data.

104. 【2605.12515】Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Link: https://arxiv.org/abs/2605.12515

Authors: Lucas Resck, Isabelle Augenstein, Anna Korhonen

Subjects: Computation and Language (cs.CL)

Keywords: exhibit inconsistent behaviour, frequently exhibit inconsistent, multilingual large language, impressive capabilities, multilingual large

Comments: 22 pages, 13 figures, 9 tables

Abstract:Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona, yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $\kappa_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.10-point absolute increase in $\kappa_S$ over unaligned models, outperforming strong prompting and representation steering baselines. Empirical evaluations show this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. A layer-wise interpretability analysis reveals the underlying mechanism: by early-decoding intermediate layer representations, we find that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.
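For orientation, the standard Fleiss' kappa that agreement metrics of this kind build on can be computed as in the minimal Python sketch below. This is the textbook statistic, not the paper's Singleton variant $\kappa_S$, whose definition the abstract does not give. Treating each prompt language as a "rater" and each query as an item is an assumption made here for illustration, and the toy tables are invented.

```python
def fleiss_kappa(table):
    """Standard Fleiss' kappa.

    table[i][j] = number of raters that assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])

    # Overall share of assignments falling into each category.
    p_j = [sum(row[j] for row in table) / (n_items * n_raters)
           for j in range(n_cats)]

    # Per-item agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]

    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 2 queries, 3 prompt languages acting as "raters",
# 2 candidate answers as categories.
consistent = [[3, 0], [0, 3]]   # every language gives the same answer
mixed = [[2, 1], [1, 2]]        # one language disagrees on each query

print(fleiss_kappa(consistent))
print(fleiss_kappa(mixed))
```

Higher values indicate that the answer depends less on which language the prompt was written in; values at or below zero indicate agreement no better than chance.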

105. 【2605.12510】WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection

Link: https://arxiv.org/abs/2605.12510

Authors: Jônatas H. dos Santos, Julio C. S. Reis, Philipe Melo, João F. H. Olivetti, Thales H. Silva, Matheus Gontijo Guimaraes, Glaucio de Souza, Marcos A. Gonçalves, Fabricio Benevenuto, Filipe B. B. Zanovello, Marco A. G. Rodrigues, Cristiano X. Lima

Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Brazilian public groups, multiple pandemic years, large Brazilian public, public groups spanning, groups spanning multiple

Comments: 10 pages. This is a preprint version of a paper accepted for the International AAAI Conference on Web and Social Media (ICWSM'26). Please cite the conference version rather than this preprint

Abstract:We introduce WhaVax, a new expert-annotated dataset of vaccine-related WhatsApp messages collected from large Brazilian public groups spanning multiple pandemic years. The dataset was constructed through a rigorous, carefully designed pipeline that integrates keyword-based data collection, semantic deduplication to remove near-duplicate content, and a multi-stage annotation protocol conducted by medical specialists. This process produced a high-quality gold-standard corpus, characterized by substantial inter-annotator agreement and strong reliability for downstream analysis. Additionally, we provide a detailed characterization of WhatsApp misinformation, revealing distinctive linguistic, structural, lexical, temporal, and group-level patterns, as well as a meaningful layer of ambiguous cases that reflect the complexity of health discourse in private messaging. We also benchmark classical models, fine-tuned Small Language Models, and zero- or few-shot Large Language Models under realistic data-scarcity constraints, demonstrating that strong embeddings and LLM approaches perform competitively, while domain alignment and data availability remain critical factors. This study provides a rare, high-quality resource to support misinformation research and computational modeling in encrypted communication environments.

106. 【2605.13188】LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

Link: https://arxiv.org/abs/2605.13188

Authors: Stef van Buuren

Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)

Keywords: Large language models, Large language, language models, increasingly deployed, deployed in settings

Comments: 9 pages, 3 figures, 2 tables, NeurIPS 2026 position paper

Abstract:Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imputation (MI) literature: uncertainty should scale with the amount of missing information. We assess this criterion on SQuAD, using a controlled framework in which context availability is varied across five levels. We evaluate two answer-level uncertainty measures that can be estimated from repeated sampling: sampling-based confidence (empirical mode frequency) and response entropy. Confidence fails to reflect increasing missingness: it remains high even as accuracy collapses. Entropy, by contrast, increases with context removal, consistent with the MI analogy, and explains substantially more variance in accuracy than confidence across all evidence levels (quadratic $R^2$ gap up to 0.057). We further introduce a black-box diagnostic $\rho_R(\alpha)$ that estimates the proportion of baseline uncertainty resolved by context level $\alpha$, requiring only repeated sampling with and without context. These results suggest that entropy is a more responsive black-box uncertainty measure than confidence under incomplete context.
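The two answer-level measures named in the abstract, sampling-based confidence (empirical mode frequency) and response entropy, can be estimated from repeated samples roughly as follows. This is a minimal sketch, not the paper's code: the function names, the use of exact string matching to group answers, the base-2 logarithm, and the toy answer lists are all assumptions.

```python
import math
from collections import Counter


def mode_confidence(answers):
    """Empirical mode frequency: share of samples equal to the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def response_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over distinct answers."""
    n = len(answers)
    return sum((c / n) * math.log2(n / c) for c in Counter(answers).values())


# Hypothetical samples for one question at two context-availability levels.
full_context = ["Paris"] * 10
degraded_context = ["Paris"] * 6 + ["Lyon"] * 2 + ["Nice", "Lille"]

for name, sample in [("full", full_context), ("degraded", degraded_context)]:
    print(name, mode_confidence(sample), round(response_entropy(sample), 3))
```

In the toy data the mode frequency stays above one half after context is removed, while the entropy moves from zero to well above one bit. That is the kind of difference in responsiveness the abstract describes, although real behaviour depends on the model and the dataset.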

Information Retrieval

1. 【2605.13764】VectorSmuggle: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense

Link: https://arxiv.org/abs/2605.13764

Authors: Jascha Wanger

Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Modern retrieval-augmented generation, resulting numerical artifacts, Modern retrieval-augmented, systems convert sensitive, convert sensitive content

Comments: 47 pages, 3 figures. Reference implementations: [this https URL](https://github.com/jaschadub/VectorSmuggle) and [this https URL](https://github.com/jaschadub/VectorPin)

Abstract:Modern retrieval-augmented generation (RAG) systems convert sensitive content into high-dimensional embeddings and store them in vector databases that treat the resulting numerical artifacts as opaque. Major vector-store products do not provide native controls for embedding integrity, ingestion-time distributional anomaly detection, or cryptographic provenance attestation. We show this opens a class of steganographic exfiltration attacks: an attacker with write access to the ingestion pipeline can hide payload data inside embeddings using simple post-embedding perturbations (noise injection, rotation, scaling, offset, fragmentation, and combinations thereof) while preserving the surface-level retrieval behavior the RAG system exposes to legitimate users. We evaluate these techniques across a synthetic-PII corpus on text-embedding-3-large, four locally hosted open embedding models, a cross-corpus replication on BEIR NFCorpus and a Quora subset (over 26,000 chunks combined), seven vector-store configurations, an adaptive-attacker variant of the detector evaluation, and a paraphrased-query retrieval benchmark. Distribution-shifting perturbations are often caught by simple anomaly detectors; small-angle orthogonal rotation defeats distribution-based detection across every (model, corpus) pair tested. A disjoint-Givens rotation encoder gives a closed-form per-vector capacity ceiling of floor(d/2) * b bits, but real embedding manifolds impose a capacity-detectability trade-off, and the retrieval-preserving operating point sits well below it. We propose VectorPin, a cryptographic provenance protocol that pins each embedding to its source content and producing model via an Ed25519 signature over a canonical byte representation. Any post-embedding modification breaks signature verification. Embedding-level integrity is a deployable, standardizable control that closes this attack class.
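Two claims in the abstract lend themselves to a quick numerical check: the capacity ceiling of floor(d/2) * b bits, and the observation that a small-angle rotation leaves a vector's norm untouched while any integrity check over its bytes changes. The Python sketch below is illustrative only. The helper names, the toy vector, the angles, and the choice b = 4 are assumptions, and a SHA-256 digest stands in for the Ed25519 signature that VectorPin is described as using.

```python
import hashlib
import math
import struct


def capacity_bits(d, b):
    """Ceiling quoted in the abstract: floor(d/2) * b bits per vector."""
    return (d // 2) * b


def rotate_disjoint_planes(vec, angles):
    """Rotate coordinate pairs (0,1), (2,3), ... by the given angles in radians."""
    out = list(vec)
    for k, theta in enumerate(angles):
        i, j = 2 * k, 2 * k + 1
        c, s = math.cos(theta), math.sin(theta)
        out[i] = c * vec[i] - s * vec[j]
        out[j] = s * vec[i] + c * vec[j]
    return out


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def digest(vec):
    """SHA-256 over little-endian float64 bytes (stand-in for a signed payload)."""
    return hashlib.sha256(struct.pack(f"<{len(vec)}d", *vec)).hexdigest()


v = [0.5, -0.2, 0.1, 0.7, 0.3, 0.4]    # toy 6-dimensional "embedding"
angles = [0.01, -0.02, 0.015]          # hypothetical small rotation angles
w = rotate_disjoint_planes(v, angles)

print(capacity_bits(3072, 4))          # d = 3072, b = 4 chosen for illustration
print(abs(norm(v) - norm(w)), cosine(v, w))
print(digest(v) == digest(w))
```

The rotated vector has the same norm and is nearly parallel to the original, so statistics based on norms or coarse distributions see nothing unusual, yet the byte-level digest no longer matches. This is the intuition behind pinning each embedding to a signature computed at ingestion time.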

2. 【2605.13593】Benchmarking the Open Science Data Federation services to develop XRootD best practices

Link: https://arxiv.org/abs/2605.13593

Authors: Fabio Andrijauskas, Igor Sfiligoi, Frank Würthwein

Subjects: Information Retrieval (cs.IR)

Keywords: Science Data Federation, Open Science Data, National Research Platform, data sharing, dependent on processing

Abstract:Research has become dependent on processing power and storage, one crucial aspect being data sharing. The Open Science Data Federation (OSDF) project aims to create a scientific global data distribution network based on the Pelican Platform. OSDF relies on the XRootD and Pelican projects. Nevertheless, OSDF must understand the XRootD limits under various configuration options, including transfer rate limits, proper buffer configuration, and storage type effect. We have thus executed a set of benchmarks to create a set of recommendations to share with the XRootD and Pelican teams. This work describes the tests and results performed using National Research Platform (NRP) hosts. The tests cover various file sizes and parallel streams and use clients from various distances from the server host. We also used several standalone clients (wget, curl, pelican) and the native HTCondor file transfer mechanisms. Applying the methodology creates a possibility to track how XRootD and the Pelican layer perform in different scenarios.

3. 【2605.13521】Granite Embedding Multilingual R2 Models

Link: https://arxiv.org/abs/2605.13521

Authors: Parul Awasthy, Aashka Trivedi, Yushu Yang, Ken Barker, Yulong Li, Bhavani Iyer, Martin Franz, Meet Doshi, Riyaz Bhat, Vignesh P, Vishwajeet Kumar, Todd Ward, Abraham Daniels, Rudra Murthy, Madison Lee, Luis Lastras, Jaydeep Sen, Radu Florian

Subjects: Information Retrieval (cs.IR)

Keywords: multilingual Granite Embedding, enterprise-scale dense retrieval, Granite Embedding, multilingual Granite, encoder-based embedding models

Abstract:We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight, and released under the Apache 2.0 license at this https URL, designed to support responsible use and enable unrestricted research and enterprise adoption.

4. 【2605.13497】Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models

Link: https://arxiv.org/abs/2605.13497

Authors: Xinye Wanyan, Chenglong Ma, Danula Hettiachchi, Ziqi Xu, Jeffrey Chan

Subjects: Information Retrieval (cs.IR)

Keywords: Large Language Model, Large Language, Language Model, modern recommender systems, based agent simulation

Comments: Accepted by SIGIR 2026

Abstract:Large Language Model (LLM)-based agent simulation has emerged as a promising approach to meet the increasing demand for real-time and rigorous evaluation in modern recommender systems. A typical LLM-driven simulation framework comprises three essential components: the profile module, memory module, and action module. However, existing studies have primarily concentrated on enhancing the memory and action modules, with limited attention to profile generation, which plays a pivotal role in ensuring realistic agent behaviours and aligning simulated interactions with real user dynamics. Moreover, the scarcity of datasets specifically designed for recommendation simulations has led to heavy reliance on manually crafted profiles, significantly limiting the scalability and generalisability of simulation frameworks across different datasets. To address these challenges, this work proposes an Automated Profile Generation Framework for Recommendation Simulation, APG4RecSim, that constructs realistic, coherent, and robust user profiles with minimal supervision. Extensive experiments on three benchmark datasets demonstrate that APG4RecSim achieves the best overall performance on discrimination, ranking, and rating tasks, improving ranking quality by up to 7% in nDCG@10 and reducing rating distribution divergence by 8% in JSD compared to existing profile-generation baselines. Beyond overall performance gains, our results show that profiles generated by APG4RecSim are resilient to popularity- and position-induced biases and maintain stable performance across datasets and different LLMs.
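The two metrics quoted in the abstract, nDCG@10 for ranking quality and Jensen-Shannon divergence (JSD) for rating distributions, follow standard definitions that can be sketched in a few lines of Python. The function names and example inputs below are hypothetical; the paper's exact evaluation protocol (for instance its gain function or logarithm base) is not stated in the abstract, so linear gains and base-2 logarithms are assumed here.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical ranked list: relevance of each recommended item, top first.
ranking = [0, 1, 0, 0, 1]
# Hypothetical rating distributions over 1-5 stars: real users vs. simulated agents.
real = [0.05, 0.10, 0.20, 0.35, 0.30]
simulated = [0.02, 0.08, 0.25, 0.40, 0.25]

print(round(ndcg_at_k(ranking, k=10), 4))
print(round(jsd(real, simulated), 4))
```

With base-2 logarithms JSD is bounded between 0 (identical distributions) and 1 (disjoint support), so a lower value means the simulated ratings track the real ones more closely.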

5. 【2605.13311】IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

Link: https://arxiv.org/abs/2605.13311

Authors: Joy Bose

Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: intermediate reasoning structure, preserve intermediate reasoning, Design Thinking, Current AI-assisted innovation, systems typically apply

Comments: 14 pages, 3 figures, 6 tables

Abstract:Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.

6. 【2605.13310】SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem

Link: https://arxiv.org/abs/2605.13310

Authors: Abdul Rafay, Yuni Susanti, David Lamprecht, Michael Färber

Subjects: Digital Libraries (cs.DL); Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: million triples describing, RDF knowledge graph, RDF knowledge, knowledge graph comprising, million triples

Abstract:We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.

7. 【2605.13292】IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Link: https://arxiv.org/abs/2605.13292

Authors: Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: limiting conversational realism, dialogue systems operate, existing medical dialogue, medical dialogue systems, single-turn question

Comments: Accepted in BioNLP @ ACL 2026 Conference

Abstract:Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

8. 【2605.13277】Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.13277

Authors: Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Visual evidence selection, multimodal retrieval-augmented generation, Visual evidence, existing methods typically, methods typically rely

Comments: Accepted to ACL 2026

Abstract:Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.

9. 【2605.13137】LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

Link: https://arxiv.org/abs/2605.13137

Authors: Guoxiong Gao, Zeming Sun, Jiedong Jiang, Yutong Wang, Jingda Xu, Peihao Wu, Bryan Dai, Bin Dong

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Proving theorems, identifying a scattered, joint use enables, enables a concise, call global premise

Abstract:Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof, a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeline, achieving state-of-the-art single-query retrieval without domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch-retrieve-reflect cycles. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open-sourced all code, data, and benchmarks. Code and data: this https URL. The standard mode is publicly available with API access at this https URL.

10. 【2605.13110】A Multi-Agent Orchestration Framework for Venture Capital Due Diligence

Link: https://arxiv.org/abs/2605.13110

Authors: Grigorios Alexandrou, Katerina Pramatari

Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: fully automated multi-agent, automated multi-agent framework, corporate due diligence, Large Language Models, combining Large Language

Comments: 13 pages, 1 figure

Abstract:We present a fully automated multi-agent framework for corporate due diligence and market analysis in venture capital. The system runs on an event-driven orchestration architecture, combining Large Language Models (LLMs) with real-time web retrieval to synthesize unstructured data into structured investment intelligence. A central technical contribution is a programmatic extraction pipeline that reverse-engineers the frontend-to-backend communication of the Greek Business Registry (Γ.Ε.ΜΗ.), querying dynamic endpoints to retrieve official financial filings that are then parsed using a layout-aware OCR extractor. A structural fallback mechanism explicitly flags data absence rather than generating unverified figures, directly targeting hallucination in financial contexts. All workflow artifacts are publicly available to support replication.

11. 【2605.13053】A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset

Link: https://arxiv.org/abs/2605.13053

Authors: Ivica Kostric, Krisztian Balog

Subjects: Information Retrieval (cs.IR)

Keywords: Recent years, surge of research, conversational recommender systems, Recent, conversational recommender

Comments: Accepted to Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20-24, 2026, Melbourne, VIC, Australia

Abstract:Recent years have seen a surge of research into conversational recommender systems (CRS). Among existing datasets, ReDial is the most widely used benchmark, cited in hundreds of studies. However, variations in how the dataset is preprocessed and used in experiments, particularly in the definition of ground-truth items, make it difficult to compare results across studies. These comparisons are further complicated by confounding factors such as the choice of the underlying large language model (LLM) and the use of external data sources. In this work, we revisit seven prominent CRS methods across three architectural families and evaluate them under standardized conditions. Our reproducibility study reveals a "granularity gap," where fine-grained ranking (Recall@1) is highly sensitive to implementation details, while our replicability analysis shows that nearly 50% of reported accuracy stems from "repetition shortcuts" that are absent in novelty-focused evaluation. Furthermore, we find that performance gains are often driven more by the capacity of the LLM backbone than by specific architectural innovations. Finally, by applying user-centric utility metrics, we demonstrate that traditional recall frequently overstates a system's actual conversational effectiveness. This work establishes a transparent, controlled baseline and promotes evaluation practices that prioritize novelty and interaction efficiency.

12. 【2605.13052】RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search

Link: https://arxiv.org/abs/2605.13052

Authors: Tingyu Chen, Wenkai Zhang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: remains challenging due, highly varied lifespans, intent remains challenging, commercial web search, Large Language Models

Comments: Accepted at SIGIR 2026. Final version: [this https URL](https://doi.org/10.1145/3805712.3808457)

Abstract:In commercial web search, aligning content freshness with user intent remains challenging due to the highly varied lifespans of information. Traditional industrial approaches rely on static time-window filtering, resulting in "one-size-fits-all" rankings where content may be chronologically recent but semantically expired. To address this limitation, we present a novel Large Language Model (LLM)-based Query-Aware Dynamic Content Expiration Prediction Framework deployed in Baidu search, reformulating timeliness as a dynamic validity inference task. Our framework extracts fine-grained temporal contexts from documents and leverages LLMs to deduce a query-specific "validity horizon": a semantic boundary defining when information becomes obsolete based on user intent. Integrated with robust hallucination mitigation strategies to ensure reliability, our approach has been evaluated through offline and online A/B testing on live production traffic. Results demonstrate significant improvements in search freshness and user experience metrics, validating the effectiveness of LLM-driven reasoning for solving semantic expiration at an industrial scale.

13. 【2605.13034】ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Link: https://arxiv.org/abs/2605.13034

Authors: Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, Xiang Jing

Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: large language models, Recent deep research, Recent deep, produce long, improved the ability

Abstract:Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

14. 【2605.12988】Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

Link: https://arxiv.org/abs/2605.12988

Authors: Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram

Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Keywords: unfamiliar problem instances, debug reasoning errors, Students learning algorithms, interpret traces, problem instances

Comments: Paper accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), co-located with ACL 2026

Abstract:Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students' algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE's feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.

15. 【2605.12905】Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings

Link: https://arxiv.org/abs/2605.12905

Authors: Ayuto Tsutsumi, Ryosuke Kohita

Categories: Information Retrieval (cs.IR)

Keywords: rain can convey, convey hope, hope and warmth, sorrow and finality, context

Comments: SIGIR 2026 (short paper)

Click to view abstract

Abstract:A scene of two people in the rain can convey hope and warmth in a reunion story or sorrow and finality in a farewell story. We investigate this context-dependent nature of image meaning and its implications for retrieval. Our key observation is that context dependency correlates with semantic abstraction: concrete elements (objects, actions) remain stable across contexts, while abstract elements (atmosphere, intent) shift with context. We operationalize this as the L1-L4 framework, organizing image semantics from context-independent (L1) to maximally context-dependent (L4). Using synthetic story contexts and queries for controlled evaluation, we examine how injecting narrative context into embeddings affects retrieval across abstraction levels. Concrete queries are retrievable without context, while abstract levels increasingly depend on narrative grounding. Where context is injected also matters, with image-side enrichment proving particularly effective. The most abstract level, however, remains challenging even with full context, highlighting context-dependent image retrieval as an important open problem. Our framework and findings lay groundwork toward retrieval systems that handle the context-dependent meanings images acquire in narrative settings.

16. 【2605.12887】EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents

Link: https://arxiv.org/abs/2605.12887

Authors: Hengwei Ye, Jiasheng Mao, Zhenhan Guan, Zheng Tian

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Generative Engine Optimization, Existing Generative Engine, Web-enabled LLM agents, Engine Optimization, online information influences

Comments:

Click to view abstract

Abstract:Web-enabled LLM agents are changing how online information influences search outcomes. Existing Generative Engine Optimization (GEO) studies mainly focus on individual webpages. However, agentic web search is not a single-document setting: an agent may issue queries, crawl pages, follow links, reformulate searches, and synthesize evidence across multiple browsing steps. Influence therefore depends not only on page content, but also on how pages are organized, connected, and encountered along the agent's browsing trajectory. We study this shift through Ecosystem Generative Engine Optimization (EcoGEO), which treats GEO as an environment-level influence problem for web-enabled LLM agents. To instantiate this perspective, we propose TRACE, a Trajectory-Aware Coordinated Evidence Ecosystem. Given a recommendation query and a fictional target product, our method builds a controlled evidence environment that coordinates an agent-facing navigation entry page with heterogeneous support pages. These pages use shared terminology, internal links, and consistent product attributes to introduce, verify, and reinforce the target product. We evaluate our method on OPR-Bench, a benchmark for open-ended product recommendation. Experiments show that it consistently outperforms page-level GEO baselines in final target recommendation. Trajectory-level metrics further show increased initial target-result crawls, target-specific follow-up searches, and internal-link crawls, suggesting that the gains come from shaping the agent's evidence-acquisition process rather than merely adding more target-related content. Overall, our findings support an ecosystem research paradigm for GEO, where web-enabled LLM agents are studied in relation to the broader evidence environments that guide search, browsing, and answer synthesis.

17. 【2605.12617】MLPs are Efficient Distilled Generative Recommenders

Link: https://arxiv.org/abs/2605.12617

Authors: Zitian Guo, Yupeng Hou, Clark Mingxuan Ju, Neil Shah, Julian McAuley

Categories: Information Retrieval (cs.IR)

Keywords: employing Semantic IDs, exhibit strong potential, models employing Semantic, Semantic IDs, employing Semantic

Comments:

Click to view abstract

Abstract:Generative recommendation models employing Semantic IDs (SIDs) exhibit strong potential, yet their practical deployment is bottlenecked by the high inference latency of beam-expanded autoregressive decoding. In this work, we identify that standard attention-heavy Transformer decoders represent a structural overkill for this task: the hierarchical nature of SIDs makes prediction difficulty drop sharply after the first token, rendering repeated attention computations highly redundant. Driven by this insight, we propose SID-MLP, a lightweight MLP-centric distillation framework that fundamentally simplifies the decoding paradigm for GR. Instead of executing complex, step-by-step attention mechanisms, our approach captures the global user context in a single operation, decoupled from sequential token prediction. We then distill the heavy autoregressive teacher into position-specific MLP heads, eliminating the dense attention overhead while preserving prefix and context dependencies. Extensive experiments demonstrate that SID-MLP matches the accuracy of teacher models while accelerating inference by 8.74x. Crucially, this distillation strategy can serve as a plug-and-play accelerator for different backbones and tokenizer settings. Furthermore, we introduce SID-MLP++, extending our distillation framework to replace the Transformer encoder, unlocking further latency reductions. Ultimately, our work reveals that decoder-side MLP distillation is an effective acceleration path for structured SID recommendation, while full encoder replacement offers an additional speed-accuracy trade-off.

18. 【2605.12613】Creating Group Rules with AI: Human-AI Collaboration in WhatsApp Moderation

Link: https://arxiv.org/abs/2605.12613

Authors: Gauri Nayak, Farhana Shahid, Aditya Vashistha, Kiran Garimella

Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: messaging platforms globally, users sharing information, platforms globally, widely used messaging, messaging platforms

Comments: CSCW 2026

Click to view abstract

Abstract:WhatsApp is one of the most widely used messaging platforms globally, with billions of users sharing information in private groups. Yet, it offers little infrastructure to support moderation and group governance. In the absence of platform-level oversight, group admins bear the responsibility of governing group behavior. In this paper, we explore how WhatsApp group admins collaborate with AI tools to create, enforce, and maintain group rules. Drawing on a two-phase speculative design study with 20 admins in India, we examine how participants interacted with an AI assistant (Meta AI) to co-create rules and responded to a series of probes illustrating AI-assisted moderation features. Our findings show that while admins appreciated the AI's ability to surface overlooked rules and reduce their moderation burden, they were highly sensitive to issues of relational trust, data privacy, tone, and social context. We identify how group type and admin style shaped their willingness to delegate authority, and surface the limitations of current chatbot interfaces in supporting collaborative rule-making. We conclude with design implications for building moderation tools that center human judgment, relational nuance, contextual adaptability, and collective governance.

19. 【2605.12527】Beyond Centralization: User-Controlled Federated Recommendations in Practice

Link: https://arxiv.org/abs/2605.12527

Authors: Manel Slokom, Alejandro Bellogin

Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Keywords: typically require centralized, raising privacy concerns, systems typically require, require centralized user, typically require

Comments:

Click to view abstract

Abstract:Recommendation systems typically require centralized user data, limiting user control and raising privacy concerns. Federated learning offers an alternative by keeping data on-device, but its impact on real user behavior remains largely unexplored. We present a live federated recommender system that allows users to control the recommendation objective while keeping their data local. In a 53-day deployment with 22 participants and a catalog of 8807 titles, users interacted with recommendations and switched between personalization and diversity-enhanced ranking. We find that users prefer personalization when given explicit choice (65.37% vs. 62.07% CTR), actively engage with control mechanisms (3.93/5 satisfaction; 248 settings changes), and develop an understanding of how their interactions affect recommendations through immediate feedback. Our results show that user control, privacy, and effective personalization can be combined in a working system. We demonstrate a practical approach to interactive, privacy-preserving recommendation. Code and demo materials are available at: this https URL

Computer Vision

1. 【2605.13838】R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

Link: https://arxiv.org/abs/2605.13838

Authors: Zijie Wu, Lixin Xu, Puhua Jiang, Sicong Liu, Chunchao Guo, Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: holds immense potential, animation holds immense, content creation, offering intuitive, holds immense

Comments: Accepted by SIGGRAPH 2026, Project Page: [this https URL](https://r-dmesh.github.io/) Code URL: [this https URL](https://github.com/Tencent-Hunyuan/R-DMesh)

Click to view abstract

Abstract:Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.

2. 【2605.13835】Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

Link: https://arxiv.org/abs/2605.13835

Authors: Hao Sun, Zi-Jun Ding, Da-Wei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Class-Incremental Learning, continuously integrate, integrate new knowledge, CIL, enables models

Comments:

Click to view abstract

Abstract:Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
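
To make the optimal-transport alignment step concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the abstract only says optimal transport is used, so the entropic Sinkhorn solver, the uniform marginals, the cosine cost, and every name (`sinkhorn_plan`, `patch_tokens`, `semantic_tokens`) are assumptions made for illustration, and the features are random stand-ins.

```python
import numpy as np

def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropic-regularized transport plan between uniform marginals (assumed setup)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each patch token
    b = np.full(m, 1.0 / m)   # mass on each semantic token
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)     # rescale columns toward b
        u = a / (K @ v)       # rescale rows toward a (exact after this step)
    return u[:, None] * K * v[None, :]

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(8, 32))     # stand-in for selected patch features
semantic_tokens = rng.normal(size=(4, 32))  # stand-in for description token features

sim = cosine_similarity(patch_tokens, semantic_tokens)
plan = sinkhorn_plan(1.0 - sim)             # cost = 1 - cosine similarity
alignment_score = float(np.sum(plan * sim)) # transport-weighted similarity
print("plan shape:", plan.shape)
print("alignment score:", alignment_score)
```

The plan spreads each patch's mass over the semantic tokens it resembles most, so the weighted similarity can act as a class score; how SPA actually turns the plan into a prediction is not stated in the abstract.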

3. 【2605.13833】QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Link: https://arxiv.org/abs/2605.13833

Authors: Hoang-Quan Nguyen, Sankalp Pandey, Khoa Luu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: sequential data remains, machine learning, data remains, remains a central, Modeling long-range dependencies

Comments:

Click to view abstract

Abstract:Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit their ability to capture complex global interactions across tokens. In this work, we introduce one of the first studies to leverage the superposition property of quantum systems to enhance state-based sequence modeling. In particular, we propose Quantum Long-Attention Memory (QLAM), a hybrid quantum-classical memory mechanism that can be viewed as a quantum extension of state-space models. Instead of maintaining a classical latent state updated through additive dynamics, QLAM represents the hidden state as a quantum state whose amplitudes encode a superposition of historical information. The state evolves through parameterized quantum circuits conditioned on the input, enabling a non-classical, global update mechanism. In this way, QLAM preserves the recurrent and linear-time structure of SSMs while fundamentally enriching the memory representation through quantum superposition. Unlike attention mechanisms that explicitly compute pairwise interactions, QLAM implicitly captures global dependencies through the evolution of the quantum state, and retrieves task-relevant information via query-dependent measurements. We evaluate QLAM on sequential variants of standard image classification benchmarks, including sMNIST, sFashion-MNIST, and sCIFAR-10, where images are flattened into token sequences. Across all tasks, QLAM consistently improves over recurrent baselines and transformer-based models.

4. 【2605.13831】Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Link: https://arxiv.org/abs/2605.13831

Authors: Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling sustained context, sustained context management, modern large vision-language, video analysis, enabling sustained

Comments: work in progress

Click to view abstract

Abstract:Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

5. 【2605.13825】History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Link: https://arxiv.org/abs/2605.13825

Authors: Alberto G. Rodríguez Salgado

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: tool calls produced, prior tool calls, LLMs are increasingly, increasingly deployed, deployed as agents

Comments: 12 pages, 3 figures

Click to view abstract

Abstract:Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.

6. 【2605.13815】OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

Link: https://arxiv.org/abs/2605.13815

Authors: Youquan Liu, Weidong Yang, Ao Liang, Xiang Xu, Lingdong Kong, Yang Wu, Dekai Zhu, Xin Li, Runnan Chen, Ben Fei, Tongliang Liu, Wanli Ouyang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: diverse sensing conditions, synthetic data creation, sensing conditions, LiDAR scene generation, increasingly important

Comments: Preprint; 12 pages, 7 figures, 10 tables

Click to view abstract

Abstract:LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.

7. 【2605.13813】JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

Link: https://arxiv.org/abs/2605.13813

Authors: Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Anatomically Guided Gating, Vision Transformers provide, Automated CT triage, simultaneously accurate, accurate across diverse

Comments:

Click to view abstract

Abstract:Automated CT triage requires models that are simultaneously accurate across diverse pathologies and reliable under institutional shift. While Vision Transformers provide strong visual representations, many clinically significant findings are defined by quantitative imaging biomarkers rather than appearance alone. We introduce JANUS, a physiology-guided dual-stream architecture that conditions visual embeddings on macro-radiomic priors via Anatomically Guided Gating. On the MERLIN test set (N=5082), JANUS attains macro-AUROC 0.88 and AUPRC 0.74, outperforming all reproduced baselines. It generalizes to an external dataset (N=2000; AUROC 0.87), with the largest gains on findings defined by size and attenuation as well as improved calibration on both datasets. We further quantify prediction suppression using the Physiological Veto Rate (PVR), showing that under domain shift JANUS reduces high-confidence false positives substantially more often than true positives. Together, these results are consistent with physically grounded conditioning that improves both discrimination and reliability in CT triage. Code is made publicly available at github repository this https URL and model weights are at this https URL.

8. 【2605.13803】EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Link: https://arxiv.org/abs/2605.13803

Authors: Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: input and localizes, temporal grounding, Video temporal grounding, untrimmed video, natural-language query

Comments: Project page: [this https URL](https://minjoong507.github.io/projects/EvoGround/)

Click to view abstract

Abstract:Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. Through this self-reinforcing reinforcement-learning loop, the two agents are initialized from the same backbone and mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

9. 【2605.13798】VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

Link: https://arxiv.org/abs/2605.13798

Authors: Guney Tombak, Ertunc Erdil, Ender Konukoglu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remain anatomically consistent, imaging contrasts, acquisition protocols, medical image analysis, Vision Transformer

Comments:

Click to view abstract

Abstract:Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit-transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR-CT and inter-subject HCP T2w-T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at guneytombak/VoxCor.
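
The nearest-neighbor query over projected voxel features can be sketched as follows. This is only an illustration under assumptions: the abstract does not name the similarity metric, so cosine similarity is a guess, the features are random stand-ins rather than ViT outputs, and `nearest_neighbor_correspondence` is a hypothetical helper.

```python
import numpy as np

def nearest_neighbor_correspondence(src_feats: np.ndarray, tgt_feats: np.ndarray) -> np.ndarray:
    """For each source voxel feature, return the index of the most similar target voxel.

    Similarity is cosine (an assumption; the abstract does not specify the metric).
    """
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    return np.argmax(src @ tgt.T, axis=1)

rng = np.random.default_rng(0)
tgt_feats = rng.normal(size=(200, 16))   # stand-in for one volume's projected features
perm = rng.permutation(200)              # hidden ground-truth correspondence
# The "other modality" is simulated as the same features, shuffled and slightly perturbed.
src_feats = tgt_feats[perm] + 0.01 * rng.normal(size=(200, 16))

matches = nearest_neighbor_correspondence(src_feats, tgt_feats)
print("recovered fraction:", np.mean(matches == perm))
```

A dense similarity matrix like this only scales to small toy inputs; for full volumes one would presumably use an approximate nearest-neighbor index or chunked queries.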

10. 【2605.13794】BlitzGS: City-Scale Gaussian Splatting at Lightning Speed

Link: https://arxiv.org/abs/2605.13794

Authors: Zhongtao Wang, Huishan Au, Yilong Li, Mai Su, Haojie Jin, Yisong Chen, Meng Gai, Fei Zhu, Guoping Wang

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast city-scale reconstruction, reduces active Gaussian, active Gaussian workload, framework shards Gaussians, Gaussian workload

Comments:

Click to view abstract

Abstract:We present BlitzGS, a distributed 3DGS framework that reduces active Gaussian workload for fast city-scale reconstruction. BlitzGS manages this workload at three coupled levels. At the system level, the framework shards Gaussians across GPUs by index parity rather than spatial blocks. This approach mitigates the cross-block visibility redundancy inherent in spatial partitioning. Furthermore, it distributes each rendering step through a single cross-GPU exchange that routes projected Gaussians to their tile owners. At the model level, scheduled importance-scoring passes shrink the global Gaussian population. During these passes, the framework generates a per-Gaussian visibility weight to bias density-control updates toward contributing primitives and a per-view importance mask for the view-level renderer. At the view level, BlitzGS trims each camera's active set with a distance-based LOD gate to exclude excessively fine primitives for the current frustum and the importance-based culling mask to skip Gaussians with negligible cross-view contribution. On large-scale benchmarks, BlitzGS matches the rendering quality of recent large-scale baselines while delivering an order-of-magnitude speedup, training city-scale scenes in tens of minutes. Our code is available at https://github.com/AkierRaee/BlitzGS.
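
A toy comparison of index-based and spatial-block sharding, assuming that "index parity" generalizes to index modulo the number of GPUs (literal parity for two GPUs). Both helper functions and the synthetic 1D scene are hypothetical. The sketch only illustrates how each assignment distributes Gaussians across devices; it does not model visibility redundancy or the cross-GPU exchange described in the abstract.

```python
import numpy as np

def shard_by_index(num_gaussians: int, num_gpus: int) -> np.ndarray:
    """Owner GPU per Gaussian: index modulo num_gpus (parity when num_gpus == 2)."""
    return np.arange(num_gaussians) % num_gpus

def shard_by_spatial_block(x: np.ndarray, num_gpus: int) -> np.ndarray:
    """Baseline for contrast: split the x-extent into equal-width blocks."""
    edges = np.linspace(x.min(), x.max(), num_gpus + 1)
    # Use interior edges only, so block ids fall in [0, num_gpus - 1].
    return np.digitize(x, edges[1:-1])

rng = np.random.default_rng(0)
# Toy scene: Gaussian centers clustered toward one side of the x-axis.
x = rng.exponential(scale=1.0, size=10_000)

index_counts = np.bincount(shard_by_index(x.size, 4), minlength=4)
block_counts = np.bincount(shard_by_spatial_block(x, 4), minlength=4)
print("index sharding  :", index_counts)
print("spatial sharding:", block_counts)
```

Index sharding gives every GPU the same number of primitives regardless of where they sit in the scene, whereas equal-width spatial blocks inherit the scene's density skew.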

11. 【2605.13778】Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

Link: https://arxiv.org/abs/2605.13778

Authors: Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang, Tiancai Wang, Xing Hu, Ying Wang, Huawei Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: model Action Expert, full inference, fundamentally limited, limited in real-time, real-time deployment

Comments:

Click to view abstract

Abstract:Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
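
As a rough sanity check on the latency figures above, the sketch below assumes that every replanning round is either a 58.0 ms full-inference round or a 7.8 ms speculative round. That is a simplification, since the abstract gives 7.8 ms only as the fastest speculative case. Under this assumption it solves for the share of full rounds implied by the 19.1 ms average; the function name is illustrative.

```python
def implied_full_round_share(full_ms: float, spec_ms: float, avg_ms: float) -> float:
    """Share p of full-inference rounds solving p*full + (1 - p)*spec = avg."""
    return (avg_ms - spec_ms) / (full_ms - spec_ms)

FULL_MS, SPEC_MS, AVG_MS = 58.0, 7.8, 19.1  # figures quoted in the abstract

share = implied_full_round_share(FULL_MS, SPEC_MS, AVG_MS)
speedup = FULL_MS / AVG_MS
print(f"implied full-round share: {share:.3f}")          # 0.225
print(f"speedup over always-full inference: {speedup:.2f}x")  # 3.04x
```

Under the two-latency assumption, roughly one round in four or five would still run the full pipeline, and the ratio 58.0 / 19.1 reproduces the reported 3.04x speedup.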

12. 【2605.13775】RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

Link: https://arxiv.org/abs/2605.13775

Authors: Harold Haodong Chen, Sirui Chen, Yingjie Xu, Wenhang Ge, Ying-Cong Chen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: task-aligned physical interaction, physical interaction data, scalability of robotic, robotic manipulation, manipulation is fundamentally

Comments: On-going work

Click to view abstract

Abstract:The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds, a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.

13. 【2605.13755】Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception

Link: https://arxiv.org/abs/2605.13755

Authors: Arka Bhowmick, Enes Ozeren, Ahmed Abdullah, Oliver Wasenmuller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent years, significantly increased, autonomous driving, safety-critical scenarios, high-quality data

Comments: Published at SAIAD 2026 Workshop at CVPR 2026

Abstract:In recent years, autonomous driving has significantly increased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real-world datasets struggle to meet this demand as requirements continuously evolve and large-scale annotated data collection remains costly and time-consuming, making synthetic data a scalable, practical and controllable alternative. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This approach enables scalable appearance-level asset diversification without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary experiments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversification improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI enables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.

14. 【2605.13753】Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein

Link: https://arxiv.org/abs/2605.13753

Authors: Ashkan Shahbazi, Xinran Liu, Ping He, Soheil Kolouri

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalized Sliced Gromov, propose min Generalized, Sliced Gromov, Wasserstein, expressive generalized slicers

Comments: (none)

Abstract:We propose min Generalized Sliced Gromov--Wasserstein (min-GSGW), a sliced formulation for the Gromov--Wasserstein (GW) problem using expressive generalized slicers. The key idea is to learn coupled nonlinear slicers that assign compatible push-forward values to both input measures, so that monotone coupling in the projected domain lifts to a transport plan evaluated against the GW objective in the original spaces. The resulting plan induces a GW objective value, and min-GSGW minimizes this cost directly in the original spaces. We further show that min-GSGW is rigid-motion invariant, a crucial property for geometric matching and shape analysis tasks. Our contributions are threefold: 1) we introduce generalized slicers into the sliced GW framework; 2) we construct an efficient slicing-based GW transport plan; and 3) we develop an amortized variant that replaces per-instance optimization with a learned slicer for unseen input pairs. We perform experiments on animal mesh matching, horse mesh interpolation, and ShapeNet part transfer. Results show that min-GSGW produces meaningful geometric correspondences and GW objective values at substantially lower computational cost than existing GW solvers.

15. 【2605.13746】Weakly-Supervised Spatiotemporal Anomaly Detection

Link: https://arxiv.org/abs/2605.13746

Authors: Urvi Gianchandani, Praveen Tirupattur, Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: weakly supervised method, weakly supervised, supervised method, UCF Crime Dataset, anomaly detection

Comments: (none)

Abstract:In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). To apply MIL, we represent anomalous and normal video clips as positive and negative bags, respectively. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
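
The abstract does not spell out the ranking loss, so the following is only a minimal sketch of the hinge-style multiple-instance ranking objective commonly used in weakly supervised video anomaly detection, assumed here for illustration. Each bag holds anomaly scores for the spatiotemporal regions of one video; the loss asks the top-scoring region of an anomalous (positive) bag to outrank the top-scoring region of a normal (negative) bag by a margin. The function name, the margin value, and the example scores are hypothetical, and any extra regularizers the authors may use are omitted.

```python
from typing import Sequence


def mil_ranking_loss(pos_bag: Sequence[float],
                     neg_bag: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss between the top-scoring instance of each bag.

    pos_bag: region scores from a video labeled anomalous.
    neg_bag: region scores from a video labeled normal.
    """
    if not pos_bag or not neg_bag:
        raise ValueError("both bags must contain at least one instance score")
    top_pos = max(pos_bag)  # most anomalous-looking region in the positive bag
    top_neg = max(neg_bag)  # hardest false alarm in the negative bag
    return max(0.0, margin - top_pos + top_neg)


# Illustrative scores only; not taken from the paper.
anomalous_video = [0.10, 0.90, 0.20, 0.15]
normal_video = [0.10, 0.20, 0.05, 0.12]
loss = mil_ranking_loss(anomalous_video, normal_video)
print(f"ranking loss: {loss:.2f}")
```

Only video-level labels enter the loss: which region of the anomalous video is actually anomalous is never supplied, and the max over the bag selects it implicitly.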

16. 【2605.13744】Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration

Link: https://arxiv.org/abs/2605.13744

Authors: Feiyu Tan, Qi Xie, Zongben Xu, Deyu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherently ill-posed, Image restoration, ill-posed inverse, symmetry, data symmetry

Comments: 30 pages, 9 figures, Supplementary Material can be found at [this https URL](https://github.com/tanfy929/SA-Conv)

Abstract:Image restoration is an inherently ill-posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill-posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real-world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model-data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non-strict symmetry at the dataset level (rather than sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that the equivariance of restoration models can be derived naturally from this inverse problem once it incorporates the proposed symmetry constraints, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network's empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias-variance trade-off, minimizing the total expected risk. Guided by these insights, we propose a Sample-Adaptive Equivariant Network that uses a hypernetwork and transformation-learnable equivariant convolutions to dynamically align with each sample's inherent symmetry. Extensive experiments on super-resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at https://github.com/tanfy929/SA-Conv.

17. 【2605.13741】LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

Link: https://arxiv.org/abs/2605.13741

Authors: Christina Kassab, Hyeonjae Gil, Matías Mattamala, Ayoung Kim, Maurice Fallon

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: providing hierarchical geometric, robot navigation, providing hierarchical, Scene, Scene graphs

Comments: (none)

Abstract:Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed -- enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from Habitat-Matterport 3D and on self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: this https URL.

18. 【2605.13730】Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

Link: https://arxiv.org/abs/2605.13730

Authors: Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: first-line imaging modality, diagnostic performance varies, Transthoracic echocardiography, bicuspid aortic valve, diagnosing bicuspid aortic

Comments: (none)

Abstract:Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N{=}90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of $0.907$ and recall of $0.877$. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

19. 【2605.13729】Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Link: https://arxiv.org/abs/2605.13729

Authors: Deli Cai, Haoyang Ma, Changxing Ding

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Trajectory-controlled human motion, synthesize realistic human, Trajectory-controlled human, realistic human motions, human motions conditioned

Comments: (none)

Abstract:Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under the guidance of the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

20. 【2605.13724】AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Link: https://arxiv.org/abs/2605.13724

Authors: Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significantly advanced, any-step video diffusion, sampling, any-step video, video diffusion

Comments: Project page at [this https URL](https://nvlabs.github.io/AnyFlow/)

Abstract:Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance that matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
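
To make the difference between a flow-map transition $(z_{t}\rightarrow z_{r})$ and an Euler rollout concrete, here is a toy sketch on a scalar linear ODE whose flow map is known in closed form. It is not AnyFlow's method: in the paper the flow map is a learned network over video latents, whereas the ODE `dz/dt = A * z`, the constant `A`, and all function names below are assumptions chosen purely for illustration.

```python
import math

A = 1.0  # coefficient of the toy ODE dz/dt = A * z (hypothetical)


def flow_map(z: float, t: float, r: float) -> float:
    """Exact transition z_t -> z_r for the toy ODE, for any pair of times."""
    return z * math.exp(A * (r - t))


def euler_rollout(z: float, t: float, r: float, steps: int) -> float:
    """Integrate the same ODE from t to r with a fixed number of Euler steps."""
    h = (r - t) / steps
    for _ in range(steps):
        z = z + h * A * z
    return z


z_start = 2.0
exact = flow_map(z_start, 1.0, 0.0)                          # one jump, t=1 -> r=0
two_hops = flow_map(flow_map(z_start, 1.0, 0.4), 0.4, 0.0)   # two shorter jumps
err_4 = abs(euler_rollout(z_start, 1.0, 0.0, 4) - exact)
err_64 = abs(euler_rollout(z_start, 1.0, 0.0, 64) - exact)

print(f"one jump vs. two jumps agree: {math.isclose(exact, two_hops)}")
print(f"Euler error with 4 steps:  {err_4:.4f}")
print(f"Euler error with 64 steps: {err_64:.4f}")
```

Flow-map transitions compose consistently over any partition of the time interval, while a coarse Euler rollout carries discretization error that shrinks only as steps are added. That is the intuition behind targeting transitions over arbitrary intervals rather than a fixed step count.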

21. 【2605.13713】Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization

Link: https://arxiv.org/abs/2605.13713

Authors: Isabella Poles, Simon Arberet, Riqiang Gao, Martin Kraus, Marco D. Santambrogio, Florin C. Ghesu, Ali Kamen, Dorin Comaniciu

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Volumetric Modulated Arc, Modulated Arc Therapy, modern radiation therapy, Volumetric Modulated, Arc Therapy

Comments: Early Accept at MICCAI 2026

Abstract:Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet, its planning solves inverse and nested optimization for multi-leaf collimators, monitor units and dose parameters, while enforcing their consistency to ensure mechanical deliverability. Moreover, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.

22. 【2605.13688】MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

Link: https://arxiv.org/abs/2605.13688

Authors: Cenwei Zhang, Suncheng Xiang, Lei You

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Medical segmentation foundation, provide strong prompt-driven, Medical segmentation, clinical settings, image encoders

Comments: 3 figures, 17 pages

Abstract:Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at this https URL.
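
The boundary leverage principle stated above (boundary displacement controlled by the logit perturbation on the boundary divided by the logit spatial gradient) can be illustrated in one dimension. The sketch below is an illustration under assumptions rather than the paper's formulation: it uses a hypothetical linear logit profile across a boundary, adds a constant perturbation standing in for compression error, and compares the first-order estimate with the measured shift of the zero crossing. All names and numbers are made up for the example.

```python
def first_order_displacement(logit_perturbation: float, logit_gradient: float) -> float:
    """Estimated boundary shift: |perturbation| / |spatial gradient of the logit|."""
    return abs(logit_perturbation) / abs(logit_gradient)


def zero_crossing(f, lo: float, hi: float, iters: int = 60) -> float:
    """Bisection for f(x) = 0 on [lo, hi]; assumes f changes sign on the interval."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def measured_displacement(slope: float, delta: float, x0: float = 10.0) -> float:
    """Shift of the decision boundary when a linear logit is offset by delta."""
    original = lambda x: slope * (x - x0)
    perturbed = lambda x: slope * (x - x0) + delta
    return abs(zero_crossing(perturbed, 0.0, 20.0) - zero_crossing(original, 0.0, 20.0))


delta = 0.5  # same logit perturbation in both cases (hypothetical)
for slope in (5.0, 0.25):  # sharp vs. flat logit profile across the boundary
    estimate = first_order_displacement(delta, slope)
    measured = measured_displacement(slope, delta)
    print(f"gradient {slope}: estimated shift {estimate:.2f}, measured {measured:.2f}")
```

The same perturbation moves a flat boundary twenty times farther than a sharp one in this toy case. Such a thin shift changes few pixels relative to the whole mask, which is consistent with the abstract's remark that boundary metrics can degrade while Dice stays high.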

23. 【2605.13686】Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

Link: https://arxiv.org/abs/2605.13686

Authors: Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enables virtual scanning, target imaging modality, translation enables virtual, Latent Diffusion Model, virtual scanning

Comments: (none)

Abstract:Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

24. 【2605.13675】Characterizing Universal Object Representations Across Vision Models

Link: https://arxiv.org/abs/2605.13675

Authors: Florian P. Mahner, Johannes Roth, Ka Chun Lam, Michael F. Bonner, Francisco Pereira, Martin N. Hebart

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Keywords: similar visual representations, converge on similar, similar visual, dimensions, neural networks trained

Comments: (none)

Abstract:Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.

25. 【2605.13674】Weakly Supervised Segmentation as Semantic-Based Regularization

Link: https://arxiv.org/abs/2605.13674

Authors: Stefano Colamonaco, Andrei-Bogdan Florea, Jaron Maene

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Weakly supervised semantic, trains dense pixel-level, dense pixel-level segmentation, bounding boxes, image-level tags

Comments: (none)

Abstract:Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.

26. 【2605.13672】SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Link: https://arxiv.org/abs/2605.13672

Authors: Giries Abu Ayoub, Morad Tukan, Loay Mualem

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited labeled data, evaluations implicitly assume, labeled data, implicitly assume, assume that target

Comments: (none)

Abstract:Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

27. 【2605.13670】Pattern-Enhanced RT-DETR for Multi-Class Battery Detection

Link: https://arxiv.org/abs/2605.13670

Authors: Xu Zhong, Enyuan Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: electronic waste recycling, automated sorting systems, Accurate and efficient, industrial quality control, efficient battery detection

Comments: 4 pages, 3 figures

Abstract:Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.

28. 【2605.13667】SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Link: https://arxiv.org/abs/2605.13667

Authors: Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast graph prediction, videos remains challenging, Scene graph generation, compact structured representation, remains challenging

Comments: (none)

Abstract:Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: this https URL.

29. 【2605.13664】HADAR-Based Thermal Infrared Hyperspectral Image Restoration

Link: https://arxiv.org/abs/2605.13664

Authors: Cheng Dai, Jiale Lin, Bingxuan Song, Yifei Chen, Jiashuo Chen, Xin Yuan, Fanglin Bao

Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Keywords: critical scene information, hyperspectral imagery, critical scene, scene information, Thermal-infrared

Comments: 17 pages, 18 figures

Abstract:Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.

30. 【2605.13632】Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.13632

Authors: Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables spatially steerable, propose GTA-VLA, explicit visual cues, spatially steerable embodied, enables spatially

Comments: (none)

Abstract: In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: this https URL

31. 【2605.13621】WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

Link: https://arxiv.org/abs/2605.13621

Authors: Chunjin Yang, Xiwei Zhang, Yiming Xiao, Fanman Meng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared-visible object detection, combining complementary features, Infrared-visible object, multispectral images, combining complementary

Abstract:Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.
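
The split into low- and high-frequency components can be illustrated with a one-level 2D Haar transform. This is a minimal sketch, assuming a Haar wavelet and a plain nested-list image with even height and width; the abstract does not say which wavelet WD-FQDet uses, and the function and sub-band names here are illustrative only.

```python
def haar2d(image):
    """One-level orthonormal 2D Haar transform.

    Returns (low, col_detail, row_detail, diag_detail); each sub-band has
    half the height and half the width of the input.
    """
    low, col_detail, row_detail, diag_detail = [], [], [], []
    for i in range(0, len(image), 2):
        r_low, r_col, r_row, r_diag = [], [], [], []
        for j in range(0, len(image[0]), 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            r_low.append((a + b + c + d) / 2.0)    # local average (low frequency)
            r_col.append((a - b + c - d) / 2.0)    # left column minus right column
            r_row.append((a + b - c - d) / 2.0)    # top row minus bottom row
            r_diag.append((a - b - c + d) / 2.0)   # diagonal difference
        low.append(r_low)
        col_detail.append(r_col)
        row_detail.append(r_row)
        diag_detail.append(r_diag)
    return low, col_detail, row_detail, diag_detail
```

On a constant image all three detail bands are zero, so everything that survives in them is edge or texture content. That is the intuition behind treating the low band as the place to align features shared by both modalities and the detail bands as the place to keep features specific to one modality.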

32. 【2605.13604】Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

Link: https://arxiv.org/abs/2605.13604

Authors: Chanyoung Kim, Donghyun Kim, Dong-Hyun Sim, Seong Jae Hwang, Youngjoong Kwon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: skeleton is encoded, hand pose estimation, Graph convolutional networks, pose estimation, hand

Abstract:Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
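
MPJPE (mean per-joint position error), the metric quoted above, is the average Euclidean distance between predicted and ground-truth joints. A minimal sketch follows, assuming joints are given as sequences of 3D coordinates in millimetres; the function name is illustrative, not taken from the paper's code.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between corresponding joints (same units as input)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)
```

For scale, the reported drop from 12.36 mm to 10.09 mm is a relative error reduction of about 18.4%, since (12.36 - 10.09) / 12.36 ≈ 0.184.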

33. 【2605.13600】Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

Link: https://arxiv.org/abs/2605.13600

Authors: Lovre Antonio Budimir, Yushi Guan, Steve Ryhner, Sven Lončarić, Nandita Vijaykumar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Language Gaussian Splatting, Gaussian Splatting, Splatting, language-aligned visual features, Splatting with language-aligned

Comments: 18 pages (9 pages main paper), 10 figures, preprint

Abstract:3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
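
The uplifting step described above (weighted accumulation of sparse coefficients per Gaussian, then Top-K filtering) can be sketched as follows. This is an illustration under stated assumptions, not the SCOUP implementation: the data layout, the function name `uplift_topk`, and ranking by absolute coefficient value are all hypothetical choices.

```python
from collections import defaultdict


def uplift_topk(associations, region_codes, k):
    """Accumulate sparse 2D codes onto Gaussians and keep the k largest per Gaussian.

    associations: iterable of (gaussian_id, region_id, weight) gathered over all views.
    region_codes: dict mapping region_id -> {atom_index: coefficient}.
    Returns a dict mapping gaussian_id -> list of (atom_index, accumulated_coefficient).
    """
    accumulated = defaultdict(lambda: defaultdict(float))
    for gaussian_id, region_id, weight in associations:
        for atom, coefficient in region_codes[region_id].items():
            accumulated[gaussian_id][atom] += weight * coefficient

    result = {}
    for gaussian_id, atoms in accumulated.items():
        ranked = sorted(atoms.items(), key=lambda item: abs(item[1]), reverse=True)
        result[gaussian_id] = ranked[:k]
    return result
```

Because each Gaussian ends up storing only `k` (index, coefficient) pairs rather than a dense embedding, storage grows with `k` instead of with the feature dimension, which is the efficiency argument the abstract makes.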

34. 【2605.13591】Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

Link: https://arxiv.org/abs/2605.13591

Authors: Kaicong Huang, Talha Azfar, Weisong Shi, Ruimin Ke

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reliable autonomous driving, Reliable autonomous, autonomous driving relies, relies on large-scale, robust models

Abstract: Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, a unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim's capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.

35. 【2605.13586】HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation

Link: https://arxiv.org/abs/2605.13586

Authors: Zini Chen, Junming Huang, Rong Zhang, Jiamin Xu, Cheng Peng, Chi Wang, Weiwei Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: constructing high-fidelity simulation, high-fidelity simulation environments, Generating controllable, physically plausible indoor, controllable and physically

Abstract: Generating controllable and physically plausible indoor scenes is a pivotal prerequisite for constructing high-fidelity simulation environments for embodied AI. However, existing deep-learning-based methods usually treat all objects as homogeneous instances within a unified generation process. While effective for sparse and simplistic layouts, they struggle to model realistic layouts with dense object arrangements and complex spatial dependencies, leading to limited scalability and degraded physical plausibility. To deal with these challenges, we revisit indoor layout generation from the perspective of structural heterogeneity and decompose the objects into primary objects and secondary objects according to their distinct roles in shaping a scene. Based on this decomposition, we propose HetScene, a heterogeneous two-stage generation framework that decouples indoor layout synthesis into Structural Layout Generation (SLG) and Contextual Layout Generation (CLG). SLG first generates globally coherent structural layouts with only primary objects conditioned on text descriptions, top-down binary room masks, and spatial relation graphs, establishing a stable global macro-skeleton of large core furniture.

36. 【2605.13583】Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging

Link: https://arxiv.org/abs/2605.13583

Authors: Wudi Chen, Zhiyuan Zha, Xin Yuan, Shigang Wang, Bihan Wen, Jiantao Zhou, Gang Yan, Zipei Fan, Ce Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems show great, coded aperture snapshot, show great potential, Recent advances, snapshot spectral imaging

Comments: 15 pages, 10 figures, accepted by ICML 2026

Abstract:Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: this https URL.
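
To show why a single 2D measurement entangles all spectral bands, here is a minimal sketch of a commonly used simplified CASSI forward model: each band is masked by the coded aperture, shifted by a band-dependent offset, and summed on the sensor. The noise-free setting, the one-pixel-per-band `shift`, and the function name are assumptions for illustration; the paper's exact measurement model may differ.

```python
def cassi_forward(cube, mask, shift=1):
    """Simulate a noise-free snapshot measurement.

    cube[k][i][j]: intensity of spectral band k at pixel (i, j).
    mask[i][j]: coded-aperture transmission at pixel (i, j).
    Band k is shifted by k * shift columns before the bands are summed.
    """
    bands, height, width = len(cube), len(cube[0]), len(cube[0][0])
    measurement = [[0.0] * (width + (bands - 1) * shift) for _ in range(height)]
    for k in range(bands):
        for i in range(height):
            for j in range(width):
                measurement[i][j + k * shift] += mask[i][j] * cube[k][i][j]
    return measurement
```

Reconstruction has to invert this many-to-one mapping. Methods with a fixed output size recover only the discrete bands in `cube`, whereas the continuous spectral field described above is meant to be queried at arbitrary wavelengths.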

37. 【2605.13581】HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

Link: https://arxiv.org/abs/2605.13581

Authors: Li Pang, Heng Zhao, Yijia Zhang, Deyu Meng, Xiangyong Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: resolution loss, crucial for reliable, suffer from degradations, real HSIs suffer, Hyperspectral image

Abstract:Hyperspectral image (HSI) restoration is crucial for reliable analysis, as real HSIs suffer from degradations like noise, blur, and resolution loss. However, existing models trained on source data often fail on target domains lacking clean references, a common occurrence in practice. To address this issue, we present HIR-ALIGN, a plug-and-play target-adaptive augmentation framework that enhances hyperspectral image restoration by augmenting limited training images with synthetic data that closely matches the target distribution using no extra data. It consists of three stages: (i) proxy generation, where off-the-shelf restoration models restore degraded target observations to produce semantics-preserving proxy HSIs that approximate target-domain clean images; (ii) distribution-adaptive synthesis, where a blur-robust unCLIP diffusion model generates target-aligned RGBs from proxy RGBs, with prompt conditioning and embedding-space noise initialization. Then, a warp-based spectral transfer module synthesizes HSIs by aligning each generated RGB with the proxy RGB, estimating soft patch-wise transport weights, and applying these weights and learnable local interpolation kernels to the proxy HSI; and (iii) aligned supervised finetuning, where restoration networks pretrained on the source distribution are finetuned using both the proxy HSIs and synthesized target-aligned HSIs, and are then deployed on degraded target images. We further provide theoretical analysis showing that augmentation-based finetuning can achieve lower target-domain restoration risk by jointly improving target distribution coverage and controlling spectral bias. Extensive experiments on simulated and real datasets across denoising and super-resolution tasks demonstrate that HIR-ALIGN consistently improves source-only supervised baselines, outperforming both source-only counterparts and representative unsupervised methods.

38. 【2605.13565】Qwen-Image-VAE-2.0 Technical Report

Link: https://arxiv.org/abs/2605.13565

Authors: Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Yuxiang Chen, Zhendong Wang, Zihao Liu, Zikai Zhou, Yiliang Gu, Yi Wang, Xiaoxiao Xu, Lin Qu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: (not available)

39. 【2605.13544】CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

Link: https://arxiv.org/abs/2605.13544

Authors: Hanwen Zhang, Yao Liu, Die Dai, Jiaye Yang, Qiao Liu, Yutong Xie, Peng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fine-grained Vision-Language Pre-training, aligning anatomy-level visual, Fine-grained Vision-Language, Vision-Language Pre-training, anatomy-level visual representations

Abstract:Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

40. 【2605.13530】Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Link: https://arxiv.org/abs/2605.13530

Authors: Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Surgical scene understanding, computer-assisted intervention, cornerstone of computer-assisted, scene understanding, Surgical scene

Abstract:Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

41. 【2605.13517】ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

Link: https://arxiv.org/abs/2605.13517

Authors: Jaeyung Kim, YoungJoon Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Quantized Variational Autoencoder, Vector Quantized Variational, Variational Autoencoder, Quantized Variational, Additive Margin VQ-VAE

Comments: To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: this https URL
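
The two ingredients named above can be pictured with a small sketch. Both functions are assumptions made for illustration rather than the paper's formulation: the ball constraint is shown as a hard projection (the abstract describes a regularizer, which may be a soft penalty), and the margin is shown in the common additive-angular form cos(θ + m).

```python
import math


def project_to_ball(vector, radius):
    """Scale a vector back onto the ball of the given radius if it lies outside."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= radius or norm == 0.0:
        return list(vector)
    scale = radius / norm
    return [x * scale for x in vector]


def angular_margin_cosine(u, v, margin):
    """Return cos(theta + margin), where theta is the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.cos(math.acos(cos_theta) + margin)
```

Adding a margin to the angle lowers the similarity score of a matching pair, so a loss built on it must push the pair closer in angle than it otherwise would. That is the usual mechanism by which angular-margin objectives spread vectors more evenly over a sphere.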

42. 【2605.13493】PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors

Link: https://arxiv.org/abs/2605.13493

Authors: Jiaxin Yang, Yu Hou, Muxin Liu, Weixuan Liu, Ze Yuan, Zeming Chen, Zhongrui Wang, Xiaojuan Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single RGB image, single RGB, general-purpose image editors, editors predict physical, predict physical maps

Comments: 48 pages, 12 figures, including references, appendix, and supplementary benchmark details

Abstract:Can general-purpose image editors predict physical maps from a single RGB image? General-purpose image editors differ from standard task-specific dense-prediction models: they do not directly take an image and output a physical map. Instead, they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark to evaluate and standardize image editors in dense physical-map prediction that covers five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate. We use OpenRooms-FF for depth, surface normal, albedo, and roughness, InteriorVerse as an additional source for depth, normal, albedo, and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score, therefore, reflects the performance of a model under a specified protocol, rather than its best possible performance under all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, and stronger image editors can produce more reasonable map-like outputs. For roughness and metallic, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.

43. 【2605.13476】Neural Video Compression with Domain Transfer

Link: https://arxiv.org/abs/2605.13476

Authors: Tiange Zhang, Rongqun Lin, Xiandong Meng, Haofeng Wang, Xing Tian, Qi Zhang, Siwei Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural video coding, Content-adaptive compression, aiming to mitigate, key direction, neural video compression

Comments: Accepted to ISCAS 2026 as an oral paper

Abstract:Content-adaptive compression has always been a key direction in neural video coding (NVC), aiming to mitigate the domain gap between training and testing data. Such gaps often arise from distributional discrepancies between training and inference data, which may cause noticeable performance degradation when the testing content differs from the training distribution. To tackle this challenge, we propose DCVC-DT, a domain transfer enhanced neural video compression framework. Specifically, we design a lightweight online domain transfer (DT) mechanism that dynamically adapts the encoded latent representation during inference, effectively bridging the domain gap without modifying the encoder or decoder parameters. In addition, we develop a frame-level dynamic RD (Rate and Distortion) adjustment scheme that actively regulates the ratio of R and D in the loss function based on quality fluctuation, thereby improving rate-distortion performance. Extensive experiments demonstrate that DCVC-DT achieves up to 6.21% bitrate savings over the baseline DCVC-DC, while significantly enhancing generalization to unseen testing data and alleviating error propagation. Our code is available at this https URL.

44. 【2605.13475】FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Link: https://arxiv.org/abs/2605.13475

Authors: Huan Wang, Jun Shen, Haoran Li, Zhenyu Yang, Jun Yan, Ousman Manjang, Yanlong Zhai, Di Wu, Guansong Pang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables collaborative training, enables collaborative, protecting privacy, collaborative training, training of distributed

Comments: 23 pages, Accepted at ICML 2026

Abstract: Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at this https URL.

45. 【2605.13465】Z-Order Transformer for Feed-Forward Gaussian Splatting

Link: https://arxiv.org/abs/2605.13465

Authors: Can Wang, Lei Liu, Wei Jiang, Dong Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, enabled significant progress, feed-forward Gaussian Splatting, Gaussian, significant progress

Comments: Accepted by CVPR 2026, Oral

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
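
A Z-order (Morton) curve turns an unordered 3D point set into a sequence in which points that are near each other on the curve are usually close in space. The sketch below shows the standard bit-interleaving construction. Applying it to Gaussian centres, the 10-bit grid, and the helper names are assumptions for illustration, since the abstract does not describe the paper's exact ordering procedure.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative integer coordinates."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def zorder_sort(points, bits=10):
    """Sort 3D points by the Morton code of their quantized coordinates."""
    lows = [min(p[axis] for p in points) for axis in range(3)]
    highs = [max(p[axis] for p in points) for axis in range(3)]
    cells = (1 << bits) - 1

    def key(point):
        quantized = []
        for axis in range(3):
            span = highs[axis] - lows[axis]
            t = (point[axis] - lows[axis]) / span if span > 0 else 0.0
            quantized.append(int(t * cells))
        return morton3d(quantized[0], quantized[1], quantized[2], bits)

    return sorted(points, key=key)
```

Once the points are in this order, attention restricted to a window of neighbouring sequence positions mostly covers spatially nearby primitives, which is what makes a sparse attention pattern workable on an otherwise unstructured set.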

46. 【2605.13457】OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

Link: https://arxiv.org/abs/2605.13457

Authors: Chengyan Deng, Pengbin Yu, Zhentao Chen, Wei Shen, Kai Zhang, Meng Li, Lunxi Yuan, Xue Zhou, Li Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: directly super-resolving images, extreme memory consumption, Diffusion-based real-world image, real-world image super-resolution, achieved remarkable perceptual

Abstract:Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
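
Periodic artifacts show up as strong peaks in a signal's autocorrelation at non-zero lags. The sketch below measures that on a 1D signal, such as a single image row. It is a diagnostic illustration under stated assumptions and not the paper's periodicity loss $\mathcal{L}_\text{AP}$; the function names and the lag range are hypothetical.

```python
def normalized_autocorrelation(signal, lag):
    """Autocorrelation of the mean-removed signal at `lag`, divided by its energy."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    energy = sum(c * c for c in centered)
    if energy == 0.0:
        return 0.0
    overlap = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return overlap / energy


def periodicity_score(signal, min_lag, max_lag):
    """Largest normalized autocorrelation over the candidate lags."""
    return max(
        normalized_autocorrelation(signal, lag)
        for lag in range(min_lag, max_lag + 1)
    )
```

A strictly repeating pattern gives a score near 1 at its period. Smooth content such as a gradient is also strongly autocorrelated at small lags, so a score like this is only informative for lags well above the scale of natural image structure.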

47. 【2605.13455】Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

Link: https://arxiv.org/abs/2605.13455

Authors: Shashwat Kumar, Dominic M. Padova, Binish Narang, Gabrielle I. Coste, Austin R. Graves, Richard L. Huganir, Adam S. Charles, Michael I. Miller, Anuj Srivastava

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: densely packed submicron, packed submicron structures, memory formation, densely packed, packed submicron

Abstract: Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal in vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal in vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
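
The observation part of such a model (point sources blurred by a Gaussian PSF, with Poisson-distributed photon counts) can be sketched in one dimension. This is a minimal sketch that leaves out the diffeomorphic deformation and the priors entirely; the 1D setting, the constant `background` rate, and the function names are assumptions for illustration.

```python
import math


def expected_counts(n_pixels, sources, sigma, background):
    """Expected photon count per pixel.

    sources: list of (position, intensity) point sources.
    sigma: standard deviation of the unit-area Gaussian PSF, in pixels.
    background: constant positive background rate per pixel.
    """
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    rates = []
    for x in range(n_pixels):
        rate = background
        for position, intensity in sources:
            rate += intensity * norm * math.exp(-0.5 * ((x - position) / sigma) ** 2)
        rates.append(rate)
    return rates


def poisson_log_likelihood(counts, rates):
    """Sum over pixels of log P(count | rate) under independent Poisson noise."""
    return sum(
        k * math.log(rate) - rate - math.lgamma(k + 1)
        for k, rate in zip(counts, rates)
    )
```

Comparing `poisson_log_likelihood` for different candidate source positions and intensities is the basic operation that a posterior over synapse locations builds on. The full method additionally warps the coordinates with the estimated tissue deformation before evaluating the rates.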

48. 【2605.13403】RotVLA: Rotational Latent Action for Vision-Language-Action Model

Link: https://arxiv.org/abs/2605.13403

Authors: Qiwei Li, Xicheng Gong, Xinghang Li, Peiyan Li, Quanyun Zhou, Hangjun Ye, Jiahuan Zhou, Yadong Mu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: handling heterogeneous datasets, space across embodiments, effective paradigm, paradigm for handling, handling heterogeneous

Notes

Click to view abstract

Abstract:Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.

49. 【2605.13402】Fast and Compact Graph Cuts for the Boykov-Kolmogorov Algorithm

Link: https://arxiv.org/abs/2605.13402

Authors: Christian Møller Mikkelstrup, Anders Bjorholm Dahl, Philip Bille, Vedrana Andersen Dahl, Inge Li Gørtz

Categories: Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS)

Keywords: computer vision problems, vision problems, Computing a minimum, wide range, range of computer

Notes: 15 pages, 6 figures, submitted to the IEEE for possible publication

Click to view abstract

Abstract:Computing a minimum $s$-$t$ cut in a graph is a solution to a wide range of computer vision problems, and is often done using the Boykov-Kolmogorov (BK) algorithm. In this paper, we revisit the BK algorithm from both a theoretical and practical point of view. We improve the analysis of the time complexity of the BK algorithm to $O(mn|C|)$ and propose a new algorithm, the fast and compact BK (fcBK) algorithm, with a time complexity of $O(m|C|)$, where $m$, $n$, and $|C|$ are the number of edges, the number of vertices, and the capacity of the cut, respectively. We additionally propose a compact graph representation that allows our implementation to find a minimum $s$-$t$ cut in a graph with upwards of $10^9$ vertices and $10^{10}$ edges on a machine with 128 GB of memory. We find our implementation to be the fastest available implementation of the BK algorithm when evaluated on a comprehensive set of benchmark datasets, highlighting the importance of memory-efficient implementations. We make our implementations publicly available for further research and implementation development within minimum $s$-$t$ cut algorithms.
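
To make the quantities $m$, $n$, and $|C|$ concrete, here is a minimal sketch that computes a minimum $s$-$t$ cut on a toy graph. It uses the textbook Edmonds-Karp augmenting-path method with a dense residual matrix, which is neither the BK algorithm nor the paper's fcBK or its compact representation; the function name `min_st_cut` and the example graph are illustrative assumptions.

```python
from collections import deque


def min_st_cut(n, edges, s, t):
    """Return (cut_capacity, source_side) for a directed graph with n vertices.

    edges is a list of (u, v, capacity) triples with integer capacities.
    Assumes s != t. Uses Edmonds-Karp (BFS augmenting paths), not BK.
    """
    # Dense residual matrix: chosen for clarity, not memory efficiency.
    residual = [[0] * n for _ in range(n)]
    for u, v, cap in edges:
        residual[u][v] += cap

    def bfs_parents():
        """BFS from s over edges that still have residual capacity."""
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if parent[t] == -1:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path s -> t.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Vertices still reachable from s in the residual graph form the source side.
    source_side = {v for v in range(n) if parent[v] != -1}
    return flow, source_side


# Toy graph: m = 5 edges, n = 4 vertices, source 0, sink 3.
edges = [(0, 1, 3), (0, 2, 2), (1, 3, 2), (2, 3, 3), (1, 2, 1)]
cut_value, source_side = min_st_cut(4, edges, 0, 3)
print(cut_value, source_side)
```

On this toy graph the cut capacity $|C|$ is 5 and only the source lies on the source side, since both edges leaving vertex 0 are saturated.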

50. 【2605.13396】PreFIQs: Face Image Quality Is What Survives Pruning

Link: https://arxiv.org/abs/2605.13396

Authors: Jan Niklas Kolf, Guray Ozgur, Andrea Atzori, Žiga Babnik, Vitomir Štruc, Naser Damer, Fadi Boutros

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Quality Assessment, automated face recognition, Pruning Identified Exemplar, Quality Assessment, Identified Exemplar

Notes: Accepted at CVPR 2026 Workshops

Click to view abstract

Abstract:Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
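
The drift measure described in the abstract, the Euclidean distance between L2-normalized embeddings from a model and its pruned counterpart, can be sketched in a few lines. This is an illustration only: the function names and the two-dimensional embeddings are hypothetical, and how the paper turns the drift into a final quality score is not specified here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def embedding_drift(dense_emb, pruned_emb):
    """Euclidean distance between the two L2-normalized embeddings."""
    a = l2_normalize(dense_emb)
    b = l2_normalize(pruned_emb)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical embeddings of one face image from the dense and pruned models.
stable = embedding_drift([1.0, 0.0], [2.0, 0.0])   # same direction after pruning
fragile = embedding_drift([1.0, 0.0], [0.0, 1.0])  # direction changed by pruning
print(stable, fragile)
```

Because both embeddings are normalized first, only a change in direction contributes to the drift. Under the paper's hypothesis, the larger drift of the second example would indicate a lower-utility image.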

51. 【2605.13395】Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation

Link: https://arxiv.org/abs/2605.13395

Authors: Lilin Zhang, Yimo Guo, Yue Li, Jiancheng Shi, Xianggen Liu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep neural networks, degrade model performance, significantly degrade model, Deep neural, model performance

Notes: accepted by CVPR 2026

Click to view abstract

Abstract:Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose RobustLT, a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RobustLT consistently enhances adversarial robustness and class balance on long-tailed datasets. The code is available at this https URL.

52. 【2605.13381】Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

Link: https://arxiv.org/abs/2605.13381

Authors: Chiara Musso, Joy Battocchio, Andrea Montibeller, Giulia Boato

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Vision Transformers, modern deepfake detection, AI-generated synthetic images, increasingly realistic, deepfake detection

Notes

Click to view abstract

Abstract:As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector's ViT backbone alone and operates entirely within the target detector's feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.

53. 【2605.13375】GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Link: https://arxiv.org/abs/2605.13375

Authors: Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: prohibitive computational overhead, incurs prohibitive computational, tokens incurs prohibitive, Vision-Language Models, visual tokens incurs

Notes: 10 pages, 11 figures

Click to view abstract

Abstract:In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
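
GRPO is commonly described as scoring each sampled action against the other samples drawn for the same input, rather than against a learned value baseline. A minimal sketch of that group-relative normalization follows, assuming hypothetical scalar rewards for several candidate token-keep masks of one image. The reward values, the function name, and the use of the population standard deviation are assumptions for illustration, not details from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against std == 0
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four pruning masks sampled for the same input.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Masks that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update needs no separate critic network.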

54. 【2605.13366】Neural Surrogate Forward Modelling For Electrocardiology Without Explicit Intracellular Conductivity Tensor

Link: https://arxiv.org/abs/2605.13366

Authors: Shaheim Ogbomo-Harmitt, Cesare Magnetti, Jakub Grzelak, Oleg Aslanidi

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurate forward modelling, non-invasive cardiac electrophysiology, Accurate forward, cardiac electrophysiology, highly disorganised

Notes: Accepted into the 9th International Conference on Computational and Mathematical Biomedical Engineering (CMBE2026)

Click to view abstract

Abstract:Accurate forward modelling is essential for non-invasive cardiac electrophysiology, particularly in atrial fibrillation, where electrical activation is highly disorganised. Conventional physics-based forward models require explicit specification of intracellular conductivity tensors, which are not directly measurable in clinical practice and introduce structural modelling errors. This proof-of-concept study presents a deep learning approach that learns a direct mapping from left atrial intracellular electrical potentials to far-field ECGs without requiring explicit intracellular conductivity inputs at inference time. Despite training on only 74 subjects, the model achieved an $R^2$ of $0.949 \pm 0.037$, highlighting potential to reduce structural uncertainty and improve non-invasive AF assessment.

55. 【2605.13349】Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints

Link: https://arxiv.org/abs/2605.13349

Authors: Haoyang Hu, Masataka Seo, Yen-Wei Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant traction, Diffusion-based point editing, applying localized perturbations, manipulate image semantics, Diffusion-based point

Notes: ICASSP 2026 oral

Click to view abstract

Abstract:Diffusion-based point editing methods have gained significant traction in image editing tasks due to their ability to manipulate image semantics and fine details by applying localized perturbations on the manifold of the noise latent. However, these approaches face several limitations. Traditional point-based editing relies on pairs of handle and target points to define motion trajectories, which can introduce ambiguity or unnecessary alterations. Furthermore, when the distance between the handle and target points is large, the accumulated perturbations often cause the noise latent to deviate from the inversion score trajectory, resulting in unnatural artifacts. To address these issues in global editing tasks, we introduce a CLIP-based model to evaluate and guide intermediate editing steps, ensuring that the generated results remain semantically aligned. Additionally, we propose a prior-preservation loss that constrains the optimized latent code to stay within the sampling space of the diffusion prior, improving consistency with the original data distribution and ensuring that the model generates images along a familiar score trajectory. For fine-grained tasks, we present a directionally-weighted point tracking mechanism that steers the editing process toward the target direction within similar feature regions. This improves both the tracking accuracy and generation quality, while also reducing the editing time.

56. 【2605.13335】Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Link: https://arxiv.org/abs/2605.13335

Authors: Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remember objects, actions fail, Embodied agents, household environments, recover when actions

Notes: Project page: [this https URL](https://sj-li.com/PROJ/Ego2World/)

Click to view abstract

Abstract:Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

57. 【2605.13333】Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

Link: https://arxiv.org/abs/2605.13333

Authors: Junhyuk Jeon, Seokhyeon Hong, Junyong Noh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: generating realistic human, express fine-level nuances, realistic human motions, text-driven diffusion model, pretrained text-driven diffusion

Notes: Accepted to SIGGRAPH 2026. Project page: [this https URL](https://junhyukjeon.github.io/projects/style-salad/)

Click to view abstract

Abstract:Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.

58. 【2605.13328】What Limits Vision-and-Language Navigation?

Link: https://arxiv.org/abs/2605.13328

Authors: Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied intelligence, cornerstone of embodied, VLN, real-world navigation consistency, enhance real-world navigation

Notes

Click to view abstract

Abstract:Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: this https URL.

59. 【2605.13322】KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

Link: https://arxiv.org/abs/2605.13322

Authors: Richard Sproat, Stefano Peluchetti

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: natural test case, symbolic choices, important part, natural test, test case

Notes: Preprint

Click to view abstract

Abstract:Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.

60. 【2605.13316】Test-time Sparsity for Extreme Fast Action Diffusion

Link: https://arxiv.org/abs/2605.13316

Authors: Kangye Ji, Yuan Meng, Jianbo Zhou, Ye Li, Chen Tang, Zhi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: incurs heavy computational, heavy computational costs, computational costs owing, iterative denoising nature, Action diffusion excels

Notes

Click to view abstract

Abstract:Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at this https URL.

61. 【2605.13306】Color Constancy in Hyperspectral Imaging via Reduced Spectral Spaces

Link: https://arxiv.org/abs/2605.13306

Authors: G. Dofri Vidarsson, Liying Lu, Sabine Süsstrunk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: infer scene illumination, reflectance and lighting, trichromatic RGB images, aims to infer, measurements despite intrinsic

Notes

Click to view abstract

Abstract:Illuminant estimation aims to infer scene illumination from image measurements despite intrinsic ambiguities between surface reflectance and lighting. Most existing methods operate on trichromatic RGB images and are therefore fundamentally limited by the restricted spectral information available. Hyperspectral imaging provides a much richer representation of scene radiance and has the potential to alleviate these ambiguities. However, its high dimensionality poses computational and statistical challenges. In this work, we systematically study the effect of spectral dimensionality and representation choice on illuminant estimation performance using hyperspectral data. We adopt the practical and effective Color-by-Correlation (CbC) framework as the estimation backbone and analyze its behavior under different spectral dimensionality reduction strategies. Our results offer practical insights into how hyperspectral information can be efficiently exploited for illuminant estimation and identify conditions under which compact spectral representations outperform conventional RGB-based approaches. The code is available at this https URL.

62. 【2605.13293】Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

Link: https://arxiv.org/abs/2605.13293

Authors: Shiyu Tan, Zixuan Zhao, Hao Gao, Zhiheng Chen, Xiaolong Yin, Enya Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Boundary Representation, reconstructing high-quality BReps, single-view images remains, images remains challenging, remains challenging due

Notes: Accepted by SIGGRAPH 2026 Conference

Click to view abstract

Abstract:Boundary Representation (BRep) is the standard format for Computer-Aided Design (CAD), yet reconstructing high-quality BReps from single-view images remains challenging due to the complexity of topological constraints and operation sequences. We present Img2CADSeq, a multi-stage pipeline that overcomes these limitations by encoding CAD sequences into a three-level hierarchical codebook. Guided by an importance prioritization, this strategy values profiles over details, compressing long sequences into a stable discrete latent space. To bridge the modality gap, we leverage a coarse-to-fine point cloud intermediate, aligning 2D visual features with 3D CAD sequences via contrastive learning to condition a VQ-Diffusion model. Supported by newly introduced CAD-220K and PrintCAD datasets, our approach ensures robust industrial domain adaptation. Extensive experiments demonstrate that Img2CADSeq significantly outperforms state-of-the-art methods, producing standard STEP files that can be directly used in commercial CAD software.

63. 【2605.13277】Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.13277

Authors: Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Visual evidence selection, multimodal retrieval-augmented generation, Visual evidence, existing methods typically, methods typically rely

Notes: Accepted to ACL 2026

Click to view abstract

Abstract:Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
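
One simple way to read "information gain induced on a model's output distribution" is as the drop in entropy of the answer distribution once a piece of evidence is added. The sketch below ranks candidate images by that quantity. This is an assumed reading for illustration: the paper's exact definition, its latent helpfulness variable, and its surrogate models are not reproduced here, and the distributions and image names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(without_evidence, with_evidence):
    """Entropy reduction of the answer distribution after adding the evidence."""
    return entropy(without_evidence) - entropy(with_evidence)


# Hypothetical answer distributions over four options.
baseline = [0.25, 0.25, 0.25, 0.25]          # no visual evidence
candidates = {
    "img_a": [1.0, 0.0, 0.0, 0.0],           # evidence settles the answer
    "img_b": [0.5, 0.5, 0.0, 0.0],           # evidence rules out two options
    "img_c": [0.25, 0.25, 0.25, 0.25],       # evidence changes nothing
}

gains = {name: information_gain(baseline, dist) for name, dist in candidates.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked, gains)
```

An image that is semantically similar to the query but leaves the answer distribution unchanged, like `img_c` here, gets zero gain, which is the mismatch between relevance and utility that the abstract points to.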

64. 【2605.13258】X-Restormer++: 1st Place Solution for the UG2+ CVPR 2026 All-Weather Restoration Challenge

Link: https://arxiv.org/abs/2605.13258

Authors: Youwei Pan, Leilei Cao, Yingfang Zhu, Fengjie Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: All-weather Conditions, Head Transposed Attention, Multi-DConv Head Transposed, present our winning, winning solution

Notes

Click to view abstract

Abstract:In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Conditions. Our method is built upon the strong baseline framework X-Restormer, which effectively captures both channel-wise global dependencies and spatially-local structural information through its dual-attention design (Multi-DConv Head Transposed Attention and Overlapping Cross-Attention). To further boost the restoration performance, we propose several key improvements. First, we integrate the spatially-adaptive input scaling mechanism from Restormer-Plus to dynamically adjust the spatial weights of the input image, enhancing spatial adaptability. Second, to better preserve structural details and edge information, we introduce a novel Gradient-Guided Edge-Aware (GGEA) loss, which is combined with L1 and Multi-Scale SSIM losses in a unified training objective. Third, we significantly expand the training data by incorporating an extra 24,500 degraded-clean image pairs from FoundIR and WeatherBench alongside the original WeatherStream dataset. With these strategies, our proposed method ranks 1st in the challenge.

65. 【2605.13228】ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Link: https://arxiv.org/abs/2605.13228

Authors: Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, Jiang Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: motivating tool-augmented video, active evidence seeking, requires active evidence, tool-augmented video agents, complex question answering

Notes

Click to view abstract

Abstract:Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.

66. 【2605.13223】Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

Link: https://arxiv.org/abs/2605.13223

Authors: Abdelrahman Eldesokey, Merey Ramazanova, Ahmad Sait, Ansar Khangeldin, Karen Sanchez, Tong Zhang, Bernard Ghanem

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generation has advanced, advanced rapidly, critical as performance, evaluation, performance differences

Comments: Project Page: [this https URL](https://abdo-eldesokey.github.io/skill-aligned-eval/)

Abstract:Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.

67. 【2605.13202】STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

Link: https://arxiv.org/abs/2605.13202

Authors: Hongli Liu, Yu Wang, Shengjie Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Few-shot action recognition, Few-shot action, Adaptive Representation Learning, action recognition, action categories

Comments: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Abstract:Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework, consisting of a semantic-alignment component and a temporal-aware component, effectively bridging the semantic and temporal gaps and transferring the sequence modeling capability of Mamba into the FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at this https URL.

68. 【2605.13193】FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Link: https://arxiv.org/abs/2605.13193

Authors: Geng Li, Yuxin Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: humans actively search, encountering unfamiliar objects, compare visual details, closed-book classification problem, classification problem

Comments:

Abstract:Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

69. 【2605.13182】DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

Link: https://arxiv.org/abs/2605.13182

Authors: Zheng Chen, Ruofan Yang, Jin Han, Dehua Song, Zichen Zou, Chunming He, Yong Guo, Yulun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown strong performance, VSR, VFI, shown strong, strong performance

Comments: Code is available at: [this https URL](https://github.com/zhengchen1999/DiffST)

Abstract:Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17$\times$ faster than previous diffusion-based STVSR methods. Code is available at: this https URL.

70. 【2605.13179】Does Engram Do Memory Retrieval in Autoregressive Image Generation?

Link: https://arxiv.org/abs/2605.13179

Authors: Jinghao Wang, Qiyuan He, Chunbin Gu, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: local token patterns, recurring local token, improve large language, Transformer layers, injected into Transformer

Comments: 9 pages

Abstract:The Engram module -- a hash-keyed, O(1) associative memory injected into Transformer layers -- was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial $n$-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256x256. Across a sweep of backbone-to-memory budget ratios $\rho{\in}[0.17, 0.90]$, every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate (g=0.10) matches or beats the learned gate -- inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to $\mathcal{N}(0, 1)$ noise costs only $\Delta\text{FID}{=}0.10$ and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.

71. 【2605.13178】CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

Link: https://arxiv.org/abs/2605.13178

Authors: Sangin Lee, Yukyung Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large vision-language models, substantial computational overhead, tokens typically constitute, vision-language models, leading to substantial

Comments: 18 pages, 8 figures

Abstract:In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction. Our code is available at this https URL.
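The reversed-ranking step described in this abstract can be sketched in a few lines. This is a minimal illustration, not LiteLVLM itself: the function name `keep_lowest_similarity`, the use of cosine similarity, and the toy 2-D embeddings are all assumptions, and the context-token recovery step mentioned in the abstract is omitted.

```python
# Minimal sketch of reversed similarity ranking for token pruning.
# Function names and the toy embeddings are assumptions for illustration.
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_lowest_similarity(
    visual_tokens: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` visual tokens least similar to the text."""
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]
    # Reversed ranking: sort ascending so the least similar tokens come first.
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(order[:budget])


tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text = [1.0, 0.0]
kept = keep_lowest_similarity(tokens, text, budget=2)
print(kept)  # [1, 3]
```

A conventional pruner would keep the highest-similarity tokens instead. The only change here is the sort direction.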

72. 【2605.13169】PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

Link: https://arxiv.org/abs/2605.13169

Authors: Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, dominant perspective-image paradigm, Multimodal large, large language models, perspective-image paradigm

Comments:

Abstract:Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.

73. 【2605.13163】LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters

Link: https://arxiv.org/abs/2605.13163

Authors: Beomjin Ahn, Jungmin Kwon, Chanyong Jung, Jaewook Chung

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: enable efficient on-device, efficient on-device generative, intellectual property leakage, adapters enable efficient, Foundation models

Comments: Accepted to ICIP 2026

Abstract:Foundation models and low-rank adapters enable efficient on-device generative AI but raise risks such as intellectual property leakage and model recovery attacks. Existing defenses are often impractical because they require retraining or access to the original dataset. We propose LoREnc, a training-free framework that secures both FMs and adapters via spectral truncation and compensation. LoREnc suppresses dominant low-rank components of FM weights, compensates for the missing information in authorized adapters, and further applies orthogonal reparameterization to obscure structural fingerprints of the protected adapter. Unauthorized users produce structurally collapsed outputs, while authorized users recover exact performance. Experiments demonstrate that LoREnc provides strong protection against model recovery with under 1% computational overhead.
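One way to read "spectral truncation and compensation" is through the singular value decomposition of a weight matrix. The derivation below is a sketch under assumptions: the rank $r$, the factor names $B$ and $A$, and the orthogonal matrix $Q$ are illustrative notation and are not taken from the paper.

```latex
\begin{align}
W &= U \Sigma V^\top
  && \text{(SVD of a foundation-model weight)} \\
W_{\text{enc}} &= W - U_r \Sigma_r V_r^\top
  && \text{(remove the top-$r$ spectral components)} \\
B &= U_r \Sigma_r Q, \quad A = Q^\top V_r^\top, \quad Q Q^\top = I_r
  && \text{(compensating low-rank factors)} \\
W_{\text{enc}} + B A &= W - U_r \Sigma_r V_r^\top + U_r \Sigma_r Q Q^\top V_r^\top = W
  && \text{(exact recovery with the adapter)}
\end{align}
```

Under this reading, a user holding only $W_{\text{enc}}$ is missing the dominant singular directions, while the product $BA$ restores them exactly. Any orthogonal $Q$ changes the individual factors $B$ and $A$ without changing their product.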

74. 【2605.13161】A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

Link: https://arxiv.org/abs/2605.13161

Authors: Yiyun Zhou, Zhonghua Jiang, Wenkang Han, Kunxi Li, Mingjing Xu, Chang Yao, Jingyuan Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Efficient transfer learning, fixed fine-tuning paradigm, large-scale vision-language models, transfer learning methods, Efficient transfer

Comments: Accepted by IJCAI 2026

Abstract:Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.

75. 【2605.13158】Unifying Physically-Informed Weather Priors in A Single Model for Image Restoration Across Multiple Adverse Weather Conditions

Link: https://arxiv.org/abs/2605.13158

Authors: Jiaqi Xu, Xiaowei Hu, Lei Zhu, Pheng-Ann Heng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: weather conditions aims, Image restoration, adverse weather conditions, scene visibility analysis, aims to develop

Comments: Accepted by TCSVT

Abstract:Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle's distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. Existing methods fail to consider this weather-specific physical visual process; thus, the restoration performance is limited. In this work, we analyze the common visual factors in adverse weather conditions and present a unified imaging model that considers the individually visible particles and fog-like aggregate scattering effects. Further, we design a novel weather-prior-based network, which leverages the weather-related prior information to help recover the scene by enhancing the features using the estimated occlusion and transmission. Experimental results in multiple adverse scenarios show the superiority of our method against state-of-the-art methods.

76. 【2605.13156】Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

Link: https://arxiv.org/abs/2605.13156

Authors: Jiaxin Liu, Ding Zhong, Yue Wang, Zhidong Yang, Zhaolu Kang, Guangyuan Dong, Qishi Zhan, Pengcheng Fang, Aofan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal reasoning tasks, demonstrated remarkable capabilities, Vision-language models, natural language understanding, bridging visual perception

Comments:

Abstract:Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.

77. 【2605.13155】Pareto-Guided Optimal Transport for Multi-Reward Alignment

Link: https://arxiv.org/abs/2605.13155

Authors: Ying Ba, Tianyu Zhang, Mohan Zhou, Yalong Bai, Wenyi Mo, Guiwei Zhang, Bing Su, Ji-Rong Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, achieving robust alignment, significant challenge, achieved remarkable, remarkable progress

Comments: Accepted to ICML 2026

Abstract:Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
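The prompt-specific Pareto frontier mentioned in this abstract rests on the standard notion of dominance between reward vectors. The sketch below shows only that frontier-extraction step on toy data. The function names and sample rewards are assumptions, and the optimal-transport mapping of dominated samples toward the frontier is not shown.

```python
# Sketch of Pareto-frontier extraction over multi-reward scores.
# Names and sample values are illustrative assumptions.
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(rewards: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of samples that no other sample dominates."""
    return [
        i
        for i, candidate in enumerate(rewards)
        if not any(
            dominates(other, candidate)
            for j, other in enumerate(rewards)
            if j != i
        )
    ]


# Each tuple holds two reward scores for one generated image of the same prompt.
samples = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.9, 0.1)]
front = pareto_front(samples)
print(front)  # [0, 1, 3]
```

Samples 2 and 4 are dominated (by samples 1 and 0 respectively), so in a frontier-guided scheme they would be the ones moved toward the frontier. This pairwise check is quadratic in the number of samples, which is acceptable for the small per-prompt batches assumed here.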

78. 【2605.13152】EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

Link: https://arxiv.org/abs/2605.13152

Authors: Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: synthetic pretraining data, real-world point clouds, geometric domain gap, point clouds, bridges the geometric

Comments: CVPR 2026. Code and data are available at: [this https URL](https://github.com/vLAR-group/EvObj)

Abstract:We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.

79. 【2605.13151】GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

Link: https://arxiv.org/abs/2605.13151

Authors: Jiyong Rao, Yu Wang, Shengjie Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Category-agnostic pose estimation, Category-agnostic pose, pose estimation, aims to localize, query images

Comments: Accepted in ICLR 2026

Abstract:Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a Generative-based framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedded layer-wise into the Graph Transformer Decoder for progressive structural priors refinement. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.

80. 【2605.13148】Understanding Generalization through Decision Pattern Shift

Link: https://arxiv.org/abs/2605.13148

Authors: Huiqi Deng, Yibo Li, Quanshi Zhang, Peng Zhang, Hongbin Pei, Xia Hu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, Understanding why deep, unseen samples remains, model internal decision, internal decision mechanism

Comments: 14 pages, 12 figures, computer vision and pattern recognition

Abstract:Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model's internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample's decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model's decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson $r > 0.8$), revealing generalization as a systematic drift in the model's internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.
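The idea of measuring how far a sample's decision pattern drifts from the class-average pattern can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: the per-channel contribution vectors are taken as given (the GradCAM extraction is not shown), and cosine distance is an assumed choice of discrepancy, since the abstract does not specify one.

```python
# Sketch of a decision-pattern discrepancy score.
# Assumptions: patterns are precomputed per-channel contribution vectors,
# and cosine distance stands in for the unspecified DPS discrepancy.
import math
from typing import List, Sequence


def mean_pattern(patterns: Sequence[Sequence[float]]) -> List[float]:
    """Channel-wise average of the training patterns for one class."""
    count = len(patterns)
    return [sum(channel) / count for channel in zip(*patterns)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Toy 3-channel patterns from training samples of a single class.
train_patterns = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
class_avg = mean_pattern(train_patterns)

# A test sample relying on the same channels versus one relying on others.
aligned = cosine_distance([0.9, 0.1, 0.0], class_avg)
shifted = cosine_distance([0.0, 0.1, 0.9], class_avg)
print(aligned < shifted)  # True
```

The shifted sample scores close to 1 because it draws on a channel the class rarely used during training, which is the kind of drift the abstract associates with generalization failure.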

81. 【2605.13140】Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

Link: https://arxiv.org/abs/2605.13140

Authors: Sangin Lee, Seokjun Kwon, Jeongmin Shin, Namil Kim, Yukyung Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: General object detection, General object, object detection, struggles to detect, detect objects

Comments:

Abstract:General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on this https URL.

82. 【2605.13129】Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

Link: https://arxiv.org/abs/2605.13129

Authors: Nikitas Chatzis, Marios Loizou, Evangelos Kalogerakis

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: typically static, outputs are typically, lack the skeletal, skinning weights required, Recent

Comments:

Abstract:Recent 3D generative models can synthesize high-quality assets, but their outputs are typically static: they lack the skeletal rigs, joint hierarchies, and skinning weights required for animation. This limits their use in games, film, simulation, virtual agents, and embodied AI, where assets must not only look plausible but also move plausibly. We introduce Rigel3D, a generative method for animation-ready 3D assets represented as rigged meshes. Unlike post-hoc auto-rigging methods that attach rigs to completed shapes, our method jointly models geometry and rig structure through coupled surface and skeleton structured latent representations. A rig-aware autoencoder decodes these representations into mesh geometry, skeleton topology, joint coordinates, and skinning weights, while a two-stage latent generative model synthesizes both surface and skeleton representations for image-conditioned generation. To support downstream animation workflows, we further introduce an open-vocabulary joint labeling module that embeds generated joints into a shared vision-language space, enabling correspondence to arbitrary retargeting templates. Experiments on large-scale rigged asset datasets demonstrate that our method generates diverse, high-quality animation-ready assets and outperforms existing rigging baselines across multiple metrics.

83. 【2605.13122】Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

Link: https://arxiv.org/abs/2605.13122

Authors: Jingxuan He, Xiyu Wang, Yunke Wang, Mengyu Zheng, Chang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: implicitly requires identifying, Instruction-based image editing, recently demonstrated strong, demonstrated strong capability, natural language instructions

Comments:

Abstract:Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.

84. 【2605.13119】Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.13119

Authors: Zixing Lei, Changxing Liu, Yichen Xiong, Minhao Xiong, Yuanzhuo Ding, Zhipeng Zhang, Weixin Li, Siheng Chen

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: robot action executors, effective robot action, extended closed-loop planning, diverse physical operations, physical operations

Abstract:Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of $\pi_{0.5}$ by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.

85. 【2605.13111】Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

Link: https://arxiv.org/abs/2605.13111

Authors: Jiayu Chen, Junbei Tang, Wenbiao Zhao, Maoliang Li, Jiayi Luo, Zihao Zheng, Jiawei Yang, Guojie Luo, Xiang Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long video synthesis, Autoregressive video generation, open-ended long video, long-term degradation caused, Autoregressive video

Abstract:Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: this https URL.
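
To make the idea of head-specific retention concrete, here is a minimal sketch of how three head types could keep different sets of past frames, which naturally yields ragged cache lengths. The abstract does not give the actual policies, so the rules and every constant below (`ANCHOR_WINDOW`, `WAVE_PERIOD`, `WAVE_SLOTS`, `VEIL_INITIAL`, `VEIL_RECENT`) are hypothetical illustrations, not the authors' method.

```python
# Hypothetical per-head-type budgets; none of these values come from the paper.
ANCHOR_WINDOW = 16  # anchor heads: broad long-range context
WAVE_PERIOD = 4     # wave heads: periodic temporal dependencies
WAVE_SLOTS = 3      # number of periodic frames a wave head retains
VEIL_INITIAL = 1    # veil heads: initial frames ...
VEIL_RECENT = 2     # ... plus the frames adjacent to the current one


def retained_frames(head_type, num_past_frames):
    """Return the sorted indices of past frames whose KV entries a head keeps."""
    past = list(range(num_past_frames))
    if head_type == "anchor":
        # Keep a long contiguous window of recent history.
        return past[-ANCHOR_WINDOW:]
    if head_type == "wave":
        # Keep frames whose distance to the current frame is a multiple of the period.
        periodic = [f for f in past if (num_past_frames - f) % WAVE_PERIOD == 0]
        return periodic[-WAVE_SLOTS:]
    if head_type == "veil":
        # Keep the first frames and the most recent frames, without duplicates.
        return sorted(set(past[:VEIL_INITIAL] + past[-VEIL_RECENT:]))
    raise ValueError(f"unknown head type: {head_type}")


for head in ("anchor", "wave", "veil"):
    kept = retained_frames(head, num_past_frames=20)
    print(head, len(kept), kept)
```

Because each head type returns a different number of frames, an attention kernel consuming these caches would need to handle per-head lengths, which is the role the abstract assigns to ragged-cache attention.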

86. 【2605.13108】Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

Link: https://arxiv.org/abs/2605.13108

Authors: Muhammad Shahid Jabbar, Muhammad Sohail Ibrahim, Taha Hasan Masood Siddique, Kejie Huang, Shujaat Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Face presentation attack, presentation attack detection, varying capture conditions, Face presentation, makeup-induced appearance manipulation

Comments: Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

Abstract:Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.
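
The abstract names logit distillation but not the exact loss. As a minimal sketch, assuming the common temperature-softened KL formulation (an assumption, not a detail confirmed by the abstract), the teacher-to-student transfer could look like this. The function names, the temperature value, and the example logits are all illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Illustrative two-class logits (bona fide vs. attack); values are made up.
teacher = [4.0, -1.0]
close_student = [3.5, -0.5]
far_student = [-1.0, 4.0]

print(logit_distillation_loss(close_student, teacher))
print(logit_distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, so the RGB-only student is pushed toward the flow-augmented teacher's decisions without ever computing optical flow itself.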

87. 【2605.13093】RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering

Link: https://arxiv.org/abs/2605.13093

Authors: Hoang Chuong Nguyen, Renjie Wu, Jose M. Alvarez, Miaomiao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling feed-forward synthesis, Splatting has recently, Gaussian Splatting, Gaussian scale estimation, input views

Abstract:Generalizable 3D Gaussian Splatting has recently emerged as an efficient approach for novel-view synthesis, enabling feed-forward synthesis from only a few input views. However, existing pixel-wise feed-forward methods suffer from over-bright renderings when the number of input views varies during inference, as well as insufficient supervision for accurate Gaussian scale estimation, which leads to hole artifacts, particularly in high-resolution renderings. To address these issues, we identify that the over-brightness is caused by the varying number of overlapping Gaussians and propose a simple alpha normalization strategy to maintain brightness consistency across different numbers of input views. In addition, we introduce an auxiliary 3D sampling-based regularizer to improve Gaussian scale estimation, thereby mitigating hole artifacts in high-resolution rendering. Experiments on benchmark datasets demonstrate that our method significantly improves baseline models under varying input-view and high-resolution rendering settings.

88. 【2605.13081】PRA-PoE: Robust Alzheimer's Diagnosis with Arbitrary Missing Modalities

Link: https://arxiv.org/abs/2605.13081

Authors: Guangqian Yang, Ye Du, Wenlong Hou, Qian Niu, Shujun Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world Alzheimer disease, Alzheimer disease, real-world Alzheimer, assessment and pose, modality subsets differs

Comments: Early accepted by MICCAI 2026

Abstract:Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
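
The closed-form fusion mentioned above can be illustrated with the standard product of Gaussian experts: precisions (inverse variances) add, and the fused mean is the precision-weighted average of the expert means. The sketch below shows only that textbook formula for one scalar latent dimension. It assumes no prior expert and does not reproduce the paper's UA-PoE module, and the function name is illustrative.

```python
def product_of_gaussian_experts(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, variance_i) in closed form."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    # Precision-weighted mean: uncertain experts (large variance) contribute less.
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


# Two equally confident experts meet in the middle.
print(product_of_gaussian_experts([0.0, 2.0], [1.0, 1.0]))  # (1.0, 0.5)

# An expert with 100x the variance barely moves the fused mean.
print(product_of_gaussian_experts([0.0, 2.0], [1.0, 100.0]))
```

A missing modality can simply be left out of the lists, which is one reason this kind of fusion suits arbitrary modality subsets.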

89. 【2605.13080】Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.13080

Authors: Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: entire image uniformly, intended description, humans describe, process the entire, Gaze Attention

Abstract:When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings (stored as key-value caches) into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
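
The select-then-attend step described in the abstract can be sketched in a few lines. The abstract does not say how region descriptors are built or scored, so this sketch assumes the descriptor is the mean of a region's keys and that regions are ranked by dot product with the query. Both choices, and all names below, are hypothetical, and the learnable context tokens are omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def gaze_attention(query, regions, top_k):
    """regions is a list of (keys, values) pairs, each a list of vectors."""
    # One lightweight descriptor per region (assumed here: the mean key).
    descriptors = [mean_vector(keys) for keys, _ in regions]
    scores = [dot(query, d) for d in descriptors]
    selected = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Restrict softmax attention to the KV entries of the selected regions.
    keys = [k for i in selected for k in regions[i][0]]
    values = [v for i in selected for v in regions[i][1]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    output = [sum(w * v[d] for w, v in zip(weights, values)) / total
              for d in range(len(values[0]))]
    return sorted(selected), output


regions = [
    ([[1.0, 0.0], [0.9, 0.1]], [[1.0, 0.0], [1.0, 0.0]]),   # aligned with the query
    ([[-1.0, 0.0], [-1.0, 0.2]], [[0.0, 1.0], [0.0, 1.0]]),  # opposed to the query
    ([[0.0, 1.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]),    # orthogonal to the query
]
selected, output = gaze_attention([1.0, 0.0], regions, top_k=1)
print(selected, output)
```

With `top_k=1`, only the two KV entries of the best-matching region enter the softmax instead of all six, which is the source of the savings the abstract reports.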

90. 【2605.13073】HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization

Link: https://arxiv.org/abs/2605.13073

Authors: Yulei Kang, Tianze Zhu, Jian-Fang Hu, Jianhuang Lai, Wei-Shi Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Splatting remains challenging, Gaussian Splatting remains, remains challenging due, Splatting remains, Gaussian Splatting

Abstract:In-the-wild 3D Gaussian Splatting remains challenging due to transient distractors and illumination-induced cross-view appearance inconsistencies. Existing methods mainly rely on image-level masking to suppress unreliable supervision, but masking alone cannot fully eliminate residual occlusions or resolve illumination-induced inconsistencies, both of which can introduce conflicting cross-view gradients. These unresolved conflicts may destabilize Gaussian optimization and lead to visible reconstruction artifacts. We propose a conflict-aware 3DGS framework that addresses this problem from both image-space supervision and gradient-level optimization. Semantic Consistency-Guided Masking learns pixel-wise consistency scores to adaptively refine prior masks and suppress unreliable supervision before gradient formation. A dual-view Conflict-Aware Gradient Harmonization strategy further reconciles view-specific gradients by mutually rotating them into an orthogonal configuration, reducing negative directional interference across views. We also introduce conflict-aware densification and pruning to stabilize Gaussian growth and remove persistently conflicting primitives. Extensive experiments on standard in-the-wild benchmarks demonstrate that our method achieves state-of-the-art rendering quality under complex transient distractors and cross-view inconsistencies.

91. 【2605.13062】Edit-Compass and EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Link: https://arxiv.org/abs/2605.13062

Authors: Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, Yuanxing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, Recent image editing, Recent image, multimodal understanding, achieved remarkable

Abstract:Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

92. 【2605.13059】BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability

Link: https://arxiv.org/abs/2605.13059

Authors: Guangqian Yang, Tong Ding, Wenlong Hou, Yue Xun, Ye Du, Qian Niu, Shujun Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: initial clinical evaluation, selectively add sequences, Clinical diagnostic workups, clinical evaluation, modality escalation pathway

Comments: Early accepted by MICCAI 2026

Abstract:Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume fixed data modalities as the model inputs. In this paper, we present BrainAnytime, a unified pretraining framework pretrained on 34,899 3D brain scans from five datasets that support brain image analysis under arbitrary modality availability spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime largely outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models on most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at this https URL.

93. 【2605.13049】Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images

Link: https://arxiv.org/abs/2605.13049

Authors: Xingyuan Li, Haoyuan Xu, Xingyue Zhu, Jun Ma, Yang Zou, Zhiying Jiang, Jinyuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visible Image Fusion, faces inherent misalignments, unregistered conditions faces, conditions faces inherent, Visible Image

Comments: 10 pages, 5 figures, 4 tables

Abstract:Infrared and Visible Image Fusion (IVIF) has shown promise in visual tasks under challenging environments, but fusion under unregistered conditions faces inherent misalignments. Current studies to solve them either predict the deformation parameters coarse-to-fine (i.e., coarse registration and fine registration) or estimate the deformation fields in multi-scales for registration. Though straightforward, they overlook the cumulative errors in registration, which contaminate the fusion stage and severely deteriorate the resulting images. We introduce the Spatial-Frequency Registration and Fusion (SFRF) framework, which incorporates uncertainty estimation and infrared thermal radiation distribution consistency into a unified pipeline to handle the error accumulation for robust registration and fusion across both spatial and frequency domains. Specifically, SFRF constructs a Multi-scale Iterative Registration (MIR) framework that iteratively refines the deformation field across scales, leveraging uncertainty estimation at each stage to mitigate error accumulation and enhance alignment accuracy dynamically. To ensure the accurate alignment of infrared thermal distributions during registration, thermal radiation distribution consistency is employed as a frequency-domain supervisory signal, promoting global consistency in the frequency domain. Based on the spatial-frequency alignment, SFRF further adopts a Dual-branch Spatial-Frequency Fusion (DSFF) module, which incorporates spatial geometric features and frequency distribution information to reconstruct visually appealing images. SFRF achieves impressive performance across diverse datasets.

94. 【2605.13047】Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Link: https://arxiv.org/abs/2605.13047

Authors: Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains a challenge, perception for high-level, Evaluating, Evaluating whether large, semantic

Abstract:Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at this https URL.

95. 【2605.13041】EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing

Link: https://arxiv.org/abs/2605.13041

Authors: Inwoo Hwang, Donggeun Lim, Hojun Jang, Young Min Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world interactive online, recent advances, advances in embodied, embodied agents, real-world interactive

Comments: Project page: [this https URL](https://inwoohwang.me/EgoForce)

Abstract:With recent advances in embodied agents and AR devices, egocentric observations are readily available as input for real-world interactive online applications. However, egocentric viewpoints can only sporadically observe hands, in addition to the estimated head trajectory. We propose EgoForce, an online framework for reconstructing long-term full-body motion from noisy egocentric input. While existing generative frameworks can robustly handle noisy and sparse measurements, they assume a fixed-length observation window is available and are thus not suitable for real-time applications. Faster inference often relies on autoregressive prediction, sacrificing robustness. In contrast, we adopt a diffusion-based method with a temporally asymmetric noise schedule inspired by Diffusion Forcing. Specifically, our approach models temporally evolving uncertainty and incrementally denoises states as new streaming observations arrive. Combined with a noise-robust imputation strategy, EgoForce progressively generates stable and coherent full-body motion under strict causal constraints. Experiments demonstrate that our online framework outperforms existing online and offline methods, enabling long-horizon, full-body motion reconstruction in challenging egocentric scenarios.

96. 【2605.13038】CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

Link: https://arxiv.org/abs/2605.13038

Authors: Liangjing Shao, Beilei Cui, Hongliang Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: including depth estimation, estimation including depth, Geometric estimation including, including depth, crucial technique

Comments: Early Accepted by MICCAI 2026

Abstract:Geometric estimation, including depth estimation and scene reconstruction, is a crucial technique for colonoscopy that can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain due to the narrow and enclosed space of the colon, while there is a large feature gap between simulated data and realistic data caused by artifacts and illumination. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. First, we propose an illumination-aware supervision module based on the Retinex theory to address illumination diversity in different colonoscopy scenes. Moreover, a structure-aware perception module is proposed based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model, trained solely on simulated data, achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.

97. 【2605.13034】ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Link: https://arxiv.org/abs/2605.13034

Authors: Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, Xiang Jing

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: large language models, Recent deep research, Recent deep, produce long, improved the ability

Abstract:Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

98. 【2605.13027】PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

Link: https://arxiv.org/abs/2605.13027

Authors: Zihang Xu, Xiaoyang Liu, Zheng Chen, Yulun Zhang, Xiaokang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: alter character identity, plausible detail synthesis, Text image super-resolution, visually plausible detail, image super-resolution

Comments: Code is available at [this https URL](https://github.com/faithxuz/PRISM)

Abstract:Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at this https URL.

99. 【2605.13018】OCH3R: Object-Centric Holistic 3D Reconstruction

Link: https://arxiv.org/abs/2605.13018

Authors: Yi Du, Yang You, Xiang Wan, Leonidas Guibas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision, fundamental challenge, challenge in computer, Object-centric scene understanding, Object-Centric Holistic

Abstract:Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.

100. 【2605.13010】Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

Link: https://arxiv.org/abs/2605.13010

Authors: Yilie Huang, Xun Yu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)

Keywords: generative diffusion models, study image inpainting, pretrained diffusion model, pretrained diffusion, diffusion model separately

Abstract:We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor-critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality-speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.

101. 【2605.12967】ImageAttributionBench: How Far Are We from Generalizable Attribution?

Link: https://arxiv.org/abs/2605.12967

Authors: Tingshu Mou, Zhipeng Wei, Chao Gong, Jingjing Chen, Xingjun Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: posing critical challenges, diverse synthetic images, posing critical, misinformation detection, rapid advancement

Abstract:The rapid advancement of generative AI has enabled the creation of highly realistic and diverse synthetic images, posing critical challenges for image provenance and misinformation detection. This underscores the urgent need for effective image attribution. However, existing attribution datasets are constrained by limited scale, outdated generation methods, and insufficient semantic diversity, hindering the development of robust and generalizable attribution models. To address these limitations, we introduce ImageAttributionBench, a comprehensive dataset comprising images synthesized by a wide array of advanced generative models with state-of-the-art (SOTA) architectures. Covering multiple real-world semantic domains, the dataset offers rich diversity and scale to support and accelerate progress in image attribution research. To simulate real-world attribution scenarios, we evaluate several SOTA attribution methods on ImageAttributionBench under two challenging settings: (1) training on a standard balanced split and testing on degraded images, and (2) training and testing on semantically disjoint splits. In both cases, current methods exhibit consistently poor performance, revealing significant limitations in their robustness and generalization to unseen semantic content. Our work provides a rigorous benchmark to facilitate the development and evaluation of future image attribution methods.

102. 【2605.12964】Asymmetric Flow Models

Link: https://arxiv.org/abs/2605.12964

Authors: Hansheng Chen, Jan Ackermann, Minseo Kim, Gordon Wetzstein, Leonidas Guibas

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires modeling high-dimensional, modeling high-dimensional noise, prediction requires modeling, Asymmetric Flow Modeling, velocity prediction requires

Comments: Code: [this https URL](https://github.com/Lakonik/LakonLab) Webpage: [this https URL](https://hanshengchen.com/asymflow)

Abstract:Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256$\times$256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.

103. 【2605.12961】Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering

Link: https://arxiv.org/abs/2605.12961

Authors: Feijiang Li, Zhenxiong Li, Jieting Wang, Zizheng Jiu, Saixiong Liu, Liang Du

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: partition unlabeled image, distinct groups, Image clustering aims, aims to partition, partition unlabeled

Abstract:Image clustering aims to partition unlabeled image datasets into distinct groups. A core aspect of this task is constructing and leveraging prior knowledge to guide the clustering process. Recent approaches introduce semantic descriptions as prior information, most of which rely on matching-based techniques with predefined vocabularies. However, the limited matching space restricts their adaptability to downstream clustering tasks. Moreover, these methods primarily focus on reducing bias to improve performance, frequently overlooking the importance of variance reduction. To address these limitations, we propose GSEC (Image Clustering based on Generative Semantic Guidance and Bi-Layer Ensemble), a framework designed to reduce bias through generative semantic guidance and mitigate variance via ensemble learning. Our method employs Multimodal Large Language Models to generate semantic descriptions and derive image embeddings via weighted averaging. Additionally, a bi-layer ensemble strategy integrates cross-modal information through BatchEnsemble in the inner layer and aligns outputs via an alignment mechanism in the outer layer. Comparative experiments demonstrate that GSEC outperforms 18 state-of-the-art methods across six benchmark datasets, while further analysis confirms its effectiveness in simultaneously reducing both bias and variance. The code is available at this https URL.

104. 【2605.12957】GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Link: https://arxiv.org/abs/2605.12957

Authors: Hanxin Zhu, Cong Wang, Peiyan Tu, Jiayi Luo, Tianyu He, Xin Jin, Zhibo Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including spatial intelligence, domains including spatial, Recent developments, embodied intelligence, spatial intelligence

Abstract:Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: this https URL.

105. 【2605.12954】AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

Link: https://arxiv.org/abs/2605.12954

Authors: Xiao Yang, Yingzhe Ma, Haoxuan Yu, Zixin Li, Ning Qin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: irreversibly discard fine-grained, Long video understanding, densely encode videos, Long video, sparse frame sets

Comments: 9 pages, 4 figures. Authors Xiao Yang and Yingzhe Ma contributed equally

Abstract:Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.
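
The relevance-diversity trade-off that the abstract attributes to its sampler can be illustrated with a generic greedy selection in the style of maximal marginal relevance. This is a minimal sketch, not the authors' AdaRD: the function names, the `trade_off` weight, and the use of cosine similarity over precomputed frame and query features are all assumptions, and the fallback to global clustering is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_frames(frame_feats, query_feat, k, trade_off=0.5):
    """Greedily pick k frame indices, balancing query relevance against redundancy.

    Hypothetical helper: trade_off=1.0 ranks purely by relevance, while lower
    values penalise frames that resemble ones already selected.
    """
    relevance = [cosine(f, query_feat) for f in frame_feats]
    selected = []
    candidates = set(range(len(frame_feats)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any frame chosen so far.
            redundancy = max(
                (cosine(frame_feats[i], frame_feats[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1 - trade_off) * redundancy
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Frames 0 and 1 are near-duplicates; frame 2 is visually different.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(select_frames(frames, [1.0, 0.2], k=2))
```

In the toy call, the second pick skips the near-duplicate frame in favour of the dissimilar one, which is the behaviour a relevance-diversity sampler is meant to show.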

106. 【2605.12953】Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Link: https://arxiv.org/abs/2605.12953

Authors: Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Language-guided segmentation transcends, segment arbitrary target, Large Language Models, natural language instructions, Multimodal Large Language

Abstract:Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

107. 【2605.12952】Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

Link: https://arxiv.org/abs/2605.12952

Authors: Yongjin Cui, Xiaohui Fan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformer interpretation technical, published at ICML, intermediate features-based technical, features-based technical route, Transformer interpretation

Abstract:Grad-ECLIP was published at ICML 2024 and represents a new Transformer interpretation technical route (intermediate features-based). First, this paper demonstrates that the intermediate features-based technical route is not a novel one. Based on the existing attention-based route, we have developed Attention-ECLIP, which is completely equivalent to Grad-ECLIP but with simpler computation. Through both formal derivation and experimental validation, we prove that the intermediate features-based route represented by Grad-ECLIP is actually an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed. The model interpretation results obtained by Grad-ECLIP are not those of the original model, and the interpretation results are misaligned with the model's performance. We analyze the causes of Grad-ECLIP's flaws and propose, or rather, explicitly emphasize two fundamental principles that model interpretation should adhere to in order to avoid similar errors.

108. 【2605.12939】DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

Link: https://arxiv.org/abs/2605.12939

Authors: Xianbing Sun, Jiahui Zhan, Liqing Zhang, Jianfu Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high inference cost, incurs high inference, Recent diffusion, acceleration methods largely, methods largely overlook

Abstract:Recent diffusion- and flow-based VTON methods achieve strong results with pretrained generative models, but their reliance on multi-step sampling incurs high inference cost, while existing acceleration methods largely overlook the intrinsic structure of the try-on task. In this paper, we highlight a key observation: VTON outputs are highly constrained by the conditional inputs, suggesting that the conditional sampling trajectory can be much straighter than that in general image generation, making one-step generation a natural solution. However, limited task-specific data makes training from scratch impractical, forcing existing methods to fine-tune pretrained models whose objectives do not encourage such straight conditional trajectories. Thus, the deviation from an ideal straight path mainly comes from the mismatch between pretrained base models and the conditional nature of try-on generation, rather than from the task itself. Motivated by this insight, we encourage straighter VTON sampling trajectories through three targeted modifications: pure conditional transport, a garment preservation loss, and a self consistency loss. We further introduce a one-step distillation stage. Extensive experiments show that our method achieves state-of-the-art performance with one-step sampling, establishing a new standard for efficient and high-quality VTON.

109. 【2605.12938】CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

Link: https://arxiv.org/abs/2605.12938

Authors: Seonghyun Jin, Youngmin Kim, Sunwoo Park, Jong Chul Ye

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: requires positional encoding, Camera-conditioned video generation, Expectation Positional Encoding, positional encoding, Model-compatible positional encoding

Comments: 17 pages, 8 figures, Under review

Abstract:Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.

110. 【2605.12937】AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

Link: https://arxiv.org/abs/2605.12937

Authors: Jacob Lagogiannis, William Agnew, Rosa I. Arriaga, Sauvik Das

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Anti-facial recognition, computer vision, people but blinding, blinding to computer, image filters alter

Comments: 21 pages, 10 figures

Abstract:Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice -- because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study ($N=630$) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.

111. 【2605.12929】Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Link: https://arxiv.org/abs/2605.12929

Authors: Yingzhe Ma, Xiao Yang, Yuguo Yin, Zheyu Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: clinicians compare homologous, compare homologous structures, deep models operate, Retinal diagnosis, inherently bilateral

Comments: 10 pages, 3 figures

Abstract:Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into slots and aligning slots across eyes via bidirectional cross-attention. On ODIR-5K with $n=10$ seeds, the method improves AUC by 4.2% over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, $W=0$, $p=0.002$). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis.
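
The reported statistics ($n=10$, $W=0$, $p=0.002$) are consistent with an exact two-sided Wilcoxon signed-rank test in which every one of the ten paired differences has the same sign. The sketch below checks that arithmetic by brute-force enumeration. It assumes no ties or zero differences and that $W$ is the smaller of the two signed-rank sums; the abstract does not state which variant of the test was used, and the function name is illustrative.

```python
from itertools import product

def exact_wilcoxon_two_sided(n, w_observed):
    """Exact two-sided p-value for W = min(W+, W-) with n untied, nonzero differences.

    Under the null hypothesis each of the 2**n sign assignments over the
    ranks 1..n is equally likely, so the p-value is the fraction of
    assignments whose statistic is at most the observed one.
    """
    total = n * (n + 1) // 2
    count = 0
    for signs in product((0, 1), repeat=n):
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w_plus, total - w_plus) <= w_observed:
            count += 1
    return count / 2 ** n

p = exact_wilcoxon_two_sided(10, 0)
print(p)  # 0.001953125, i.e. 2 / 1024
```

With ten seeds, $W=0$ is the most extreme outcome possible, so $p = 2/2^{10} \approx 0.002$ is the smallest two-sided p-value this test can produce at that sample size.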

112. 【2605.12927】ThermalTap: Passive Application Fingerprinting in VR Headsets via Thermal Side Channels

Link: https://arxiv.org/abs/2605.12927

Authors: Mahsin Bin Akram, A H M Nazmus Sakib, OFM Riaz Rahman Aranya, Raveen Wijewickrama, Kevin Desai, Murtuza Jadliwala

Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: highly sensitive personal, process highly sensitive, Standalone virtual reality, physical side channels, remains largely unexplored

Abstract:Standalone virtual reality (VR) headsets process highly sensitive personal, professional, and health-related data, yet their susceptibility to non-contact physical side channels remains largely unexplored. Existing side-channel attacks typically require malicious software execution or physical access to peripherals, making them conspicuous and potentially patchable. This paper introduces ThermalTap, the first passive, non-contact side-channel attack that fingerprints VR applications solely from the long-wave infrared (LWIR) radiation emitted by the headset chassis. By treating a headset's thermal signature as a high-fidelity proxy for internal computational workloads, ThermalTap enables remote application inference at meter-scale distances without any device interaction. To achieve robust performance in real-world settings, the system combines a commodity thermal camera with a multi-modal sensor suite (capturing ambient temperature, humidity, and airflow) to normalize environmental noise. We evaluate ThermalTap using six applications across three commercial standalone headsets. In indoor settings, ThermalTap identifies applications with over 90% accuracy using only 10 seconds of thermal camera data. Under outdoor conditions, with longer session-level observations, several applications remain identifiable despite environmental variability, with the strongest outdoor application reaching 81% accuracy. Our findings establish thermal radiation as a fundamental and unavoidable privacy risk for immersive systems, exposing a critical security gap that bypasses current software-level protections and physical access controls.

113. 【2605.12919】GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2605.12919

Authors: Utae Jeong, Jaewan Choi, Junseok Lee, Jongheon Jeong, Sang Ho Yoon, ByoungSoo Koh, Sangpil Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, dual copyright risk, view synthesis, growing adoption, advances in instruction-driven

Comments: Preprint

Abstract:3D Gaussian Splatting (3DGS) is becoming a practical representation for novel view synthesis, but its growing adoption, together with rapid advances in instruction-driven 3DGS editing, also exposes a dual copyright risk: once a 3DGS-based asset is released, it can be used without permission and manipulated through 3D editing. Existing protection methods address only one side of this problem. Watermarking can trace ownership after unauthorized use, but it cannot prevent malicious editing. Adversarial edit-deterrence methods can disrupt editing, but they do not provide evidence of ownership. To the best of our knowledge, we present the first unified protection framework for 3DGS that jointly optimizes ownership tracing and unauthorized editing deterrence. Our framework combines a scene-wide watermarking objective over all Gaussians with an adversarial objective for edit deterrence. The adversarial branch combines latent-anchor separation, denoising-trajectory diversion, and cross-attention diversion to divert the editing trajectory, while an update-saliency-motivated Gaussian selection strategy assigns stronger adversarial updates to mask-selected Gaussians, improving the balance among watermark recovery, edit deterrence, and rendering fidelity. Experiments on scenes from Mip-NeRF 360 and Instruct-NeRF2NeRF demonstrate that the proposed framework achieves a favorable balance among bit accuracy, edit deterrence, and rendering quality. These results suggest that practical copyright protection of 3DGS-based assets can be more effectively addressed by integrating ownership tracing and unauthorized editing deterrence into a single optimization framework.

114. 【2605.12917】Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification

Link: https://arxiv.org/abs/2605.12917

Authors: One Octadion, Novanto Yudistira, Lailil Muflikhah

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep learning models, creating safety risks, ambiguous diagnostic scenarios, Regularized Adaptive Prediction, Deep learning

Comments: To appear in IEA/AIE 2026 (Springer LNAI)

Abstract:Deep learning models for medical imaging often exhibit overconfidence, creating safety risks in ambiguous diagnostic scenarios. While Conformal Prediction (CP) provides distribution-free statistical guarantees, standard methods such as Regularized Adaptive Prediction Sets (RAPS) optimize for average efficiency and can mask severe failures on difficult inputs. We propose an Adaptive Lambda Criterion for RAPS that minimizes the worst-case coverage violation across prediction set size strata. On OrganAMNIST (58,850 abdominal CT images, 11 classes), standard size-optimized RAPS converges to near-deterministic behavior with stratified undercoverage on uncertain samples, while our method achieves 95.72 percent global coverage with average set size 1.09 and at least 90 percent coverage across all strata. Cross-domain validation on PathMNIST (107,180 pathology images, 9 classes) confirms generalizability. Quantitative Grad-CAM analysis (rho = -0.30, p < 1e-22) shows that multi-label predictions correspond to focused attention on anatomically ambiguous regions. These results demonstrate that the proposed method improves reliability while maintaining efficiency, making it suitable for safety-critical medical AI applications.
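
To make the quantities in this abstract concrete, the following sketch builds RAPS-style prediction sets from softmax outputs and measures the worst coverage shortfall across set-size strata. It is an illustration under stated assumptions rather than the paper's procedure: it uses a deterministic (non-randomized) score, the helper names and the `lam` and `k_reg` parameters are hypothetical, and selecting `lam` by minimizing `worst_stratum_violation` is only one plausible reading of the proposed criterion.

```python
import math

def ranked(probs):
    """Class indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda c: -probs[c])

def raps_score(probs, label, lam, k_reg):
    """Probability mass down to the true label plus a penalty on its rank."""
    order = ranked(probs)
    rank = order.index(label) + 1
    mass = sum(probs[c] for c in order[:rank])
    return mass + lam * max(0, rank - k_reg)

def calibrate(cal_probs, cal_labels, alpha, lam, k_reg):
    """Conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = sorted(raps_score(p, y, lam, k_reg) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    return scores[min(n, math.ceil((n + 1) * (1 - alpha))) - 1]

def predict_set(probs, threshold, lam, k_reg):
    """All classes whose score is within the threshold; the top class is always kept."""
    chosen, mass = [], 0.0
    for rank, c in enumerate(ranked(probs), start=1):
        mass += probs[c]
        if chosen and mass + lam * max(0, rank - k_reg) > threshold:
            break
        chosen.append(c)
    return chosen

def worst_stratum_violation(sets, labels, alpha):
    """Largest shortfall below 1 - alpha coverage over groups of equal set size."""
    by_size = {}
    for s, y in zip(sets, labels):
        hit, n = by_size.get(len(s), (0, 0))
        by_size[len(s)] = (hit + (y in s), n + 1)
    return max(max(0.0, (1 - alpha) - hit / n) for hit, n in by_size.values())
```

A global coverage figure can look healthy while one stratum, for example the few inputs that receive two-class sets, is badly under-covered; the last function exposes exactly that gap.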

115. 【2605.12882】CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Link: https://arxiv.org/abs/2605.12882

Authors: Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Multimodal Large, Large Language Models, supporting evidence unchecked, advanced document understanding

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at this https URL.
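
The idea behind Strict Attributed Accuracy, crediting a prediction only when both the answer and the cited region are right, can be sketched in a few lines. Everything beyond that one rule is an assumption made for illustration: the record fields, exact-match answer comparison, a single citation per question, and the IoU threshold of 0.5 are not taken from the benchmark.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def strict_attributed_accuracy(preds, golds, iou_thresh=0.5):
    """Percentage of items where the answer AND the cited region both match.

    Hypothetical scoring rule: answers are compared case-insensitively, and a
    citation counts as correct when it is on the gold page and overlaps the
    gold box with IoU >= iou_thresh.
    """
    credited = 0
    for p, g in zip(preds, golds):
        answer_ok = p["answer"].strip().lower() == g["answer"].strip().lower()
        citation_ok = p["page"] == g["page"] and iou(p["box"], g["box"]) >= iou_thresh
        credited += answer_ok and citation_ok
    return 100.0 * credited / len(golds)

preds = [
    {"answer": "42", "page": 3, "box": (0, 0, 10, 10)},
    {"answer": "Paris", "page": 1, "box": (0, 0, 10, 10)},
]
golds = [
    {"answer": "42", "page": 3, "box": (0, 0, 10, 10)},
    {"answer": "paris", "page": 1, "box": (50, 50, 60, 60)},
]
print(strict_attributed_accuracy(preds, golds))  # 50.0
```

Both toy answers are correct, so answer-only accuracy would be 100, yet the second prediction cites the wrong region and the joint score drops to 50. That gap is the attribution failure the benchmark is designed to surface.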

116. 【2605.12855】Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

Link: https://arxiv.org/abs/2605.12855

Authors: Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Hannah Williams, Hannah Thompson, J. Joshua Smith, Francisco Sanchez-Vega, Mert R. Sabuncu, Julio Garcia-Aguilar, Harini Veeraraghavan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Clinical trial studies, TREX, trial studies, studies indicate benefit, showing a complete

Comments: 14 Pages, 9 figures, 2 tables

Abstract:Clinical trial studies indicate a benefit of watch-and-wait (WW) surveillance for patients with rectal cancer showing a complete or near clinical response (CR) directly after treatment (restaging). However, there are no objectively accurate methods for early detection of local tumor regrowth (LR) from follow-up exams in patients undergoing WW. Hence, we developed Temporal Rectal Endoscopy Cross-attention (TREX), a longitudinal deep learning approach that combines pairs of images acquired at restaging and follow-up to distinguish CR from LR. TREX uses pretrained Swin Transformers in a siamese setting to extract features from longitudinal images and dual cross-attention to combine the features without spatial co-registration between image pairs. TREX and Swin-based baselines were trained under two settings: (a) detecting LR or CR at the last available follow-up and (b) early detection of LR at 3--6, 6--12, and 12--24 months before clinical confirmation. TREX achieved the highest accuracy in detecting LR with a high sensitivity of 97% $\pm$ 6% and a balanced accuracy of 90% $\pm$ 3%, and outperformed all baselines in early detection at both 3--6 (74% $\pm$ 1%) and 6--12 months (62% $\pm$ 4%) prior to clinical detection. Clinical validation via a surgeon survey showed that TREX matched attending-level overall accuracy (TREX: 86.21% vs. Clinicians: 87.84% $\pm$ 1.28%). Finally, we explored TREX's ability to predict treatment response by combining pre-treatment (pre-TNT) and restaging endoscopies, achieving a balanced accuracy of 73% $\pm$ 12%. These results show that longitudinal deep learning analysis of endoscopy may improve surveillance and enable earlier identification of rectal cancer regrowth.

117. 【2605.12851】PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification

Link: https://arxiv.org/abs/2605.12851

Authors: Larissa Ferreira Rodrigues Moreira, Leonardo Gabriel Ferreira Rodrigues, Rodrigo Moreira, André Ricardo Backes

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Acute Lymphoblastic Leukemia, Lymphoblastic Leukemia, peripheral blood smears, complicate conventional membrane-based, Acute Lymphoblastic

Comments: Paper accepted for publication at the XXVI Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2026), Ouro Preto, MG, Brazil

Abstract:Automated analysis of peripheral blood smears for Acute Lymphoblastic Leukemia (ALL) is hindered by low contrast and substantial variability in cytoplasmic appearance, which complicate conventional membrane-based segmentation. We found that many recent approaches rely on heavy neural architectures and extensive training, but still struggle to generalize across staining and acquisition variability. To address these limitations, we propose the Perinuclear Ring-based Image Segmentation Method (PRISM), which replaces explicit cytoplasmic delineation with adaptive concentric zones constructed around the nucleus. These perinuclear regions enable the extraction of robust cytoplasmic descriptors by integrating color information with texture statistics derived from grey-level co-occurrence patterns, without requiring accurate cell-boundary detection. A calibrated stacking ensemble of traditional classifiers leverages these descriptors to achieve high performance, with an accuracy of 98.46% and a precision-recall AUC of 0.9937.

118. 【2605.12845】AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

Link: https://arxiv.org/abs/2605.12845

Authors: Danrui Li, Jiahao Zhang, Bernhard Egger, Moitreya Chatterjee, Suhas Lohit, Tim K. Marks, Anoop Cherian

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: predicting physically plausible, requires understanding multimodal, parts requires understanding, Assembling objects, physically plausible

Comments: Accepted at CVPR 2026

Abstract:Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.

119. 【2605.12826】FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection

Link: https://arxiv.org/abs/2605.12826

Authors: Kaixiang Zhao, Tianrun Yu, Aoxu Zhang, Junhao Su, Porter Jenkins, Amanda Hughes

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generative artificial intelligence, artificial intelligence models, images increasingly challenging, sophisticated image editing

Comments: Accepted to CVPR 2026 SAFE Workshop

Abstract:The proliferation of sophisticated image editing tools and generative artificial intelligence models has made verifying the authenticity of digital images increasingly challenging, with important implications for journalism, forensic analysis, and public trust. Although numerous forensic algorithms, ranging from handcrafted methods to deep learning-based detectors, have been developed for manipulation detection, individual methods often suffer from limited robustness, fragmented evidence, or weak generalization across manipulation types and image conditions. To address these limitations, we present FRAME, a method for Forensic Routing and Adaptive Multi-path Evidence fusion for image manipulation detection. FRAME organizes diverse forensic algorithms into a multi-path analysis space, adaptively selects informative forensic paths for each input image, and fuses complementary evidence to improve detection and localization performance. By moving beyond single-method analysis and fixed fusion strategies, FRAME provides a more robust and flexible approach to image forensic reasoning while preserving interpretable forensic cues from multiple evidence sources. Experimental results demonstrate the effectiveness of FRAME across diverse manipulation scenarios. Code is available at this https URL.

120. 【2605.12778】Generative Motion In-betweening by Diffusion over Continuous Implicit Representations

Link: https://arxiv.org/abs/2605.12778

Authors: Shiyu Fan, Paul Henderson, Edmond S. L. Ho

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: yielded impressive progress, realistic motion transitions, Recent advances, advances in generative, yielded impressive

Abstract:Recent advances in generative models have yielded impressive progress on motion in-betweening, allowing for more complex, varied, and realistic motion transitions. However, recent methods still exhibit noticeable limitations in preserving keyframe information and ensuring motion continuity. In this paper, we propose a novel pipeline and sampling optimization strategy for latent diffusion models (LDM) based on motion implicit neural representations (INR). By establishing a mapping between INR and sparse spatial or temporal information within latent diffusion, our model can sample the INR parameters from extremely sparse and ambiguous keyframe data and reconstruct plausible and smooth motions from the manifold. Our experiments demonstrate the superior performance of our model, which significantly improves motion generation quality in scenarios with few keyframes while ensuring both keyframe accuracy and diversity of in-between motions.

121. 【2605.12774】WildPose: A Unified Framework for Robust Pose Estimation in the Wild

Link: https://arxiv.org/abs/2605.12774

Authors: Jianhao Zheng, Liyuan Zhu, Zihan Zhu, Iro Armeni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating camera pose, Estimating camera, visual SLAM, SLAM and SfM, SfM methods assume

Abstract:Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.

122. 【2605.12772】Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

Link: https://arxiv.org/abs/2605.12772

Authors: Andreas Maier, Jeta Sopa, Gozde Gul Sahin, Paula Perez-Toro, Siming Bayer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models, soft sponsorship cue, large language models, frontier large language

Comments: Submitted to Workshop on Textual Information Processing Synthesis in the Wild

Abstract:Wu et al. (2026) showed that most frontier large language models (LLMs) recommend a sponsored, roughly twice-as-expensive flight when their system prompt contains a soft sponsorship cue. We reproduce their evaluation on ten open-weight chat models plus the two of their twenty-three models that are still reachable today (gpt-3.5-turbo, gpt-4o). All reported rates in this paper are produced under the same judge the original paper used (gpt-4o); we additionally store every label under an open-weight (gpt-oss-120b) and a smaller proprietary (gpt-4o-mini) judge for an ablation. Three findings emerge. First, a prose description of an LLM evaluation pipeline is not, on its own, sufficient for accurate reproduction: we surfaced three silent implementation failures that each shifted a reported rate by tens of percentage points. Second, the central claims do generalise: the gpt-3.5-turbo logistic-regression intercept of alpha = 0.81 is within four points of the original alpha = 0.86, and 200 of 200 trials on gpt-3.5-turbo and gpt-4o promote a payday lender to a financially distressed user. Third, a thirty-token user prompt that asks the assistant for a neutral comparison table first cuts sponsored recommendation from 46.9% to 1.0% averaged across our ten open-source models, and from 53.0% to 0% averaged across the two OpenAI models. AI literacy and price-comparison portals are likely market-level mitigations; the harmful-product cell is bounded by neither. Raw data, labels and analysis scripts are at this https URL.

123. 【2605.12743】Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving

Link: https://arxiv.org/abs/2605.12743

Authors: Shuo Ju, Qingzhao Zhang, Huashan Chen, Xuheng Wang, Haotang Li, Wanqian Zhang, Feng Liu, Kebin Peng, Sen He

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing physical adversarial, sophisticated physical patch, including biased object, biased object tracking, dynamically changing patches

Abstract:Existing physical adversarial attacks on vision-based autonomous driving induce time-evolving perception errors, including biased object tracking or trajectory prediction, through (i) a sophisticated physical patch that induces detection box drift when it enters the view distance, or (ii) dynamically changing patches that cause different perception errors at different times. In both cases, viewing-angle variation is treated as a challenge, requiring adversarial patches to remain effective across frames under varying views, leading to complex multi-view optimization. In contrast, we show that viewing-angle variation itself can be turned into an attack tool. We design a new attack paradigm where a static, passive adversarial camouflage is mounted on a vehicle whose view-dependent appearance naturally evolves with relative motion, inducing consistent feature drift across frames. This causes the system to infer a physically plausible but incorrect trajectory, such as a false cut-in, which propagates to downstream decision-making and triggers unnecessary braking. Unlike prior approaches that require multi-view robustness or active intervention, our attack emerges from normal driving dynamics and is easy to deploy: a parked vehicle with a natural camouflage can induce hard braking in passing autonomous vehicles. We demonstrate the novel attack on the nuScenes dataset, showing its effectiveness with an end-to-end success rate of up to 87.5%, measured by hard-braking events, and its robustness across different scene backgrounds, victim vehicle speeds, and perception models.

124. 【2605.12725】Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models

Link: https://arxiv.org/abs/2605.12725

Authors: Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent video anomaly, Recent video, video anomaly detection, anomaly detection research, research has expanded

Abstract:Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.

125. 【2605.12724】Inline Critic Steers Image Editing

Link: https://arxiv.org/abs/2605.12724

Authors: Weitai Kang, Xiaohang Zhan, Yizhou Wang, Mang Tik Chiu, Jason Kuen, Kangning Liu, Yan Yan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Instruction-based image editing, editing exhibits heterogeneous, exhibits heterogeneous difficulty, motivating refinement approaches, image editing exhibits

Comments: 9 pages

Abstract:Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation $\rho$ = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.
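
The abstract reports a rank correlation between early-layer and final-layer error maps. As a rough illustration, and assuming the statistic is Spearman's rank correlation (the abstract does not say which rank correlation is used), it can be computed on flattened error maps as sketched here. The helper names and the toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy per-region errors at an early layer and at the final layer.
early_error = [0.1, 0.4, 0.2, 0.9]
final_error = [1.0, 3.0, 2.0, 10.0]
rho = spearman_rho(early_error, final_error)
```

A value near 1 means the regions that look hardest early on are the same ones that end up with the largest final error, regardless of the absolute error scale at each layer.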

126. 【2605.12703】MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

Link: https://arxiv.org/abs/2605.12703

Authors: Yifan Chen, Fei Yin, Qingyan Bai, Zicheng Lin, Yujiu Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: mixed-modality teaching context, multimodal context learning, learning task-local rules, context learning, mixed-modality teaching

Abstract:We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.

127. 【2605.12684】Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Link: https://arxiv.org/abs/2605.12684

Authors: Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Multimodal large language, large language models, large language, routinely deployed, visual understanding

Comments: Project page: [this https URL](https://vab.bakelab.ai). Code: [this https URL](https://github.com/BakeLab/Visual-Aesthetic-Benchmark). Dataset: [this https URL](https://huggingface.co/datasets/BakeLab/Visual-Aesthetic-Benchmark)

Abstract:Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
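
The strict criterion described in the abstract, where a task counts only if both the best and the worst image are identified correctly under every permutation of the candidate order, can be sketched as follows. The data layout (dictionaries with `best`, `worst`, and `predictions` keys) and the function names are assumptions for illustration, not the benchmark's actual format.

```python
def task_solved(gold_best, gold_worst, predictions):
    """True only if every permutation's (best, worst) pick matches the gold labels."""
    return all(best == gold_best and worst == gold_worst for best, worst in predictions)


def strict_accuracy(tasks):
    """Fraction of tasks solved under the all-permutations criterion."""
    solved = sum(task_solved(t["best"], t["worst"], t["predictions"]) for t in tasks)
    return solved / len(tasks)


# Two hypothetical tasks, each evaluated under three candidate orderings.
tasks = [
    {"best": "img_a", "worst": "img_c",
     "predictions": [("img_a", "img_c"), ("img_a", "img_c"), ("img_a", "img_c")]},
    {"best": "img_b", "worst": "img_d",
     "predictions": [("img_b", "img_d"), ("img_b", "img_a"), ("img_b", "img_d")]},
]
accuracy = strict_accuracy(tasks)
```

Requiring agreement across orderings penalizes position bias: a model that picks correctly only when the best image happens to appear first gets no credit for that task.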

128. 【2605.12678】No One Knows the State of the Art in Geospatial Foundation Models

Link: https://arxiv.org/abs/2605.12678

Authors: Isaac Corley, Nils Lehmann, Caleb Robinson, Gabriel Tseng, Anthony Fuller, Hamed Alemohammad, Evan Shelhamer, Jennifer Marcus, Hannah Kerner

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: high-stakes Earth-observation tasks, high-stakes Earth-observation, Earth-observation tasks, Geospatial foundation models, land-cover mapping

Abstract:Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.

129. 【2605.12650】CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

Link: https://arxiv.org/abs/2605.12650

Authors: Yunsung Chung, Alex El Darzi, Carlo El Khoury, Han Feng, Nassir Marrouche, Jihun Hamm

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Foundation diffusion models, imaging remains challenging, generate photorealistic natural, Foundation diffusion, medical imaging remains

Abstract:Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5-34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by evaluation with out-of-family evaluators, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.

130. 【2605.12649】DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery

Link: https://arxiv.org/abs/2605.12649

Authors: Qianxin Xia, Zhiyong Shu, Wenbo Jiang, Jiawei Du, Jielei Wang, Guoming Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: compact proxy dataset, Dataset distillation aims, highly efficient learning, proxy dataset, Dataset

Abstract:Dataset distillation aims to synthesize a compact proxy dataset that is unreadable or non-raw from the original dataset for privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which suffers from learning specific patterns that overfit on a prior architecture, consequently suppressing the expression of semantics and leading to performance degradation across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called DIVER, which leverages the pre-trained diffusion model to dive deeper into DIstilled data Via Expressive semantic Recovery, an entire process of semantic inheritance, guidance, and fusion. Semantic inheritance distills high-level semantics of abstract distilled images into the latent space to filter out architecture-specific "noise" and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Finally, semantic fusion is designed to provide semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while maintaining the guidance information. Extensive experiments validate the effectiveness and efficiency of DIVER in improving classical distillation techniques and significantly improving cross-architecture generalization, requiring processing time comparable to raw DiT on ImageNet (256$\times$256) with only 4 GB of GPU memory usage. Code is available at this https URL.

131. 【2605.12640】MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

Link: https://arxiv.org/abs/2605.12640

Authors: Qing Cheng, Damiano Bertolini, Wei Zhang, Dong Wang, Niclas Zeller, Daniel Cremers

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: placing joint demands, long-range context modelling, efficient dense prediction, amorphous stuff regions, countable thing instances

Comments: ISPRS Congress 2026

Abstract:Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.

132. 【2605.12625】Driving Intents Amplify Planning-Oriented Reinforcement Learning

Link: https://arxiv.org/abs/2605.12625

Authors: Hengtong Lu, Victor Shea-Jay Huang, Chengmin Yang, Pengfei Jing, Jifeng Dai, Yan Xie, Benjin Zhu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: single demonstrated trajectory, semantically distinct alternatives, represent semantically distinct, Continuous-action policies trained, single demonstrated

Comments: Work in progress. Project page: [this https URL](https://mind-omni.github.io/)

Abstract:Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance: even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but also how to expand and preserve the sampling distribution being optimized.
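
The intent-conditioned classifier-free guidance mentioned in the abstract can be illustrated with the standard CFG combination rule. This sketch is not taken from the paper: the function name `cfg_velocity`, the guidance scale `w`, the intent names, and the toy 2-D vectors are all assumptions for illustration.

```python
def cfg_velocity(v_uncond, v_cond, w):
    """Standard classifier-free guidance.

    Extrapolates from the unconditional prediction toward the conditioned one:
    w = 0 gives the unconditional output, w = 1 the conditional output,
    and w > 1 amplifies the effect of the condition.
    """
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]


# Toy velocity predictions from a hypothetical flow-matching action head.
v_uncond = [0.0, 1.0]
v_keep_lane = [0.0, 2.0]    # conditioned on a "keep lane" intent (made-up values)
v_turn_left = [-1.0, 1.0]   # conditioned on a "turn left" intent (made-up values)

guided_keep = cfg_velocity(v_uncond, v_keep_lane, w=2.0)
guided_left = cfg_velocity(v_uncond, v_turn_left, w=2.0)
```

Sampling the same scene under different intent labels pushes the generated trajectories in different directions, which is the mechanism the abstract credits with widening the sampling distribution.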

133. 【2605.12624】MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Link: https://arxiv.org/abs/2605.12624

Authors: Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: progressed from modular, modular pipelines, VLA, Autonomous driving, driving

Comments: Work in progress. Project page: [this https URL](https://mind-omni.github.io/)

Abstract:Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built (as isolated subtask improvements that fail to compose into coherent driving capabilities) rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A streaming design processes the driving video framewise rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture admits fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones via flexible self-attention context management, and exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO's 18 FPS) while preserving natural-language interfaces.

134. 【2605.12623】DocAtlas: Multilingual Document Understanding Across 80+ Languages

Link: https://arxiv.org/abs/2605.12623

Authors: Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: perpetuate existing biases, understanding remains limited, scarce training data, document understanding remains, low-resource languages due

Comments: Under submission

Abstract:Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
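
For context on the DPO objective named in the abstract, here is a minimal sketch of the standard per-example DPO loss. It assumes sequence-level log-probabilities are already available for a preferred ("chosen") and a dispreferred ("rejected") output under both the trained policy and a frozen reference model. The function name, argument names, and the default `beta` are illustrative; how the paper builds its preference pairs is not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the chosen output
    rejected_shift = policy_rejected_logp - ref_rejected_logp  # how much it upweights the rejected output
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Upweighting the chosen output and downweighting the rejected one lowers the loss.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0, beta=0.1)
```

In the setting the abstract describes, the rendering-derived ground truth would play the role of the chosen output; what serves as the rejected output is not stated in the abstract.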

135. 【2605.12622】Action Emergence from Streaming Intent

Link: https://arxiv.org/abs/2605.12622

Authors: Pengfei Jing, Victor Shea-Jay Huang, Hengtong Lu, Jifeng Dai, Xie Yan, Benjin Zhu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: learned scene-action mappings, long-tail traffic scenes, generate physically feasible, formalize action emergence, Streaming Intent

Comments: Work in progress. Project page: [this https URL](https://mind-omni.github.io/)

Abstract:We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
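
The idea of intent-driven classifier-free guidance on a flow-matching head with two denoising steps can be illustrated with a toy example. This is only a sketch under stated assumptions: `toy_velocity`, the intent names, the 2D "trajectory" state, and the Euler integrator are all hypothetical stand-ins, not the SI model.

```python
import numpy as np

# Hypothetical intent targets for a toy 2D state; None means "no intent".
TARGETS = {None: (0.0, 0.0), "left": (-1.0, 0.0), "right": (1.0, 0.0)}


def toy_velocity(x, t, intent):
    """Toy stand-in for a learned velocity field: pull x toward the intent target."""
    return np.asarray(TARGETS[intent], dtype=float) - x


def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return v_uncond + scale * (v_cond - v_uncond)


def sample_trajectory(velocity_fn, intent, x0, steps=2, scale=2.0):
    """Integrate the guided velocity field with `steps` Euler steps from t=0 to t=1."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(x, t, intent)
        v_uncond = velocity_fn(x, t, None)
        x = x + dt * guided_velocity(v_cond, v_uncond, scale)
    return x
```

With `scale=0` the intent is ignored, and with `scale=1` the sampler follows the purely conditional field; larger scales push the sample further toward the chosen intent.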

136. 【2605.12608】A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

Link: https://arxiv.org/abs/2605.12608

Authors: Mohamed Ahmed Mohamed, Xiaowei Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Object detection, scarcity of labelled, significant bottleneck, remains a significant, foggy data remains

Comments: Project code and experimental configs available at [this https URL](https://github.com/mmohamed28/Clear2Fog)

Abstract:Object detection in adverse weather is critical for the safety of autonomous vehicles; however, the scarcity of labelled, real-world foggy data remains a significant bottleneck. In this paper, we propose Clear2Fog (C2F), an end-to-end, physics-based pipeline that simulates fog on clear-weather datasets while ensuring sensor-level consistency across camera and LiDAR. By using monocular depth estimation and a novel atmospheric light estimation method, C2F overcomes structural artifacts and chromatic biases common in existing techniques. A human perceptual study confirms C2F's physical realism, with the generated images being preferred 92.95% of the time over an established method. Utilising a training set of 270,000 images from the Waymo Open Dataset, we conduct an extensive data efficiency study to investigate how environmental diversity influences model robustness. Our findings reveal that models trained on mixed-density fog datasets at 75% scale outperform those trained on fixed-density datasets at 100% scale. Furthermore, we investigate the sim-to-real transfer by fine-tuning pre-trained models on real-world foggy data. We demonstrate that a tenfold increase over the default fine-tuning learning rate successfully overcomes negative transfer from synthetic biases, resulting in a 1.67 mAP improvement over real-only baselines. The C2F pipeline provides a scalable framework for enhancing the reliability of autonomous systems in adverse weather and demonstrates the potential of diverse synthetic datasets for efficient model training.
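
Physics-based fog simulation is commonly built on the atmospheric scattering model, I = J·t + A·(1 − t) with transmission t = exp(−β·d). The sketch below shows that textbook model only. It is an assumption that C2F builds on the same equation, and the function name and default values for `beta` and `airlight` are hypothetical; the paper's depth estimation and atmospheric light estimation are not reproduced here and are simply taken as inputs.

```python
import numpy as np


def add_fog(image, depth_m, beta=0.05, airlight=0.8):
    """Apply homogeneous fog to a clear image using the atmospheric scattering model.

    image:    float array in [0, 1], shape (H, W) or (H, W, C)
    depth_m:  per-pixel depth in metres, shape (H, W)
    beta:     scattering coefficient (larger means denser fog); hypothetical default
    airlight: atmospheric light, scalar or per-channel; hypothetical default
    """
    image = np.asarray(image, dtype=float)
    depth = np.asarray(depth_m, dtype=float)
    transmission = np.exp(-beta * depth)
    if image.ndim == 3:
        transmission = transmission[..., None]  # broadcast over colour channels
    return image * transmission + airlight * (1.0 - transmission)
```

Sampling `beta` from a range rather than fixing it is one simple way to produce the kind of mixed-density training set the abstract compares against fixed-density data.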

137. 【2605.12587】TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

Link: https://arxiv.org/abs/2605.12587

Authors: Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic scene understanding, scene understanding, fundamental to dynamic, dynamic scene, tracking

Comments: Project page and code are available at [this https URL](https://cvlab-kaist.github.io/TrackCraft3r/)

Abstract:Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.

138. 【2605.12586】3D Primitives are a Spatial Language for VLMs

Link: https://arxiv.org/abs/2605.12586

Authors: Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: correct object counts, simpler spatial questions, generate executable code, Vision-language models, exhibit a striking

Comments: (none)

Abstract:Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7x across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own this http URL primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.

139. 【2605.12574】DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

Link: https://arxiv.org/abs/2605.12574

Authors: Hongyi Tang, Zhihao Zhu, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large-scale image-text corpora, Vision-language models, sensitive data, training-data auditing, motivating membership inference

Comments: 23 pages, 8 figures

Abstract:Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.

140. 【2605.12573】Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

Link: https://arxiv.org/abs/2605.12573

Authors: Davide Evangelista, Elena Morotti, Francesco Pivi, Maurizio Gabbrielli

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Diffusion-based posterior sampling, combining learned priors, imaging inverse problems, inverse problems, measurement constraints

Comments: 9 Figures, 9 Tables, Submitted to a conference

Abstract:Diffusion-based posterior sampling (PS) is a leading framework for imaging inverse problems, combining learned priors with measurement constraints. Yet, its standard formulations rely on instantaneous data-consistent estimates, which induce temporal variability in the reverse dynamics. We reinterpret PS from a dynamical perspective, showing that the standard PS update corresponds to a first-order discretization of the diffusion dynamics plus a residual correction capturing the mismatch between the denoised prediction and the data-consistent estimate. A second-order discretization, however, naturally introduces a temporal correction based on the variation of consecutive estimates. Building on this, we propose LAMP, combining the second-order update with the residual correction characterizing a PS technique. LAMP thus inherits a lagged temporal correction, and it can be implemented as a modular plug-in over the PS backbone. We show that LAMP preserves the structure of a posterior sampler, and we perform a one-step risk analysis to characterize when LAMP improves the reverse transition via a bias-variance trade-off. Experiments across multiple imaging tasks demonstrate consistent improvements over strong baselines such as DiffPIR and DDRM, without increasing the number of denoising evaluations.

141. 【2605.12571】VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

Link: https://arxiv.org/abs/2605.12571

Authors: Chenhao Qiu (1), Yechao Zhang (2), Xin Luo (1), Shien Song (1), Xusheng Liu (1) ((1) Mango TV, (2) Nanyang Technological University)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: requires locating sparse, highly redundant content, Long video question, time-scattered visual evidence, long videos introduce

Comments: Accepted to ICML 2026. 33 pages, 13 figures. Code and models are available at [this https URL](https://github.com/Echochef/VideoSEAL)

Abstract:Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at this https URL.

142. 【2605.12570】M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification

Link: https://arxiv.org/abs/2605.12570

Authors: Jinyue Li, Yuzhou Yu, Jingjing Yang, Meng Fu, Yani Zhang, Shuyao He, Dianlong Ge, Xin Ning, Yannan Chu, Qiankun Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lung cancer screening, early lung cancer, remains challenging due, malignant pulmonary nodules, pulmonary nodule classification

Comments: Published in Information Fusion (2026), 15 pages, 5 figures

Abstract:The accurate classification of benign and malignant pulmonary nodules in CT scans is critical for early lung cancer screening, yet remains challenging due to the multi-scale and heterogeneous nature of pulmonary nodules. While deep learning offers potential for auxiliary diagnosis, most existing models act as "black boxes", lacking the transparency and explainability required for trustworthy clinical integration. To address this issue, we propose M3Net, a novel 3D network for pulmonary nodule classification inspired by the hierarchical diagnostic workflow of radiologists, which integrates multi-scale contextual information from fine-grained structures to global anatomical relationships. Our framework constructs a progressive multi-scale input, from fine-grained nodule structures to local semantics and global spatial relationships. M3Net employs scale-specific encoders and ensures cross-scale semantic consistency through latent space projection and mutual information maximization. Extensive experiments on the public LIDC-IDRI dataset and a self-collected clinical dataset (USTC-FHLN) demonstrate that our method achieves state-of-the-art performance, with accuracies of 86.96% and 84.24% respectively, outperforming the best baseline by 3.26% and 2.17%. The results validate that M3Net provides a more robust and clinically relevant solution for pulmonary nodule classification. The code is available at this https URL.

143. 【2605.12567】Pyramid Self-contrastive Learning Framework for Test-time Ultrasound Image Denoising

Link: https://arxiv.org/abs/2605.12567

Authors: Jiajing Zhang, Bingze Dai, Xi Zhang, Yue Xu, Wei-Ning Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: complicates clinical interpretation, speckle noise complicates, noise complicates clinical, complicates clinical, clinical interpretation

Comments: (none)

Abstract:The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods require massive labeled data and model parameters. These pre-defined and pre-trained manners entail an inevitable domain shift in complex in vivo environments, so they are limited to a specific noise type and often blur structural details. In this study, we propose a pure test-time training framework for one-shot ultrasound image denoising and apply it to synthetic aperture ultrasound (SAU), which synthesizes transmit focus from sub-aperture transmissions. Our Aperture-to-Aperture (A2A) framework disentangles anatomical similarity and noise randomness from shuffled sub-apertures through self-contrastive learning in pyramid latent spaces. The clean image is then decoded from the anatomy space, while discarding the noise space. A2A is trained at test time on one noisy sample of SAU signals, so it fundamentally eliminates the domain shift and pretraining costs. Simulation experiments, including electronic noise levels of 0 to 30 dB and different inclusion geometries, demonstrated an improvement of 69.3% SNR and 34.4% CNR by A2A. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. A2A delivers clear images/signals across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization and functional assessment by ultrasound.

144. 【2605.12556】M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2605.12556

Authors: Youssef Aboelwafa, Hicham G. Elmongui, Marwan Torki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-light image enhancement, including amplified noise, Low-light image, complex degradations, including amplified

Comments: Accepted at 2026 IEEE International Conference on Image Processing (ICIP)

Abstract:Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at this https URL

145. 【2605.12550】SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting

Link: https://arxiv.org/abs/2605.12550

Authors: Mingrui Zhang, Hanchen Yang, Wengen Li, Xudong Jiang, Yichao Zhang, Jihong Guan, Shuigeng Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large vision models, surprisingly effective time, time series images, Large vision, time series forecasters

Comments: (none)

Abstract:Large vision models (LVMs) have recently proven to be surprisingly effective time series forecasters, simply by rendering temporal data as images. This success, however, rests on a largely unexamined premise: the rendered time series images are sufficiently close to natural images for knowledge in pre-trained models to transfer effectively. We argue that two gaps still remain, i.e., spectral and structural gaps, fundamentally limiting the potential of LVMs for time series forecasting. Spectrally, we systematically reveal that rendered time series images exhibit a markedly shallower power spectrum than the natural images LVMs are pre-trained to recognize. Structurally, reshaping 1D temporal sequences into 2D grids fabricates spurious spatial adjacencies while severing genuine temporal continuities, misleading the spatial inductive biases of pre-trained LVMs. To bridge these gaps, we propose SSDA, a dual-branch network that spectrally and structurally adapts to unlock the full potential of LVMs for time series forecasting. At the data level, a Spectral Magnitude Aligner (SMA) applies 2D FFT to selectively enhance the magnitude spectrum toward natural-image statistics while preserving phase. At the model level, a Structural-Guided Low-Rank Adaptation (SG-LoRA) injects position-aware temporal encodings into patch embeddings and adapts attention via low-rank updates. The two branches are further adaptively fused to produce the final forecast. Extensive experiments on seven real-world benchmarks demonstrate that SSDA consistently outperforms strong LVM- and LLM-based baselines under both full-shot and few-shot settings. Code is publicly available at this https URL.
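
The general operation of reshaping an image's magnitude spectrum while leaving its phase untouched can be sketched with NumPy's FFT. This is not the SMA module: the radial gain `1 / (1 + r / r0) ** alpha`, the parameter names, and their defaults are assumptions chosen only to steepen the spectrum in a simple, symmetric way.

```python
import numpy as np


def radial_gain(shape, alpha=1.0, r0=0.05):
    """Hypothetical frequency gain that decays with radial frequency r (cycles/pixel).

    The gain is 1 at DC, so the image mean is unchanged, and depends only on |f|,
    so a real input image stays real after filtering.
    """
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    return 1.0 / (1.0 + r / r0) ** alpha


def align_magnitude(image, gain):
    """Rescale the 2D FFT magnitude by `gain` while keeping the phase unchanged."""
    spectrum = np.fft.fft2(image)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    new_spectrum = (magnitude * gain) * np.exp(1j * phase)
    return np.real(np.fft.ifft2(new_spectrum))
```

Setting `alpha=0` gives a gain of one everywhere and returns the input unchanged, which is a convenient sanity check before trying steeper spectra.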

146. 【2605.12549】What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Link: https://arxiv.org/abs/2605.12549

Authors: Jiaping Lin, Fei Shen, Junzhe Li, Ping Nie, Fei Yu, Ming Li, Haizhou Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing training-free approaches, multiple inference runs, Existing training-free, GUI grounding, rely on multiple

Comments: (none)

Abstract:Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at this https URL.

147. 【2605.12545】CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Link: https://arxiv.org/abs/2605.12545

Authors: Zhitong Dong, Chao Li, Jie Yu, Hao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: image cropping aims, aims to enhance, image, image by improving, Aesthetic image cropping

Comments: (none)

Abstract:Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases, lacking adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address the above issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an "analysis-proposal-decision" process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model's decision consistent with human expert aesthetics. Extensive experiments across multiple datasets validate our method's superiority and component effectiveness.

148. 【2605.12528】MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

Link: https://arxiv.org/abs/2605.12528

Authors: Yuting Hu, Lei Zhuang, Chen Wang, Ruiyang Qin, Hua Xiang, Gi-joon Nam, Jinjun Xiong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

Keywords: accurately transferring circuit, transferring circuit patterns, feature sizes shrink, nanometer scale, accurately transferring

Comments: (none)

Abstract:As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly challenging. Optical proximity correction (OPC) is widely used to ensure pattern fidelity and manufacturability. Recent generative mask optimization models based on encoder-decoder architecture can synthesize near-optimal masks, serving as fast machine learning (ML) surrogates for traditional OPC. However, these models often fail to capture the geometric transformations from target layouts to mask patterns, leading to suboptimal quality. In this work, we formulate mask generation as a sequence of morphological operations on local layout features and propose \textit{MorphOPC}, a multi-scale hierarchical model with neural morphological modules to learn these transformations. Experiments on edge-based OPC and ILT benchmarks across metal and via layers show that \textit{MorphOPC} consistently outperforms state-of-the-art methods, achieving higher printing fidelity and lower manufacturing cost, demonstrating strong potential for scalable mask optimization.
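
As background for "morphological operations on local layout features", here is a minimal sketch of the two classical binary operations, dilation and erosion, with a square structuring element. These are the fixed textbook operators, not the learned neural morphological modules of MorphOPC; the function names and the `radius` parameter are assumptions for illustration.

```python
import numpy as np


def dilate(mask, radius=1):
    """Binary dilation: a pixel is set if any pixel in its square window is set."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out


def erode(mask, radius=1):
    """Binary erosion: a pixel stays set only if its whole square window is set.

    Pixels outside the array are treated as background.
    """
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.ones_like(mask)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out &= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out
```

Dilating a target shape grows it outward and eroding shrinks it, so sequences of the two can express simple edge biasing of a layout pattern.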

149. 【2605.12517】Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Link: https://arxiv.org/abs/2605.12517

Authors: Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee (Kim Jaechul Graduate School of AI, KAIST)

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision-language models, Vision-language, Abstract, text-only, Latent Imagination Module

Comments: 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

Abstract:Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality.
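
Calibration error is often reported as the expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy. The abstract does not say which metric the authors use, so the sketch below is an assumption showing the common equal-width-bin ECE; the function name and `n_bins` default are illustrative.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted probability of the chosen answer, values in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence to a bin index; a confidence of 1.0 goes in the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```

A model that is 80% confident and right 80% of the time scores near zero, while a model that is 90% confident and always wrong scores 0.9.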

150. 【2605.12514】Structural Diversity Drives Disruptive Scientific Innovation

Link: https://arxiv.org/abs/2605.12514

Authors: Yichun Peng, Saike He, Peijie Zhang, Kang Zhao, Yi Yang, Ning Zhang, Qingpeng Zhang, Daniel Dajun Zeng, Hao Peng

Categories: Social and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Digital Libraries (cs.DL); Applications (stat.AP)

Keywords: remains poorly understood, breakthrough ideas remains, ideas remains poorly, innovation increasingly depends, fosters breakthrough ideas

Comments: (none)

Abstract:Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known "curse of scale" by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.

151. 【2605.12506】Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Link:https://arxiv.org/abs/2605.12506

Authors:Abdul Basit, Saim Rehman, Muhammad Shafique

Subjects:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords:Realizing on-device ML-based, varying battery-power levels, Realizing on-device, on-device ML-based gesture, tight real-time performance

Comments:7 pages, 11 figures, Accepted to DAC 2026

Abstract:Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
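The run-time selection step described in this abstract can be illustrated with a minimal sketch. This is not the authors' controller: the `AceProfile` fields, the `select_mode` function, the linear battery-to-budget policy, and the "medium" profile values are all hypothetical. Only the 6.9 mJ and 1.6 mJ energy figures, the 0.8-0.9 F1 range, and the 6 ms latency are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AceProfile:
    """One hypothetical Accuracy-Complexity-Energy operating point."""
    name: str
    f1: float          # event-level F1 measured offline
    energy_mj: float   # per-frame energy in millijoules
    latency_ms: float  # mean per-frame latency in milliseconds


def select_mode(profiles, battery_frac, max_latency_ms, full_budget_mj=7.0):
    """Pick the most accurate profile that fits the energy and latency limits.

    The energy budget shrinks linearly with the remaining battery fraction;
    this policy is an assumption made for illustration only.
    """
    budget_mj = full_budget_mj * battery_frac
    feasible = [
        p for p in profiles
        if p.energy_mj <= budget_mj and p.latency_ms <= max_latency_ms
    ]
    if not feasible:
        # Nothing fits the constraints: fall back to the cheapest profile.
        return min(profiles, key=lambda p: p.energy_mj)
    return max(feasible, key=lambda p: p.f1)


# Illustrative profile table; the "medium" row and the 12 ms / 8 ms latencies
# are invented placeholders, not reported results.
PROFILES = [
    AceProfile("large", 0.90, 6.9, 12.0),
    AceProfile("medium", 0.86, 3.2, 8.0),
    AceProfile("small", 0.80, 1.6, 6.0),
]
```

With a full battery and a loose latency limit, the sketch returns the most accurate profile; as the battery fraction drops, it moves to cheaper profiles and finally falls back to the lowest-energy one.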

152. 【2605.13619】DeepFilters: Scattering-Aware Pupil Engineering with Learned Digital Filter Reconstruction for Extended Depth of Field Microscopy

Link:https://arxiv.org/abs/2605.13619

Authors:Joseph L. Greene, Suet Ying Chan, Qilin Deng, Jeffrey Alido, Alexandra Lion, Guorong Hu, Ruipeng Guo, Tongyu Li, Kivilcim Kiliç, Ian Davison, Lei Tian

Subjects:Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)

Keywords:point spread functions, field microscopy encodes, microscopy encodes axial, encodes axial information, engineered point spread

Comments:38 pages (18 main text, 20 supplement), 23 Figures (7 main text, 16 supplement)

Abstract:Extended depth of field microscopy encodes axial information into a single acquisition through engineered point spread functions, but conventional and deep optics approaches are subject to degradation in scattering tissue. We introduce DeepFilters, a scattering-aware deep optics framework that jointly optimizes a parameterized pupil filter and a digital-filter-based reconstruction network through a calibrated differentiable forward model to achieve broad generalization without retraining. Incorporating empirical scattering kernels, physics-guided regularization, and a hybrid genetic-gradient initialization strategy, DeepFilters extends the PSF from 16 micron to 400 micron in clear media and enables signal recovery beyond 120 micron deep in biological tissues, validated across fixed brain slices and sea urchin embryos.

153. 【2605.13146】On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods

Link:https://arxiv.org/abs/2605.13146

Authors:David Iagaru, Nina M. Gottschling, Anders C. Hansen, Josselin Garnier

Subjects:Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords:Artificial intelligence, Earth observation, diagnostics to Earth, medical diagnostics, transformed imaging inverse

Comments:31 pages, 11 figures; code available at [this https URL](https://github.com/davidiagraid/hallucinations_invpb)

Abstract:Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic-looking but incorrect details, undermining their reliability, especially when ground truth data is unavailable. We develop a theoretical framework showing that such hallucinations are not merely artifacts of particular models, but can arise from the ill-posed nature of the inverse problem itself. We derive necessary and sufficient conditions for hallucinations, together with computable bounds on their magnitude that depend only on the forward model. Building on this theory, we introduce algorithms to: (1) estimate the minimum hallucination magnitude achievable by any reconstruction model for a given input; (2) assess the faithfulness of reconstructed details by a given reconstruction model. Experiments across three imaging tasks demonstrate that our approach applies broadly, including to modern generative models, and provides a principled way to quantify and evaluate AI hallucinations.

154. 【2605.13015】A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis

Link:https://arxiv.org/abs/2605.13015

Authors:Tan Su, Ethan Elio Meidinger, Lin Gu, Ruogu Fang

Subjects:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords:remains primarily observational, clinical evidence remains, evidence remains primarily, primarily observational, key biomarker

Comments:33 pages, 6 figures; preprint

Abstract:The geometry of the retinal vessel is a key biomarker of vascular diseases, yet clinical evidence remains primarily observational. Existing generative counterfactuals intervene only at the image-level disease label, failing to isolate explicit anatomical structure. To address this limitation, we propose the Bézier Tree Encoding Counterfactual Framework (BTECF). By abstracting vascular networks into interconnected cubic-Bézier segments, BTECF establishes a disease-agnostic representation in which structural topology is explicitly preserved and atomically perturbable. Coupling this encoding with a diffusion-based generator enables parameter-level do-interventions on explicit geometric axes (e.g., tortuosity, caliber) while preserving background fundus textures. We validate BTECF on diabetic retinopathy, together with independent cohorts for ischemic stroke and Alzheimer's disease. Isolated counterfactual interventions produce dose-responsive shifts in classifier predictions; a matched pixel-drop control attenuates this response by an order of magnitude or more, ruling out out-of-distribution generation artifacts. By enforcing causal isolation between vessel topology and pixel-level confounders, BTECF provides a unified generative paradigm for hypothesis verification across systemic diseases. To support reproducibility, the code will be publicly released upon acceptance.

155. 【2605.12753】Optimization in Sparse 2D to Dense 3D Weakly Supervised Learning: Application to Multi-Label Segmentation of Large ex vivo MRI Data

Link:https://arxiv.org/abs/2605.12753

Authors:Paul Hoareau, Kuan Yi Wang, Brandon Bujak, Roy Sun, Govind Nair, Irene Cortese, Charidimos Tsagkas, Daniel Reich, Julien Cohen-Adad

Subjects:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords:Fully supervised, Matter Lesion Dice, forcing reliance, prohibitive cost, cost of volumetric

Comments:19 pages. Submitted to Machine Learning for Biomedical Imaging (MELBA). Code and models: [this https URL](https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star)

Abstract:INTRODUCTION | Fully supervised 3D segmentation of high-resolution ex vivo MRI is limited by the prohibitive cost of volumetric annotation, forcing reliance on sparse 2D slices. Weakly supervised Sparse-to-Dense frameworks bridge this gap, but guidelines remain ambiguous regarding human-centric visual enhancements and transferring optimization strategies across dimensions. We analyze divergent regularization needs for multi-class segmentation of high-resolution ex vivo spinal cord MRI. METHODS | We used 9.4T MRI of multiple sclerosis spinal cords (104,000 slices) with sparse annotations (428 slices). A 2D Teacher trained on sparse slices generated dense pseudo-labels to train a 3D Student. We systematically evaluated the impact of human-centric preprocessing, spatial augmentation, and soft-label regularization on both architectures. RESULTS | We identified a critical divergence in training dynamics. The 2D Teacher required strong spatial augmentation and soft-labeling to overcome data scarcity, improving White Matter Lesion Dice scores by 11 points. However, propagating these techniques to the 3D Student degraded its performance. Furthermore, human-centric preprocessing (e.g., CLAHE) disrupted global statistical cues, dropping Gray Matter Lesion Dice scores by ~25 points. DISCUSSION | Our study highlights a perception divergence (human-centric contrast enhancement harms machine models) and a regularization conflict across dimensions. 3D architectures trained on dense pseudo-labels exhibit fundamentally different optimization landscapes than 2D counterparts and require distinct, conservative regularization. Code and models: https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star.

156. 【2605.12619】Human face perception reflects inverse-generative and naturalistic discriminative objectives

Link:https://arxiv.org/abs/2605.12619

Authors:Wenxuan Guo, Heiko H. Schütt, Kamila Maria Jozwik, Katherine R. Storrs, Nikolaus Kriegeskorte, Tal Golan

Subjects:Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

Keywords:perceptual representations supporting, recognize faces remain, computational mystery, perceptual representations, representations supporting

Comments:33 pages, 10 figures, 4 tables

Abstract:The perceptual representations supporting our ability to recognize faces remain a computational mystery. Deep neural networks offer mechanistic hypotheses for human face perception, but theoretically distinct models often make indistinguishable representational predictions for randomly sampled faces. To expose diagnostic differences among these hypotheses, we compared six neural network models sharing an architecture but trained on distinct tasks, using face pairs optimized to elicit contrasting model predictions ("controversial" pairs) alongside randomly sampled pairs. We tested model predictions against face-dissimilarity judgments from 864 human participants across stimulus sets differing in realism and pose variation. Models prioritizing high-level, invariant structures (trained via inverse rendering, face identification, or object classification) most robustly matched human judgments. Furthermore, models trained on natural images typically outperformed synthetic-trained counterparts. Together, these findings suggest that human face perception is shaped by mechanisms that infer latent causes of facial appearance, discount nuisance variation, and are tuned by natural image statistics.

157. 【2605.12575】Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL

Link:https://arxiv.org/abs/2605.12575

Authors:Hyun Do Jung, Jungwon Choi, Soojung Choi, Yujin Oh, Hwiyoung Kim

Subjects:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords:strong slide-level AUC, multiple instance learning, Whole-slide image, AUC while leaving, achieve strong slide-level

Abstract:Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.

158. 【2605.12562】Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

Link:https://arxiv.org/abs/2605.12562

Authors:Bo Peng, Wujian Xu, Kun Wang, Ximing Liao, Na Wang, Daqian Shi, Tian Li, Jing Gao, Johan Thygesen, Yingqun Ji, Honghan Wu

Subjects:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords:missing cross-density interactions, imaging captures complementary, existing deep learning, deep learning methods, learning methods fuse

Abstract:Multi-window CT imaging captures complementary pathological information across anatomical structures of differing densities, yet existing deep learning methods fuse representations only at later stages, missing cross-density interactions. We propose a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Evaluated retrospectively on three cohorts - COPD-CT-DF (n=719), RSNA PE (n=1,433), and an in-house CTEPD dataset (n=161) - distillation improved per-window AUC by 10.1-16.5 percentage points on COPD-CT-DF (0.75-0.81 to 0.90-0.94; all P < 0.001), with ensemble AUC reaching 0.9960. Similar gains were observed on RSNA PE (0.80-0.83 to 0.90-0.92) and CTEPD (AUC 0.7481 vs. 0.6264). Cross-window distillation internalises pathological signatures invisible to supervised approaches, offering a generalisable solution for multi-window pulmonary CT analysis.

159. 【2605.12560】Brain Tumor Classification in MRI Images: A Computationally Efficient Convolutional Neural Network

Link:https://arxiv.org/abs/2605.12560

Authors:Md Fahimul Kabir Chowdhury, Jannatul Ferdous

Subjects:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords:Improving patient outcomes, patient outcomes depends, manual MRI scan, MRI scan analysis, Improving patient

Abstract:Improving patient outcomes depends on the prompt and accurate diagnosis of brain tumors, but manual MRI scan analysis is still time-consuming and unreliable. Although deep learning has shown promise, many of the models that are now in use are computationally intensive and have difficulty handling the intrinsic complexity and variety of different types of brain tumors. In this work, we propose a lightweight yet high-performing Convolutional Neural Network (CNN) for multi-class brain tumor classification, employing MRI images to target gliomas, meningiomas, pituitary tumors, and healthy (no tumor) instances. The model was rigorously evaluated on two publicly accessible datasets from Figshare and Kaggle. Leveraging efficient feature extraction and optimized training strategies, our CNN achieved classification accuracies of 99.03% and 99.28%, along with ROC scores of 99.88% and 99.94% on Dataset 1 and Dataset 2, respectively, all while utilizing significantly fewer parameters than popular pre-trained architectures. In contrast to cutting-edge models like DenseNet201, MobileNetV2, VGG19, Xception, InceptionV3, and ResNet50, our approach consistently demonstrated superior performance with reduced computational overhead. These findings highlight the potential of the proposed model as a practical and reliable diagnostic aid in clinical environments.

160. 【2605.08320】Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks

Link:https://arxiv.org/abs/2605.08320

Authors:Marwane Hariat, Antoine Manzanera, David Filliat

Subjects:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords:Monocular depth estimation, ambiguous depth predictions, training approaches struggles, Monocular depth, approaches struggles

Abstract:Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low-texture regions. Our approach jointly estimates pre-semantic contours, depth and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI, Cityscapes, Waymo, NYUv2 and ScanNet, our model demonstrates robust performance, surpassing competing self-supervised methods in MDE.
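To make the idea of a distance transform over contours concrete, here is a minimal sketch, assuming a binary contour mask is already available. It is not the paper's implementation: the function name `distance_transform` is hypothetical, and it uses the city-block (4-connected) metric for simplicity, whereas the abstract does not say which metric the authors use.

```python
from collections import deque


def distance_transform(contours):
    """City-block distance from every pixel to the nearest contour pixel.

    `contours` is a list of rows of 0/1 values, where 1 marks a contour pixel.
    If the mask contains no contour pixel, every entry of the result is None.
    """
    h, w = len(contours), len(contours[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    # Seed the search with all contour pixels at distance 0.
    for y in range(h):
        for x in range(w):
            if contours[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    # Multi-source breadth-first search over 4-connected neighbours.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist
```

In a region with uniform intensity, the resulting map still varies from pixel to pixel, which is the kind of added variance the abstract describes for low-texture areas.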